mbhall88 / rasusa

Randomly subsample sequencing reads or alignments
https://doi.org/10.21105/joss.03941
MIT License

Error: unable to gather read lengths for the first input file #65

Closed by npdungca 1 year ago

npdungca commented 1 year ago

Hi. I'm trying to subsample by depth and I'm getting this error:

./rasusa -i barcode05_duplex.fastq.gz --coverage 400 --genome-size 243724 -s 100 -o BC05_400x.fq.gz
[2023-10-02][13:20:16][rasusa][INFO] Target number of bases to subsample to is: 97489600
[2023-10-02][13:20:16][rasusa][INFO] Gathering read lengths...
Error: unable to gather read lengths for the first input file

Caused by:
    0: Failed to parse record
    1: Sequence length is 373 but quality length is 120 (record '5ad0b9e9-94a7-477c-b47f-0963e639d159' at line 1544857)

Thank you for your help.

mbhall88 commented 1 year ago

It sounds like your input FASTQ contains an invalid record. You can confirm this with seqkit by running:

seqkit seq barcode05_duplex.fastq.gz > /dev/null

The error message from rasusa gives the ID of the read that causes the error (5ad0b9e9-94a7-477c-b47f-0963e639d159) and where it is in the file (line 1544857). You could also run wc -l on the decompressed FASTQ: if the number of lines is 1544857 (or thereabouts), the last read in the file has probably been truncated.
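If you would rather locate the broken record directly, the check can be sketched in a few lines of Python. This is only an illustration, not part of rasusa or seqkit: the function names `find_bad_fastq_records` and `check_fastq` are made up for this example, and it assumes the common four-lines-per-record FASTQ layout (no wrapped sequences).

```python
import gzip


def find_bad_fastq_records(handle):
    """Yield (first_line_number, read_id, problem) for each malformed record.

    Assumes every record is exactly four lines: header, sequence, '+', quality.
    """
    line_no = 0
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        line_no += 1
        if not header.strip():
            continue  # tolerate blank lines between records
        start = line_no
        rest = [handle.readline() for _ in range(3)]
        line_no += sum(1 for line in rest if line)
        seq, plus, qual = (line.rstrip("\r\n") for line in rest)
        fields = header[1:].split()
        read_id = fields[0] if fields else ""
        if not header.startswith("@"):
            yield start, read_id, "header does not start with '@'"
        elif not rest[2]:
            yield start, read_id, "record is truncated (fewer than 4 lines)"
        elif not plus.startswith("+"):
            yield start, read_id, "third line does not start with '+'"
        elif len(seq) != len(qual):
            yield start, read_id, (
                f"sequence length is {len(seq)} but quality length is {len(qual)}"
            )


def check_fastq(path):
    """Open a plain or gzipped FASTQ file and list its malformed records."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return list(find_bad_fastq_records(handle))
```

Calling `check_fastq("barcode05_duplex.fastq.gz")` would return one tuple per bad record. If the only entry is the very last record in the file, that points to a truncated file rather than scattered corruption.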

npdungca commented 1 year ago

Got it. It seems that the last line got truncated. Thank you so much for patiently answering my queries.