mbhall88 / rasusa

Randomly subsample sequencing reads or alignments
https://doi.org/10.21105/joss.03941
MIT License

Error: unable to gather read lengths for the first input file #65

Closed: npdungca closed this issue 10 months ago

npdungca commented 11 months ago

Hi. I'm trying to subsample by depth and I'm getting this error:

    ./rasusa -i barcode05_duplex.fastq.gz --coverage 400 --genome-size 243724 -s 100 -o BC05_400x.fq.gz
    [2023-10-02][13:20:16][rasusa][INFO] Target number of bases to subsample to is: 97489600
    [2023-10-02][13:20:16][rasusa][INFO] Gathering read lengths...
    Error: unable to gather read lengths for the first input file

    Caused by:
        0: Failed to parse record
        1: Sequence length is 373 but quality length is 120 (record '5ad0b9e9-94a7-477c-b47f-0963e639d159' at line 1544857)

Thank you for your help.

mbhall88 commented 11 months ago

Sounds like your input FASTQ might have an invalid record. You can confirm this with seqkit by running:

    seqkit seq barcode05_duplex.fastq.gz > /dev/null

The error message from rasusa gives the ID of the read that causes the error (5ad0b9e9-94a7-477c-b47f-0963e639d159) and the line it is on (1544857). You could also run wc -l on the (decompressed) FASTQ; if the number of lines is 1544857 (or thereabouts), it might be that the last read in the file got truncated.
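If seqkit is not at hand, the same check can be sketched in a few lines of Python. This is only an illustration, not part of rasusa or seqkit: the function names open_fastq and find_bad_records are made up for this example, and it assumes a strict four-line-per-record FASTQ (no wrapped sequences). It reports the line number of each suspicious record's header, which may or may not match how rasusa counts lines.

```python
import gzip


def open_fastq(path):
    """Open a FASTQ file as text, decompressing if the name ends in .gz."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def find_bad_records(handle):
    """Yield (header_line_number, read_id, reason) for each malformed record.

    Assumes every record is exactly four lines: header, sequence,
    separator ('+') and quality string.
    """
    line_no = 0
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        line_no += 1
        if not header.strip():
            continue  # tolerate blank lines between or after records
        start = line_no
        # readline() returns "" once the file is exhausted
        seq, plus, qual = (handle.readline() for _ in range(3))
        line_no += 3

        fields = header[1:].split()
        read_id = fields[0] if fields else "?"
        seq_len = len(seq.rstrip("\r\n"))
        qual_len = len(qual.rstrip("\r\n"))

        if not header.startswith("@"):
            yield start, read_id, "header does not start with '@'"
        elif not qual:
            yield start, read_id, "record is truncated (fewer than 4 lines)"
        elif not plus.startswith("+"):
            yield start, read_id, "separator line does not start with '+'"
        elif seq_len != qual_len:
            yield (
                start,
                read_id,
                f"sequence length is {seq_len} but quality length is {qual_len}",
            )
```

To use it, loop over find_bad_records(open_fastq("barcode05_duplex.fastq.gz")) and print each tuple. A problem reported only for the very last record points to a truncated file rather than a corrupt read in the middle. If the gzip stream itself is cut short, Python's gzip module raises EOFError while reading, which is also a sign of truncation.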

npdungca commented 10 months ago

Got it. It seems that the last line got truncated. Thank you so much for patiently answering my queries.