mbhall88 / rasusa

Randomly subsample sequencing reads or alignments
https://doi.org/10.21105/joss.03941
MIT License

illumina read input error #60

Closed IndreNav closed 1 year ago

IndreNav commented 1 year ago

Hello, I'm trying to subsample Illumina paired-end data but keep getting an error. Only one of the output files (sub_RHB02_R1_001.fastq) is created, and it is empty (0 bytes). I'm using rasusa version 0.6.1; the command and the error are below. I'm not sure what exactly the issue is or how to proceed.

$ rasusa -i RHB02_R1_001.fastq.gz RHB02_R2_001.fastq.gz -b 651412654 -s 1 -o sub_RHB02_R1_001.fastq -o sub_RHB02_R2_001.fastq
[2022-11-16][13:15:02][rasusa][INFO] Two input files given. Assuming paired Illumina...
[2022-11-16][13:15:02][rasusa][INFO] Target number of bases to subsample to is: 651412654
[2022-11-16][13:15:02][rasusa][INFO] Gathering read lengths...
[2022-11-16][13:16:51][rasusa][INFO] Gathering read lengths for second input file...
[2022-11-16][13:18:19][rasusa][ERROR] First input has 92613254 reads, but the second has 92613254 reads. Paired Illumina files are assumed to have the same number of reads. The results of this subsample may not be as expected now.

mbhall88 commented 1 year ago

Could you please try again with the latest version (0.7.0) and check if the same thing happens?

At the end of the day though, as the error message says, your paired files have a different number of reads. The way rasusa works is that it (randomly) picks reads from the first file until they add up to half the requested number of bases, and then takes the mate of each picked read from the other file. To do this, it assumes the reads are in the same order in both files (which is standard). If you have a different number of reads in each file, this would suggest your order may also be off.
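To make the idea above concrete, here is a minimal Python sketch of that selection step. It is only an illustration, not rasusa's actual code: the function name `pick_pair_indices` and the toy read lengths are made up for this example, and the real implementation may differ in its details.

```python
import random


def pick_pair_indices(r1_lengths, target_bases, seed=1):
    """Pick read-pair indices using only the read lengths of the first file.

    Hypothetical helper for illustration; not part of rasusa.
    """
    rng = random.Random(seed)
    order = list(range(len(r1_lengths)))
    rng.shuffle(order)  # random order in which reads are considered

    half_target = target_bases / 2  # the first file supplies half the bases
    picked, total = [], 0
    for idx in order:
        if total >= half_target:
            break
        picked.append(idx)
        total += r1_lengths[idx]

    # The same indices are then used to pull the mates from the second file,
    # which only works if both files list the reads in the same order.
    return sorted(picked)


# Toy data: ten 150 bp reads in the first file, 900 bases requested in total.
r1_lengths = [150] * 10
keep = pick_pair_indices(r1_lengths, target_bases=900, seed=1)
```

Because the mates are taken by position, a read that is missing from one file shifts every pair after it, which is why mismatched files are a problem.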

You could use seqkit pair to remove your unpaired reads.
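If you want to check the read counts yourself first, something like the Python sketch below would do it. It assumes standard four-line FASTQ records, and the function names are made up for this example. Note that matching counts alone do not prove the reads are in the same order.

```python
import gzip


def count_fastq_records(path):
    """Count records in a FASTQ file, assuming four lines per record."""
    # Read gzip-compressed files transparently based on the file extension.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4


def same_read_count(r1_path, r2_path):
    """Return True if both FASTQ files contain the same number of records."""
    return count_fastq_records(r1_path) == count_fastq_records(r2_path)
```

For the files in this issue, the call would be `same_read_count("RHB02_R1_001.fastq.gz", "RHB02_R2_001.fastq.gz")`.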

mbhall88 commented 1 year ago

Closed due to no reply. Feel free to reopen if the problem persists.