mritchielab / restrander

MIT License

Output fastq from restrander is corrupted #3

Closed by panariellofrancesco 7 months ago

panariellofrancesco commented 7 months ago

Hi!

First and foremost, thank you very much for developing this tool; it has greatly improved my cDNA ONT data analysis.

However, I am running into a problem with the output fastq of restrander in one of my two samples.

The command line I used is the following:

restrander ${reads} ${idSample}_oriented_reads.fastq.gz /restrander/config/PCB109.json

Here ${reads} is the input fastq. I ran restrander on two samples and, for both of them, the log ends in the same way, with no errors reported.

Sample1 log:
Up to record 84100000...
Up to record 84200000...
Up to record 84300000...
Up to record 84400000...
Up to record 84500000...
Up to record 84600000...
Finished restranding!

Sample2 log:
Up to record 137800000...
Up to record 137900000...
Up to record 138000000...
Up to record 138100000...
Up to record 138200000...
Up to record 138300000...
Finished restranding!

However, when I checked the output fastq of the two samples, I found that in Sample2 there are several reads where the length of SEQ differs from the length of QUAL, which prevents me from proceeding.

The check was done as follows:

zcat Sample2_oriented_reads.fastq.gz | paste - - - - | awk -F"\t" '{ if (length($2) != length($4)) print $0 }' | wc -l

And the output is 12945, instead of the expected 0.
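For anyone who wants to see where the mismatches start rather than only count them, the same check can be sketched in Python. This is a minimal sketch that assumes standard four-line FASTQ records (the same assumption the paste/awk pipeline makes); the function name `find_length_mismatches` is illustrative and not part of restrander.

```python
import gzip
import io


def find_length_mismatches(handle):
    """Yield (record_number, header, seq_len, qual_len) for every
    four-line FASTQ record whose SEQ and QUAL lengths differ."""
    record_number = 0
    while True:
        header = handle.readline()
        if not header:  # end of file
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        record_number += 1
        if len(seq) != len(qual):
            yield record_number, header.rstrip("\n"), len(seq), len(qual)


def check_gzipped_fastq(path):
    """Run the check on a gzipped FASTQ file and return all mismatches."""
    with gzip.open(path, "rt") as handle:
        return list(find_length_mismatches(handle))


# Small in-memory demonstration: the second record has a truncated QUAL line.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nII\n")
for mismatch in find_length_mismatches(demo):
    print(mismatch)  # (2, '@r2', 4, 2)
```

On a real file one would call `check_gzipped_fastq("Sample2_oriented_reads.fastq.gz")`. The first record number reported is probably the most informative one, since everything after the point of corruption may be shifted.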

Do you have any idea why this could be happening and how I could fix it? The fact that the tool worked just fine for Sample1 should rule out a problem with the software installation or with the tool in general.

Thank you very much, Francesco

jakob-schuster commented 7 months ago

Hi Francesco,

Thanks for using the software, glad it's useful to you! There's a known bug where, if Restrander encounters a sequence >500kb in length, it silently breaks and the rest of the output is nonsense - maybe that's causing your issue? I've pushed a bug fix, try installing the new version and let me know if that solves it.
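One way to tell whether an input file is affected is to scan it for reads above that length before running Restrander. The following is only a sketch: it assumes four-line FASTQ records and takes ">500kb" to mean more than 500,000 bases; `find_long_reads` is a hypothetical helper, not something Restrander provides.

```python
import gzip
import io


def find_long_reads(handle, max_len=500_000):
    """Yield (record_number, read_id, length) for every four-line FASTQ
    record whose sequence is longer than max_len bases."""
    record_number = 0
    while True:
        header = handle.readline()
        if not header:  # end of file
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        handle.readline()  # quality line, not needed here
        record_number += 1
        if len(seq) > max_len:
            parts = header[1:].split()  # drop '@', keep the first token
            read_id = parts[0] if parts else ""
            yield record_number, read_id, len(seq)


def scan_gzipped_fastq(path, max_len=500_000):
    """Scan a gzipped FASTQ file and return all reads longer than max_len."""
    with gzip.open(path, "rt") as handle:
        return list(find_long_reads(handle, max_len))


# In-memory demonstration with a tiny threshold so the example stays short.
demo = io.StringIO("@short desc\nACGT\n+\nIIII\n@long desc\nACGTACGTAC\n+\nIIIIIIIIII\n")
print(list(find_long_reads(demo, max_len=5)))  # [(2, 'long', 10)]
```

If the scan of the input returns nothing, the length issue is probably not the cause and something else is going on.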

panariellofrancesco commented 7 months ago

Hi Jakob,

Thank you very much for your quick reply. As you suggested, the issue was caused by the length of certain sequences: only one of the two samples (the one that worked) had been filtered to remove reads below Q10, and those low-quality reads can be up to 1 Mb long. After filtering the other sample as well (and updating your tool), everything worked smoothly.
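The filtering itself is not shown in this thread, so the following is only a rough illustration of what such a step could look like. It assumes Phred+33 quality encoding, computes the mean read quality from the averaged error probabilities (one common convention; other tools may differ), and also applies an optional length cap. All names and thresholds are illustrative.

```python
import io
import math


def mean_quality(qual):
    """Mean read quality: average the per-base error probabilities
    (Phred+33 assumed), then convert back to a Phred score."""
    if not qual:
        return 0.0
    probs = [10 ** (-(ord(ch) - 33) / 10) for ch in qual]
    return -10 * math.log10(sum(probs) / len(probs))


def filter_reads(in_handle, out_handle, min_q=10.0, max_len=500_000):
    """Copy four-line FASTQ records with mean quality >= min_q and
    length <= max_len to out_handle. Returns (kept, dropped)."""
    kept = dropped = 0
    while True:
        header = in_handle.readline()
        if not header:  # end of file
            break
        seq = in_handle.readline().rstrip("\n")
        plus = in_handle.readline().rstrip("\n")
        qual = in_handle.readline().rstrip("\n")
        if len(seq) <= max_len and mean_quality(qual) >= min_q:
            out_handle.write(f"{header.rstrip(chr(10))}\n{seq}\n{plus}\n{qual}\n")
            kept += 1
        else:
            dropped += 1
    return kept, dropped


# In-memory demonstration: one good read, one low-quality read, one long read.
source = io.StringIO(
    "@good\nACGT\n+\nIIII\n"
    "@lowq\nACGT\n+\n####\n"
    "@long\nACGTACGT\n+\nIIIIIIII\n"
)
sink = io.StringIO()
print(filter_reads(source, sink, min_q=10.0, max_len=6))  # (1, 2)
```

For real data the two handles would be opened with `gzip.open(path, "rt")` and `gzip.open(path, "wt")`; in practice a dedicated read-filtering tool is likely the more convenient choice.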

Thank you again for your help!