merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
433 stars 144 forks source link

[BUG] no reads passing iu-filter-quality-minoche #2185

Closed pedres closed 9 months ago

pedres commented 10 months ago

Short description of the problem

No sequence passes the filter step of iu-filter-quality-minoche

anvi'o version

Keep the header of this section, but replace this text with the output of this command in your terminal:

anvi-self-test --version

Anvi'o .......................................: marie (v8) Python .......................................: 3.10.13

Profile database .............................: 38 Contigs database .............................: 21 Pan database .................................: 16 Genome data storage ..........................: 7 Auxiliary data storage .......................: 2 Structure database ...........................: 2 Metabolic modules database ...................: 4 tRNA-seq database ............................: 2

System info

ubuntu 22.04 Installed anvio following intr

Detailed description of the issue

I was filtering some samples before starting the assembly. However, after looking at how was the process was going on I noticed that the QUALITY_PASSED fastqs were empty. So I chose one sample and tried to quality-filter it with minoche and bokulich options. As expected no read passed the minoche filter, and the majority of them passed it the bokulich default filter and the bokulich "modified as minoche" filter. I tried this last option triying to mimic the minoche filter but I am not sure if it is correct. So I do not know whether my reads are really bad atteding to minoche filtering or there is a problem with this filtering command.

iu-filter-quality-minoche CT.ini --ignore-deflines

number of pairs analyzed : 40795034 total pairs passed : 0 (%0.00 of all pairs) total pair_1 trimmed : 0 (%0.00 of all passed pairs) total pair_2 trimmed : 0 (%0.00 of all passed pairs) total pairs failed : 40795034 (%100.00 of all pairs) pairs failed due to pair_1 : 0 (%0.00 of all failed pairs) pairs failed due to pair_2 : 0 (%0.00 of all failed pairs) pairs failed due to both : 40795034 (%100.00 of all failed pairs)

iu-filter-quality-bokulich CT.ini --ignore-deflines

number of pairs analyzed : 40795034 total pairs passed : 35528514 (%87.09 of all pairs) total pair_1 trimmed : 0 (%0.00 of all passed pairs) total pair_2 trimmed : 0 (%0.00 of all passed pairs) total pairs failed : 5266520 (%12.91 of all pairs) pairs failed due to pair_1 : 845044 (%16.05 of all failed pairs) pairs failed due to pair_2 : 2561539 (%48.64 of all failed pairs) pairs failed due to both : 1859937 (%35.32 of all failed pairs)

iu-filter-quality-bokulich -q 0 -r 0 CT.ini --ignore-deflines

number of pairs analyzed : 40795034 total pairs passed : 40731724 (%99.84 of all pairs) total pair_1 trimmed : 0 (%0.00 of all passed pairs) total pair_2 trimmed : 0 (%0.00 of all passed pairs) total pairs failed : 63310 (%0.16 of all pairs) pairs failed due to pair_1 : 7427 (%11.73 of all failed pairs) pairs failed due to pair_2 : 49038 (%77.46 of all failed pairs) pairs failed due to both : 6845 (%10.81 of all failed pairs)

Files / commands to reproduce the issue

Having reproducible files can help us a lot to fix an issue. So please consider sharing with us files to reproduce the issue (i.e., a contigs database, a profile database, an input file, a BAM file, etc).

Put them in a single directory, compress the directory, upload it to Dropbox (or any other comparable service), and replace the text here with with a direct download link along with instructions, such as exact command line instructions, on how to reproduce the error.

If you don't have files to share, please remove this section along with the header.

meren commented 10 months ago

This is quite weird indeed. Would you be willing to send me a download link for the first 10,000 lines in your R1 and R2 files for CT?

pedres commented 10 months ago

To get the files I did

head -n 10000 SRR16797976_1.fastq > SRR16797976_RED_1.fastq head -n 10000 SRR16797976_2.fastq > SRR16797976_RED_2.fastq

https://drive.google.com/drive/folders/1eyJ4mW4a0oJZO1c0u_zIWf5gx6G9eft5?usp=sharing

Thanks a lot for the help.

meren commented 9 months ago

Dear @pedres,

Today I run into the problem you describe here in another context, which forced me to look a bit deeper into this problem.

It turns out, for some odd reason, the vast majority of bases that are generated by NovaSeq seem to have a quality score of 30. Q30 used to be seen as the least acceptable quality score, but since now almost all bases are at Q30 (denoted with ?), iu-filter-quality-minoche fails all of them.

I change the expectation from q > 30 to q >= 30, and now all of those reads that were previously being failed by iu-filter-quality-minoche are kept. I'm very sorry for not figuring this out earlier.

I now have made a new release, and you can simply run this in your environment to fix this issue:

pip install --upgrade illumina-utils