torognes / vsearch

Versatile open-source tool for microbiome analysis
Other
643 stars 123 forks source link

Obtaining the expected error for each read #551

Closed timothevanmeter closed 6 months ago

timothevanmeter commented 6 months ago

Hello,

I am comparing two programs for the same task, namely filtering reads based on their expected error. However, with the same threshold, 0.01, the number of discarded reads is quite different: 2 millions more reads discarded with vsearch out of 36 millions total.

I would like to understand what is causing this difference and to do so obtain directly the expected error as calculated by vsearch - I already have this for the other program. Specifically, I would like to have the output for each read used to compare to the fastq_maxee argument.

I found this in the code in filter.cc on line 165:

      /* truncate by quality and expected errors (ee) */
      res.ee = 0.0;
      char * q = fastx_get_quality(h) + res.start;
      for (int64_t i = 0; i < res.length; i++)
        {
          int qual = fastq_get_qual(q[i]);
          double e = exp10(-0.1 * qual);
          res.ee += e;

          if ((qual <= opt_fastq_truncqual) ||
              (res.ee > opt_fastq_truncee))
            {
              res.ee -= e;
              res.length = i;
              break;
            }  

Would it be possible - without having to modify too much the C++ code - to output res.ee for each read from vsearch?

Thanks in advance.

PS: I already made sure that the quality was converted from the right format using the option --fastq_ascii 33, so this is not causing the problem, maybe float precision difference between the two programs?

torognes commented 6 months ago

Hi,

Thanks for looking into this. I will be interested to see your results, especially if you find a problem with vsearch.

I think the option --eeout to the command fastq_filter or fastx_filter will give you the information you need, so no need to modify the code. It will include ;ee=number in the headers of the output file, where number is the expected total error for the sequence.

Example:

$printf "@example\nACGT\n+\nIIII\n" | vsearch --fastx_filter - --eeout --quiet --fastaout -

Result:

>example;ee=0.0004000
ACGT

In this case there were four nucleotides of quality 40 (ascii I), indicating an expected error of 0.0001 each. The sum is 0.0004.

I hope this will help you.

timothevanmeter commented 6 months ago

Hello,

Exactly what I was looking for! Thanks a lot! I thought I read this part of the documentation, but not thoroughly enough it seems...

If I find what is causing the difference between the two programs I will report it back here.

timothevanmeter commented 6 months ago

I compared the expected errors for both programs and even though the precision is different the number of discarded reads at the 0.01 threshold is exactly the same. So it must be a mix-up in the versions of the files I am using ; nothing to do with vsearch.

However, vsearch and some other utilities perform the same task in 35 seconds compared to 9 minutes in python! Thank you for your help!

torognes commented 6 months ago

Thanks, good to know!