smithlabcode / falco

A C++ drop-in replacement of FastQC to assess the quality of sequence read data
https://falco.readthedocs.io
GNU General Public License v3.0
90 stars 10 forks source link

Data for "Per base sequence quality" section differ from FastQC #25

Open Shelestova-Anastasia opened 2 years ago

Shelestova-Anastasia commented 2 years ago

falco version 0.3.0

Fastqc and falco results are differrent for section "Per base sequence quality"

Per base sequence quality seqtion to compare in results: pbsq_falco.txt pbsq_fastqc.txt

Could not attach sample (25.3 Mb). Let me know if I can provide more information.

Shelestova-Anastasia commented 2 years ago

Now I made small fastq to demonstrate per base quality difference.

Sample: test_quality.fastq.gz

Falco result: falco_fastqc_data.txt

Fastqc result: fastqc_data.txt

See >>Per base sequence quality section, 10-14 group.

falco: 10-14 36.6908 38.0 38.0 38.0 36.0 38.0 fastqc: 10-14 36.11448 37.4 36.6 37.4 34.4 37.4

I see Median, Lower, Quartile, Upper Quartile, 10th Percentile, 90th Percentile are always ints - not doubles. Why the results are different?

guilhermesena1 commented 2 years ago

Hello,

Thank you so much for looking into the differences and my sincere apologies for not reaching out sooner!

We can merge your fix to the main branch if you submit a pull request. The only thing I notice is you are using C notation to cast as double, whereas we use static_cast to encapsulate the casting and not require convoluted parentheses.

In other words the line

y = (double) x;

would be written as

y = static_cast<double>(x);

I can also fix this myself if you submit the PR as-is.

Thank you again for looking into the issue!

Shelestova-Anastasia commented 2 years ago

Hello, @guilhermesena1. I fixed cast. Thank you!

infotmatician commented 1 year ago

Hi, thank you for your effort to fixed this issue @Shelestova-Anastasia @guilhermesena1. I am using Nanopore Sequencing data, and getting the same error in Falco v1.2.1 the per base sequence quality section. The error is the NaN values of FastQC per base sequence quality section differ from the Falco output. I am not sure but Falco assigns the mean values for all Quartiles columns. Is that true? Does falco has threshold for total number of reads on those positions? Best Regards [falco_fastqc_data.txt](https://github.com/smithlabcode/falco/files/9752921/falco_fastqc_data.txt) [fastqc_data.txt](https://github.com/smithlabcode/falco/files/9752922/fastqc_data.txt)

guilhermesena1 commented 1 year ago

Hello,

Thanks for reporting the different output.

I need to look a little further into FastQC's code, but intuitively it doesn't make sense to me that any of the quantiles would be NaNs. If there is at least one read of a given size (or within a size range), then the mean, median, and all quantiles would be well-defined. For example, if there is a single read quality, then all values would be identical. I don't see a case where "NaN" would be the desired outcome.

I'll look into it though.

guilhermesena1 commented 1 year ago

Yeah it looks like FastQC only considers a position to be "valid" if they have at least 100 reads covering that position. I guess we'll have to emulate that functionality as well?

This is the getPercentile() function they use in the module


    private double getPercentile (int minbp, int maxbp, int offset, int percentile) {
        int count = 0;
        double total = 0;

        for (int i=minbp-1;i<maxbp;i++) {
            if (qualityCounts[i].getTotalCount() > 100) {
                count++;
                total += qualityCounts[i].getPercentile(offset, percentile);
            }
        }

        if (count > 0) {
            return total/count;
        }
        return Double.NaN;

    }
infotmatician commented 1 year ago

Thank you for the quick response. I agree with you. We dont need to add. Better idea to show the mean, median, and all quantiles values even if there is a one read. I just did not understand to output of Falco, because all values are the same on some base intervals.

guilhermesena1 commented 1 year ago

It may be a bug, but this is also the correct output if there is only one read within a certain group (very possible in ONT data with high variability of read lengths). If there is only one read of size 50,000, then all quantiles of quality would be the quality value of the 50,000th base of that read, right?

infotmatician commented 1 year ago

Yeah it should be, that's correct you are right. Thanks for the response, and such a great tool