smithlabcode / falco

A C++ drop-in replacement of FastQC to assess the quality of sequence read data
https://falco.readthedocs.io
GNU General Public License v3.0

Non-identical length distribution #42

Open schorlton opened 1 year ago

schorlton commented 1 year ago

Both reports below were generated from the same file. I'm running falco v1.2.1 from bioconda and MultiQC 1.12. The difference can be reproduced by running on nanopore data from SRA with long read lengths.

MultiQC report of FastQC: image

MultiQC report of falco: image

I believe falco calculates the length distribution for every individual length, while FastQC writes a binned histogram to fastqc_data.txt. Which is better? The granularity and detail are nice, but they can also obscure the plot. Should falco reproduce FastQC's behaviour or perform some kind of binning of read lengths? Interested in your thoughts.
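To make the trade-off concrete, here is a minimal sketch of the two behaviours being compared: counting every length separately (bin width 1) versus grouping lengths into fixed-width bins. The function name `bin_read_lengths`, the fixed `bin_width` parameter, and the sample lengths are all assumptions for illustration; this is not the binning rule that FastQC or falco actually uses.

```python
from collections import Counter


def bin_read_lengths(lengths, bin_width):
    """Group read lengths into fixed-width bins.

    Hypothetical helper: returns {(bin_start, bin_end): read_count},
    sorted by bin start. A bin_width of 1 gives one entry per length.
    """
    if bin_width < 1:
        raise ValueError("bin_width must be >= 1")
    counts = Counter()
    for length in lengths:
        start = (length // bin_width) * bin_width
        counts[(start, start + bin_width - 1)] += 1
    return dict(sorted(counts.items()))


# Made-up read lengths, for illustration only.
lengths = [120, 130, 980, 1010, 1500, 7400, 7450]

per_length = bin_read_lengths(lengths, 1)   # one entry per distinct length
binned = bin_read_lengths(lengths, 1000)    # coarse histogram

print(len(per_length), "rows without binning")
print(binned)
```

With a bin width of 1 every distinct length becomes its own row, which is what makes the line plot so dense for long-read data; a wider bin trades that detail for a readable trend.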

andrewdavidsmith commented 1 year ago

@schorlton I personally am not sure which I think is "better". More info is rarely bad. But I'm definitely interested in your opinion on "better" in a general sense. Input is always appreciated! We will definitely consider any suggested change or enhancement.

schorlton commented 1 year ago

Thanks for the quick response! I tend to agree that more info is better. However, this is a somewhat breaking change for use with MultiQC (which I expect many falco users also use). What I would possibly suggest is a PR to MultiQC to smooth the line, or to format this plot as a bar graph (basically a very granular histogram) instead of a line graph. It is hard to tell what that would look like before it is implemented, and it would need to work with both tools, but it seems the trend is more important than the individual sizes. Between 0 and ~7500 bp on the plot above, the line is definitely too thick to be useful. The alternative would be to reproduce FastQC's behaviour, or something closer to it than a bin size of 1 for the read length distribution.
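As a rough illustration of the "smooth the line" idea, the sketch below applies a centered moving average to per-length counts. It is only an assumption about what such smoothing could look like: `smooth_counts`, its `window` parameter, and the sample counts are hypothetical, and nothing here reflects MultiQC's actual plotting code.

```python
def smooth_counts(counts_by_length, window):
    """Centered moving average over a dense length axis.

    Hypothetical helper: counts_by_length maps read length -> read count.
    Lengths missing from the input are treated as having a count of 0.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    if not counts_by_length:
        return {}
    lo, hi = min(counts_by_length), max(counts_by_length)
    half = window // 2
    smoothed = {}
    for length in range(lo, hi + 1):
        total = sum(
            counts_by_length.get(k, 0)
            for k in range(length - half, length + half + 1)
        )
        smoothed[length] = total / window
    return smoothed


# Made-up, spiky per-length counts, for illustration only.
counts = {100: 3, 101: 0, 102: 9, 103: 0, 104: 3}
smoothed = smooth_counts(counts, 3)
print(smoothed)
```

A wider window flattens the spikes further, at the cost of hiding genuinely narrow peaks, which is the same detail-versus-readability trade-off as choosing a bin size.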

andrewdavidsmith commented 1 year ago

We'll see how to take a first stab at this and leave this issue open until we can say something on it.

guilhermesena1 commented 1 year ago

Hello,

When implementing the sequence length module, I made a somewhat executive decision not to group the lengths, because I assumed that, in any long-read dataset (where this module is often relevant), the number of reads would never generate gigantic bar plots.

That said, this was a bad decision. It's not our call to decide on the behavior of the module, but rather to emulate it faithfully. I'll work on creating base groups for this module. It will be disabled if --nogroup is provided, but I worry a bit that if someone wants to not group length but group, say, sequence content (which generates very large plots), they can't detach one from the other. We might need to add some more falco-specific flags to add this functionality to only group certain modules.

If I may ask: I'm very curious about what additional insights MultiQC provides that are not already available in falco's HTML output. One of our goals in making falco was modernizing the FastQC plots, which I believe is similar to what MultiQC provides. In that spirit, the falco HTML plots for sequence lengths are bar plots, like you suggested (and I fully agree).

Is MultiQC advantageous in this case because you can merge QC metrics for multiple datasets? Or create customized tools for additional summary statistics beyond what FastQC provides?

schorlton commented 1 year ago

Hi @guilhermesena1, sorry for the very late reply and thank you for your input and work on Falco. I look forward to a solution that facilitates integration with MultiQC. MultiQC aggregates reports from many tools which go well beyond read quality control (see MultiQC Modules). The functionality of MultiQC and HTML reports from FastQC/Falco are hard to compare as they are different in aim and scope. Thanks again!