smithlabcode / falco

A C++ drop-in replacement of FastQC to assess the quality of sequence read data
https://falco.readthedocs.io
GNU General Public License v3.0

Overrepresented Sequences shows "no hit" for all sequences #17

Closed edgardomortiz closed 2 years ago

edgardomortiz commented 2 years ago

We compared a FastQC run with a falco 0.2.4 run (installed from bioconda). In the Overrepresented sequences table, FastQC shows hit names such as "Truseq adaptor XX", while falco shows "no hit" for every overrepresented sequence.

The result is the same when passing --contaminants with the path to the contaminant list file (this file is not shipped with the conda installation).

Thanks

Edgardo

guilhermesena1 commented 2 years ago

Hi,

I'll address both the issues you created here if that's ok?

Thank you for bringing this up. I think I didn't configure the Conda metadata correctly, so the default adapters and contaminants are not downloaded when falco is installed. I'll try to fix this in the next few days. In the meantime, you can download the adapters and contaminants lists manually and provide them using the --contaminants and --adapters flags. I apologize for the inconvenience!

That being said, I'm puzzled why the contaminant hits are not consistent with FastQC in your case. Would you be able to provide the first 40,000 lines of your input FASTQ file and the contaminants file you are using, so I can try to reproduce the issue and look into it? Thank you!

edgardomortiz commented 2 years ago

Thanks! Unfortunately the file is in our lab; I will stop by tomorrow to retrieve it.

I was checking the code and I understand the default adapter list is also present within the code, so theoretically falco should be able to find the adaptors in the FASTQ files even when the adaptor list is not available (am I right? I program mostly in python so I might have missed something there).

The contaminant file is the one provided in falco's repository, in the Configuration directory.

Edgardo

edgardomortiz commented 2 years ago

Here I attach the FASTQ files and the reports produced by each program. The contaminant list file was the one supplied in this repository. The commands were:

fastqc --nogroup -o fastqc Anthopterus*
falco --nogroup -o falco_c -c contaminants.txt Anthopterus*

Anthopterus-racemosus_LV16228_R2.fq.gz Anthopterus-racemosus_LV16228_R1.fq.gz Anthopterus-racemosus_LV16228_R1_fastqc.zip Anthopterus-racemosus_LV16228_R2_fastqc.zip falco_c.zip

guilhermesena1 commented 2 years ago

Thank you so much for providing these reports!

Regarding hits in the overrepresented sequences module, I think you just found a feature that needs to be incorporated into falco. I looked at the contaminant and the sequence that FastQC claims to be a contaminant and saw that this is how they overlap:


r1      CATGATCAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCA----------    50
r2      ----------GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGC    50
                  ****************************************

so a suffix of the sequence is a prefix of a contaminant. Currently falco only checks whether the sequence is fully contained in the contaminant, so it does not account for partial overlaps like this one. This is a useful catch; I will fix it in the upcoming release for sure!
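To make the distinction concrete, here is a minimal C++ sketch of a check that accepts both full containment and a suffix/prefix overlap. It is not falco's actual code: the function name `matches_contaminant` and the `min_overlap` threshold (default 20) are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper, NOT falco's implementation.
// Returns true if `seq` is contained in `contaminant`, or if a suffix of
// `seq` with at least `min_overlap` bases equals a prefix of `contaminant`.
// The default threshold of 20 is an arbitrary illustrative value.
bool matches_contaminant(const std::string &seq,
                         const std::string &contaminant,
                         const std::size_t min_overlap = 20) {
  if (seq.empty()) return false;

  // Full containment: the behavior described in the thread as the current one.
  if (contaminant.find(seq) != std::string::npos) return true;

  // Partial overlap: try the longest candidate first, down to min_overlap.
  const std::size_t max_len = std::min(seq.size(), contaminant.size());
  for (std::size_t len = max_len; len >= min_overlap && len > 0; --len) {
    // Compare the last `len` bases of seq with the first `len` of contaminant.
    if (seq.compare(seq.size() - len, len, contaminant, 0, len) == 0)
      return true;
  }
  return false;
}
```

With the two sequences from the alignment above, `matches_contaminant(r1, r2)` returns true because the last 40 bases of r1 equal the first 40 bases of r2, whereas a plain containment test would report no hit. A real implementation would probably also need to handle the mirrored case (a prefix of the sequence matching a suffix of the contaminant), which this sketch leaves out.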

Regarding adapters, it looks like falco is checking for adapters, but it is simply not finding the Illumina universal adapter that FastQC finds. I'll have to look into this one in more detail.

edgardomortiz commented 2 years ago

Cool. Falco does find similar amounts of adaptors as FastQC, but only when specifying the adaptor list with --adapters. I could send those reports as well if they help to reproduce the issue.

guilhermesena1 commented 2 years ago

I might need the actual reads to reproduce. You are right that falco has hard-coded adapter and contaminant lists inside src/FalcoConfig.cpp, which are identical to the default files inside Configuration, so even in the absence of these files the behavior should be the same whether or not you pass the --adapters flag. I double checked whether the sequences and hashes inside src/FalcoConfig.cpp are the same as in Configuration/adapters.txt, and so far I really don't see what could be causing the difference. I'll keep looking.

edgardomortiz commented 2 years ago

Hi again, the reads are the same as for the previous test:
https://github.com/smithlabcode/falco/files/7127232/Anthopterus-racemosus_LV16228_R2.fq.gz
https://github.com/smithlabcode/falco/files/7127235/Anthopterus-racemosus_LV16228_R1.fq.gz

The FastQC command was:

fastqc --nogroup -o fastqc Anthopterus*

And its results are:
https://github.com/smithlabcode/falco/files/7127250/Anthopterus-racemosus_LV16228_R1_fastqc.zip
https://github.com/smithlabcode/falco/files/7127252/Anthopterus-racemosus_LV16228_R2_fastqc.zip

The Falco commands were (using the adapter list provided in this repository):

falco --nogroup -o falco_defaults Anthopterus*
falco --nogroup -o falco_a -a adapters.txt Anthopterus*

And the Falco results were: falco_default.zip falco_a.zip

guilhermesena1 commented 2 years ago

Thank you so much! I pushed a modification ( f3f6f58 ) to the contaminant identification algorithm that allows partial overlap. In your test case, at least, it identifies the TruSeq contaminants correctly. I still have to test it more thoroughly to make sure I haven't broken anything that previously worked. Will look into the adapter issue next.

edgardomortiz commented 2 years ago

Thanks again for the quick solution to these issues!

guilhermesena1 commented 2 years ago

My pleasure! Closing for now but feel free to reopen (I'll do that too) if any datasets do not match FastQC or the expected correct answer.