Closed edgardomortiz closed 2 years ago
Hi,
I'll address both the issues you created here if that's ok?
Thank you for bringing this up. I think I didn't configure the Conda metadata correctly, so the default adapters and contaminants are not downloaded when falco is installed. I'll try to fix this in the next few days. In the meantime, you can download the adapters and contaminants lists manually and provide them using the --contaminants and --adapters flags. I apologize for the inconvenience!
That being said, I'm puzzled why the contaminant report is not consistent with FastQC in your case. Would you be able to provide the first 40,000 lines of your input FASTQ file and the contaminants file you are using, so I can try to reproduce the issue and look into it? Thank you!
Thanks! Unfortunately the file is in our lab, I will pass by tomorrow to retrieve it.
I was checking the code and I understand the default adapter list is also present within the code, so in theory falco should be able to find the adapters in the FASTQ files even when the adapter list is not available (am I right? I program mostly in Python, so I might have missed something there).
The contaminant file is the one provided in falco's repository, in the Configuration directory.
Edgardo
Here I attach the FASTQ files and the reports produced by each program. The contaminant list file was the one supplied in this repository. The commands were:
fastqc --nogroup -o fastqc Anthopterus*
falco --nogroup -o falco_c -c contaminants.txt Anthopterus*
Anthopterus-racemosus_LV16228_R2.fq.gz Anthopterus-racemosus_LV16228_R1.fq.gz Anthopterus-racemosus_LV16228_R1_fastqc.zip Anthopterus-racemosus_LV16228_R2_fastqc.zip falco_c.zip
Thank you so much for providing these reports!
Regarding hits in the overrepresented sequences module, I think you just found a feature that needs to be incorporated into falco. I looked at the contaminant and the sequence that FastQC claims to be a contaminant and saw that this is how they overlap:
r1 CATGATCAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCA---------- 50
r2 ----------GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGC 50
             ****************************************
so the suffix of the sequence is the prefix of a contaminant. Currently falco only checks if the sequence is contained in the contaminant, so it does not account for these overlaps. This is useful, I will fix this in the upcoming release for sure!
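To make the distinction concrete, here is a minimal sketch in Python of the two checks, using the two sequences from the alignment above (r1 taken as the overrepresented sequence and r2 as the contaminant, as the description implies). This is not falco's actual code: the function names and the `min_overlap` threshold are assumptions made up for illustration, and the real implementation may define a valid overlap differently.

```python
def is_contained(seq: str, contaminant: str) -> bool:
    """Containment-only check: the whole sequence must occur inside the contaminant."""
    return seq in contaminant


def suffix_prefix_overlap(seq: str, contaminant: str, min_overlap: int = 20) -> int:
    """Length of the longest suffix of `seq` that equals a prefix of `contaminant`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    The default of 20 is an arbitrary illustrative threshold.
    """
    max_len = min(len(seq), len(contaminant))
    # Try the longest possible overlap first and shrink until one matches.
    for k in range(max_len, min_overlap - 1, -1):
        if seq[-k:] == contaminant[:k]:
            return k
    return 0


# r1 and r2 from the alignment above.
sequence = "CATGATCAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCA"
contaminant = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGC"

contained = is_contained(sequence, contaminant)          # False: not a substring
overlap = suffix_prefix_overlap(sequence, contaminant)   # 40: matches the 40 starred columns
print(contained, overlap)
```

With containment alone the sequence is reported as "no hit", while the overlap check recovers the 40-base match marked by the asterisks.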
Regarding adapters, it looks like falco is checking for adapters, but it is simply not finding the Illumina universal adapter that FastQC finds. I'll have to look into this one in more detail.
Cool. Falco does find similar amounts of adapters to FastQC, but only when the adapter list is specified with --adapters. I could send those reports as well if they help to replicate the issue.
I might need the actual reads to reproduce. You are right that falco has hard-coded adapter and contaminant lists inside src/FalcoConfig.cpp, which are identical to the default files inside Configuration, so in the absence of these files the behavior should be identical whether or not you pass the --adapters flag. I double-checked that the sequences and hashes inside src/FalcoConfig.cpp are the same as in Configuration/adapters.txt, and so far I really don't see what could be causing the difference. I'll keep looking.
Hi again, The reads are the same as for the previous test: https://github.com/smithlabcode/falco/files/7127232/Anthopterus-racemosus_LV16228_R2.fq.gz https://github.com/smithlabcode/falco/files/7127235/Anthopterus-racemosus_LV16228_R1.fq.gz
The FastQC command was:
fastqc --nogroup -o fastqc Anthopterus*
And its results are: https://github.com/smithlabcode/falco/files/7127250/Anthopterus-racemosus_LV16228_R1_fastqc.zip https://github.com/smithlabcode/falco/files/7127252/Anthopterus-racemosus_LV16228_R2_fastqc.zip
The Falco commands were (using the adapter list provided in this repository):
falco --nogroup -o falco_defaults Anthopterus*
falco --nogroup -o falco_a -a adapters.txt Anthopterus*
And the Falco results were:
falco_default.zip
falco_a.zip
Thank you so much! I pushed a modification (f3f6f58) of the contaminant identification algorithm that allows partial overlap. In your test case, at least, it identifies the TruSeq contaminants correctly. I still have to test it more thoroughly to make sure I haven't broken anything else. I will look into the adapter issue next.
Thanks again for the quick solution to these issues!
My pleasure! Closing for now but feel free to reopen (I'll do that too) if any datasets do not match FastQC or the expected correct answer.
We compared a FastQC run and a falco 0.2.4 (from bioconda) run. The Overrepresented sequences table shows hit names such as "Truseq adaptor XX" for FastQC, while all overrepresented sequences are shown as "no hit" for falco.
The result is identical when adding --contaminants and the path to the contaminant list file (this file is not shipped with the conda installation).
Thanks
Edgardo