Open sjanssen2 opened 5 years ago
The offset of having the primer still in my Deblur feature table did bias this analysis a lot :-/
This is interesting. Do you have examples of this? How much is "a lot", and what was the bias?
Thanks!
Regarding the 11-mix from the paper: there was no difference. We could not find any feature sequence from their genomes in our study. However, for an extended list of 40 genomes, we found 4 matching sequences in our study when primers were trimmed away, compared to 0 when they were not. That said, the 4 matching features are quite unspecific for the taxon / strain (thousands of 100% identity matches when BLASTing against NR, with hundreds of different taxonomy IDs). Still, inconsistent primer absence / presence can ruin promising search engines like redbiom.
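To illustrate why exact matching can drop to 0 when the primer is left in, here is a minimal toy sketch. The primer, the amplicon and the trim length below are made-up placeholders (not real sequences and not the 100 nt of the actual context); the only point is that a fixed-length feature that still starts with the primer is shifted relative to a primer-trimmed reference fragment.

```python
# Toy illustration only: PRIMER and AMPLICON are made-up placeholders,
# and TRIM_LENGTH stands in for the fixed trim length of a Deblur context.
PRIMER = "GTGCCAGCAGCC"
AMPLICON = "TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAG"
TRIM_LENGTH = 20


def feature(read: str, trim_length: int = TRIM_LENGTH) -> str:
    """Truncate a read to a fixed length, as a fixed-length feature would be."""
    return read[:trim_length]


# Reference fragment extracted in silico from a genome (primer not included).
reference_feature = feature(AMPLICON)

# The same molecule, sequenced with and without the primer at the 5' end.
feat_untrimmed = feature(PRIMER + AMPLICON)
feat_trimmed = feature(AMPLICON)

print(feat_trimmed == reference_feature)    # exact match is possible
print(feat_untrimmed == reference_feature)  # shifted by len(PRIMER), no exact match
```

With the primer retained, only `TRIM_LENGTH - len(PRIMER)` positions of the feature come from the amplicon itself, so an exact-match lookup against primer-trimmed features cannot succeed.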
Has this issue been resolved?
Note that I'm seeing some really weird results when I pull down all of the human fecal samples with redbiom. Namely, if I run
ctx="Deblur-Illumina-16S-V4-100nt-fbc5b2"
redbiom search metadata "(feces | fecal | faecal | stool) & (sapien | human)" > human_ids.txt
cat human_ids.txt | redbiom fetch samples --context $ctx --output human-stool-deblur.biom
I get a deblurred table with 993083 taxa.
The weird part about this is that if I run closed-reference OTU picking with vsearch, I only get 34977 hits (there are 599482 unmatched sequences). I wonder if a lack of primer trimming is inflating the miss rate.
The issue hasn't been resolved. Note that this specific issue is about warning users of a potential problem with wet-lab methods that do not exclude the linker primer from the sequences.
Anyway, a few questions to try to help you with your specific question:
Right on, thanks for responding then - maybe I should move this over to a new issue? Regarding your question
The command that I used is given as follows
qiime vsearch cluster-features-closed-reference --i-reference-sequences ~/Documents/databases/97_seqs.qza --i-sequences human-stool-deblur-seqs.qza --i-table human-stool-deblur.biom.qza --p-perc-identity .97 --output-dir closed-reference --p-threads 30
No problem! Open a new issue? I guess this is fine; next time, I'd suggest sending an email to qiita.help@gmail.com.
--p-strand both
@mortonjt, is it that 599482 unique features from the input deblur table failed to recruit to Greengenes, or a total of 599482 reads failed to recruit? What's the total % of reads lost following recruitment, and what fraction of the features failing to recruit are singletons or doubletons?
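The quantities asked about here could be computed along the following lines. This is a minimal sketch on plain Python data, not a QIIME 2 or redbiom API: `recruitment_summary`, `feature_counts` (total reads per feature) and `matched_ids` (features that recruited to the reference) are hypothetical names, and the numbers are toy values.

```python
def recruitment_summary(feature_counts, matched_ids):
    """Summarise what is lost after closed-reference recruitment.

    feature_counts: dict mapping feature ID -> total read count over all samples
    matched_ids:    set of feature IDs that recruited to the reference
    """
    total_reads = sum(feature_counts.values())
    unmatched = {f: n for f, n in feature_counts.items() if f not in matched_ids}
    lost_reads = sum(unmatched.values())
    # Singletons and doubletons: unmatched features with at most 2 reads in total.
    rare = sum(1 for n in unmatched.values() if n <= 2)
    return {
        "unmatched_features": len(unmatched),
        "pct_reads_lost": 100.0 * lost_reads / total_reads if total_reads else 0.0,
        "frac_unmatched_rare": rare / len(unmatched) if unmatched else 0.0,
    }


# Toy example (made-up counts, not the numbers from this thread).
counts = {"f1": 900, "f2": 50, "f3": 1, "f4": 2, "f5": 47}
matched = {"f1", "f5"}
summary = recruitment_summary(counts, matched)
print(summary)
```

Reporting both the number of unmatched features and the share of reads they carry separates "many rare features failed to recruit" from "a large fraction of the data failed to recruit".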
A few considerations:
--p-min-reads 1
so singletons are retained.
I'm not seeing evidence of a quality control issue or a problem, but rather that there are unknowns. I'm not sure whether this deviates from expectation, as I don't think there has been much investigation (that I'm aware of, at least) into applying closed reference after Deblur.
These are the former - 599482 unique features from the deblur table that failed to recruit to Greengenes.
If quality control isn't an issue, then I think this is really exciting! I didn't realize how many novel bugs weren't represented in GG.
The recurrence of features across samples is exciting. Do these recruit to Greengenes at, say, 90%?
There is a discussion about why to trim or not to trim primers prior to DADA2 or Deblur: https://forum.qiime2.org/t/deblur-vs-dada2-questions/2093/7
If I understand the protocol correctly, the primers are designed to anneal within the conserved regions just before / after the variable V4 region, and taxonomic information should be harvested from the variable region.
Furthermore, when comparing sequence features across experiments, we need to ensure we are treating them in the same way. I don't see an option to remove primers (even if specified in the pcr_primers column of the prep files) for per_sequence_fastq files. Thus, it is the obligation of the user to ensure sequences are treated correctly.
I myself did not :-/ It took me half a year to realize this and I had to re-analyse three projects. The issue came to my attention because I was reading https://doi.org/10.1038/s41586-019-0878-z, downloaded the genomes of those bugs, used primerprospector to obtain V4 reads, and checked whether I could find those features in my study. The offset of having the primer still in my Deblur feature table did bias this analysis a lot :-/
Therefore, I think Qiita should warn the user of this situation, e.g. by reading the pcr_primer column, comparing it against the beginning / end of the uploaded sequences, and testing whether primer sequences are still contained in, say, more than 90% of individual reads or 90% of features in the final feature table.
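A rough sketch of what such a check might look like is given below. It is not Qiita code: the function names and the 0.9 threshold are assumptions for illustration, the primer would in practice come from the prep file, and only the forward primer at the 5' end is tested (a reverse primer at the 3' end would additionally need reverse-complementing). Degenerate bases are expanded with the standard IUPAC nucleotide codes.

```python
# Standard IUPAC nucleotide codes, used to expand degenerate primer positions.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def starts_with_primer(seq, primer):
    """True if seq begins with primer, honouring degenerate primer bases."""
    if len(seq) < len(primer):
        return False
    return all(
        base in IUPAC.get(code, "")
        for base, code in zip(seq.upper(), primer.upper())
    )


def fraction_with_primer(seqs, primer):
    """Fraction of sequences (reads or features) that still start with the primer."""
    seqs = list(seqs)
    if not seqs:
        return 0.0
    return sum(starts_with_primer(s, primer) for s in seqs) / len(seqs)


def should_warn(seqs, primer, threshold=0.9):
    """Hypothetical rule: warn if more than `threshold` of sequences carry the primer."""
    return fraction_with_primer(seqs, primer) > threshold


# Example with the 515F primer (contains the degenerate base M = A or C);
# the reads themselves are made up.
primer_515f = "GTGCCAGCMGCCGCGGTAA"
reads = [
    "GTGCCAGCAGCCGCGGTAATACGGAGG",  # primer present (M = A)
    "GTGCCAGCCGCCGCGGTAATACGTAGG",  # primer present (M = C)
    "TACGGAGGGTGCAAGCG",            # primer already removed
]
print(fraction_with_primer(reads, primer_515f))
print(should_warn(reads, primer_515f))
```

The same function can be applied either to a sample of raw reads or to the feature sequences of the final table, matching the two variants suggested above.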