qiita-spots / qiita

Qiita - A multi-omics databasing effort
https://qiita.ucsd.edu/
BSD 3-Clause "New" or "Revised" License

warn to trim primers #2820

Open sjanssen2 opened 5 years ago

sjanssen2 commented 5 years ago

There is a discussion about why to trim or not to trim primers prior to DADA2 or Deblur: https://forum.qiime2.org/t/deblur-vs-dada2-questions/2093/7

As far as I understand the protocol, the primers are designed to anneal within the conserved regions before / after the variable V4 region, and taxonomic information should be harvested from this region.

Furthermore, when comparing sequence features across experiments, we need to ensure we are treating them in the same way. I don't see an option to remove primers (even if specified in the pcr_primers column of prep files) for per_sequence_fastq files. Thus, it is the user's obligation to ensure sequences are treated correctly.

I myself did not :-/ It took me half a year to realize this, and I had to re-analyse three projects. The issue came to my attention because I was reading https://doi.org/10.1038/s41586-019-0878-z, downloaded the genomes of those bugs, used primerprospector to obtain V4 reads, and checked whether I could find those features in my study. The offset of having the primer still in my Deblur feature table did bias this analysis a lot :-/

Therefore, I think Qiita should warn the user about this situation, e.g. by reading the pcr_primers column, comparing it against the beginning / end of the uploaded sequences, and testing whether primer sequences are still contained in, say, more than 90% of individual reads or 90% of features in the final feature table.
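The proposed check could be sketched roughly as follows. This is only an illustration, not Qiita code: the function names (`primer_regex`, `fraction_with_primer`, `should_warn`) are hypothetical, the primer is assumed to have already been extracted from the prep file, matching is exact (no mismatches or indels allowed), and only the start of each read is inspected. The 90% threshold is the one suggested above.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes, so that
# degenerate primer positions (e.g. M = A or C) match real reads.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer):
    """Compile a (hypothetical helper) regex for a possibly degenerate primer."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def fraction_with_primer(reads, primer):
    """Fraction of reads (or feature sequences) that start with the primer."""
    pattern = primer_regex(primer)
    reads = list(reads)
    if not reads:
        return 0.0
    # re.match anchors at the start of the string, i.e. the 5' end of the read.
    hits = sum(1 for read in reads if pattern.match(read.upper()))
    return hits / len(reads)


def should_warn(reads, primer, threshold=0.9):
    """True if more than `threshold` of the sequences still carry the primer."""
    return fraction_with_primer(reads, primer) > threshold
```

For example, `should_warn(reads, "GTGCCAGCMGCCGCGGTAA")` would test reads against a 515F-style forward primer; the same functions could be applied to the feature sequences of a final table instead of raw reads. A real implementation would likely also need to handle the reverse primer and tolerate sequencing errors.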

antgonza commented 5 years ago

> The offset of having the primer still in my Deblur feature table did bias this analysis a lot :-/

This is interesting, do you have examples of this? How much is a lot, or what was the bias?

Thanks!

sjanssen2 commented 5 years ago

Regarding the 11-mix from the paper: there was no difference. We could not find any feature sequence from their genomes in our study. However, for an extended list of 40 genomes, we found 4 matching sequences in our study when primers were trimmed away, compared to 0 when they were not. That said, the 4 matching features are quite unspecific for the taxon / strain (thousands of 100% identity matches when blasting against NR, with hundreds of different taxonomy IDs). Still, inconsistent primer absence / presence can ruin promising search engines like redbiom.

mortonjt commented 3 years ago

Has this issue been resolved?

Note that I'm seeing some really weird results when I pull down all of the human fecal samples with redbiom. Namely, if I run

ctx="Deblur-Illumina-16S-V4-100nt-fbc5b2"
redbiom search metadata "(feces | fecal | faecal | stool) & (sapien | human)" > human_ids.txt  
cat human_ids.txt | redbiom fetch samples --context $ctx --output human-stool-deblur.biom

I get a deblurred table with 993083 taxa.

The weird part about this is that if I run closed-reference OTU picking with vsearch, I only get 34977 hits (there are 599482 unmatched sequences). I wonder if a lack of primer trimming is inflating the miss rate.

antgonza commented 3 years ago

The issue hasn't been resolved. Note that this specific issue is about warning users of a potential problem with wet-lab methods that do not ignore the linker primer of the sequences.

Anyway, a few questions to try to help you with your specific question:

mortonjt commented 3 years ago

Right on, thanks for responding then - maybe I should move this over to a new issue? Regarding your questions:

  1. No, I didn't - this was just pulling all of the human fecal data from qiita and running closed-ref picking with the default vsearch parameters.
  2. I didn't do that either. Is this a problem upstream? Would it be worthwhile to enforce quality control in qiita?
  3. I used the latest Greengenes reference dataset.

The command I used is as follows:

qiime vsearch cluster-features-closed-reference --i-reference-sequences ~/Documents/databases/97_seqs.qza --i-sequences human-stool-deblur-seqs.qza --i-table human-stool-deblur.biom.qza --p-perc-identity .97 --output-dir closed-reference --p-threads 30

antgonza commented 3 years ago

No problem! Open a new issue? I guess this is fine; next time, I suggest sending an email to qiita.help@gmail.com.

  1. Then I wonder how you got to this issue, or why you assumed it was the linker primer ¯_(ツ)_/¯
  2. AFAIK deblur will not check the orientation against any reference, as it's a denoising step; that happens in taxonomy classification or CR. Try adding --p-strand both
  3. K

wasade commented 3 years ago

@mortonjt, is it that 599482 unique features from the input deblur table failed to recruit to Greengenes, or a total of 599482 reads failed to recruit? What's the total % of reads lost following recruitment, and what fraction of the features failing to recruit are singletons or doubletons?
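As a rough illustration of how these numbers could be computed, here is a minimal sketch. It assumes the per-feature total read counts have already been extracted from the Deblur table into a plain dict and that the IDs of features that recruited to the reference are known; the function name `recruitment_summary` and its inputs are hypothetical, and "singleton or doubleton" is taken to mean a total count of 1 or 2 across all samples.

```python
def recruitment_summary(feature_totals, matched_ids):
    """Summarize closed-reference recruitment.

    feature_totals: dict mapping feature ID -> total read count over all samples
    matched_ids: iterable of feature IDs that recruited to the reference
    """
    matched_ids = set(matched_ids)
    total_reads = sum(feature_totals.values())

    # Features (and their reads) that failed to recruit.
    unmatched = {f: n for f, n in feature_totals.items() if f not in matched_ids}
    lost_reads = sum(unmatched.values())

    # Unmatched features seen only once or twice in total.
    rare = sum(1 for n in unmatched.values() if n <= 2)

    return {
        "unmatched_features": len(unmatched),
        "pct_reads_lost": 100.0 * lost_reads / total_reads if total_reads else 0.0,
        "frac_unmatched_rare": rare / len(unmatched) if unmatched else 0.0,
    }
```

A large number of unmatched features combined with a small percentage of lost reads would point to many low-abundance features rather than a systematic problem.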

A few considerations:

I'm not seeing evidence of a quality control issue or a problem, but rather that there are unknowns. I'm not sure if there is a deviation in expectation as I don't think there has been much investigation (that I'm aware of at least) on the application of closed reference following deblur.

mortonjt commented 3 years ago

These are the former - 599482 unique features from the deblur table that failed to recruit to Greengenes.

  1. Understood, that makes sense. Note that there are 217 features observed in at least 10000 samples and 1341 features observed in at least 1000 samples that weren't observed in GG
  2. Right, I was expecting around 80% recruitment
  3. Makes sense
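Prevalence counts like the ones in point 1 could be obtained along these lines. This is a sketch under assumptions: `feature_samples` is a hypothetical dict mapping each feature ID to the set of sample IDs in which it was observed, and `count_prevalent` is an illustrative name, not an existing function.

```python
def count_prevalent(feature_samples, unmatched_ids, min_samples):
    """Count unmatched features observed in at least `min_samples` samples.

    feature_samples: dict mapping feature ID -> set of sample IDs containing it
    unmatched_ids: iterable of feature IDs that failed to recruit
    """
    return sum(
        1
        for feature in set(unmatched_ids)
        if len(feature_samples.get(feature, ())) >= min_samples
    )
```

Calling it with `min_samples=1000` and `min_samples=10000` on the unmatched features would give the two counts quoted above.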

If quality control isn't an issue, then I think this is really exciting! I didn't realize how many novel bugs weren't represented in GG.

wasade commented 3 years ago

The recurrence of features across samples is exciting. Do these recruit to Greengenes at, say, 90%?