torognes / vsearch

Versatile open-source tool for microbiome analysis
Other
669 stars 123 forks source link

OTU missing in OTU table #294

Closed Felipealbornoz closed 6 years ago

Felipealbornoz commented 6 years ago

Hi, this might be related to my previous post. I clustered OTUs at 99% ID and removed chimeras. then when mapping OTU list back to original sequences to create an OTU table, the table is missing one OTU. this OTU also does not appear in the .uc life. OTU list has 1601 OTUs, but table has only 1600.

I am using the following command: vsearch -usearch_global original.sequences.fasta -db OTU.list.fasta -strand plus --id 0.99 --maxaccepts 1 --otutabout otu.99.table.txt -uc otu.99.uc

colinbrislawn commented 6 years ago

Hello again Felipe,

So the bug is --otutabout shows only 1600 OTUs, while -uc lists all 1601 OTUs. Is that correct?

So what is the OTU that is only in the .uc file? Can you also report the number of times it appears in the full study (;size=??) and how often it appears in each sample?

Maybe you could post your your full table and .uc file to a gist or google drive and place a link here?

Felipealbornoz commented 6 years ago

Hi Colin,

no, the OTU missing does not appear in either the uc file or the OTU table, but it is in the "OTU.list.fasta" file. This file is the list of OTUs after clustering at 99% ID and removing chimeras.

I dereplicated at a minimum of 10 sequences. so this OTU should have, at the very least, 10 seqs.

colinbrislawn commented 6 years ago

Ah OK, so there is an OTU in your database that none of the reads matched to. This is surprising because this database was built from these very reads, so you would expect every read to map and for every OTU to be covered. But one OTU is not!

It's a science mystery! 🔬 🔍

Here is what I think happened:

  1. One of the OTUs in the database is very similar to a more common OTU, so when --usearch_global is run, reads end up mapping to the other, common OTU instead of it's OTUs of origin.
  2. Because nothing matches to this OTU, it's not included in the output .txt or .uc files.

You could fix this issue in two ways.

Let me know if this helps, Colin

Felipealbornoz commented 6 years ago

Hi Colin, thanks for the suggestions!.

  1. I set up a higher number of maxaccepts (3) and it still banishes. It is not in the produced uc or table file. but it IS in my -db file (otu list).

  2. I'm not sure that I should do search_exact with OTUs clustered at 99%. many raw sequences will be lost because of that 1% since the OTU list is a consensus. I tried it anyways, and in fact most of the sequences were discarded for not being an exact match to the OTU list. I end up will ALL my OTUs (mystery OTU appears) but only a few sequences per OTU.

colinbrislawn commented 6 years ago

Thanks for the update. Well... I'm out of ideas. Let's see what the vsearch devs have to say.

I think this may be a classic problem of balancing the true positive rate against the false positive rate; one method always gets the exact matches, but leaves much out. The other method includes everything, but fails to find exact matches. This was even part of the vsearch paper, in which they use the F1 score to compare vsearch and usearch. There is a good wiki page about all of this.

Have you tracked down the reads from the mystery OTU, and figure out which other OTUs they go to during --usearch_global?

torognes commented 6 years ago

@Felipealbornoz What commands did you use for clustering and chimera removal?

Felipealbornoz commented 6 years ago

Hi all. First thanks for your help Colin!. Torbjorn, I used the following commands:

clustering: vsearch -clusterfast derep.fasta --sizein --id 0.99 --relabel OTU --centroids OTU.99.fasta

chimeras: vsearch --uchime_denovo OTU.99.fastaa --nonchimeras OTU.nochimeras.fasta

Felipealbornoz commented 6 years ago

Hi all, someone helped me with this. @colinbrislawn was pretty close by pointing out that default clustering OTUs only checks for overlapping sections. but when using -iddef 1. the problem is sloved

colinbrislawn commented 6 years ago

Glad you found the issue!

For others users that might find this issue, here's the documentation.

--iddef 0|1|2|3|4
Change the pairwise identity definition used in --id. Values accepted are:
    0. CD-HIT definition: (matching columns) / (shortest sequence length).
    1. edit distance: (matching columns) / (alignment length).
    2. edit distance excluding terminal gaps (same as --id).
    3. Marine Biological Lab definition counting each gap opening (internal or
terminal) as a single mismatch, whether or not the gap was extended: 1.0
- [(mismatches + gap openings)/(longest sequence length)]
    4. BLAST definition, equivalent to --iddef 1 in a context of global pairwise
alignment.