Closed Felipealbornoz closed 6 years ago
Hello again Felipe,
So the bug is --otutabout
shows only 1600 OTUs, while -uc
lists all 1601 OTUs. Is that correct?
So what is the OTU that is only in the .uc file? Can you also report the number of times it appears in the full study (;size=??) and how often it appears in each sample?
Maybe you could post your your full table and .uc file to a gist or google drive and place a link here?
Hi Colin,
no, the OTU missing does not appear in either the uc file or the OTU table, but it is in the "OTU.list.fasta" file. This file is the list of OTUs after clustering at 99% ID and removing chimeras.
I dereplicated at a minimum of 10 sequences. so this OTU should have, at the very least, 10 seqs.
Ah OK, so there is an OTU in your database that none of the reads matched to. This is surprising because this database was built from these very reads, so you would expect every read to map and for every OTU to be covered. But one OTU is not!
It's a science mystery! 🔬 🔍
Here is what I think happened:
You could fix this issue in two ways.
1
, the first hit over 99% stops the search).--search_exact
instead of --usearch_global
to perform exact matching. Let me know if this helps, Colin
Hi Colin, thanks for the suggestions!.
I set up a higher number of maxaccepts (3) and it still banishes. It is not in the produced uc or table file. but it IS in my -db file (otu list).
I'm not sure that I should do search_exact with OTUs clustered at 99%. many raw sequences will be lost because of that 1% since the OTU list is a consensus. I tried it anyways, and in fact most of the sequences were discarded for not being an exact match to the OTU list. I end up will ALL my OTUs (mystery OTU appears) but only a few sequences per OTU.
Thanks for the update. Well... I'm out of ideas. Let's see what the vsearch devs have to say.
I think this may be a classic problem of balancing the true positive rate against the false positive rate; one method always gets the exact matches, but leaves much out. The other method includes everything, but fails to find exact matches. This was even part of the vsearch paper, in which they use the F1 score to compare vsearch and usearch. There is a good wiki page about all of this.
Have you tracked down the reads from the mystery OTU, and figure out which other OTUs they go to during --usearch_global?
@Felipealbornoz What commands did you use for clustering and chimera removal?
Hi all. First thanks for your help Colin!. Torbjorn, I used the following commands:
clustering: vsearch -clusterfast derep.fasta --sizein --id 0.99 --relabel OTU --centroids OTU.99.fasta
chimeras: vsearch --uchime_denovo OTU.99.fastaa --nonchimeras OTU.nochimeras.fasta
Hi all, someone helped me with this. @colinbrislawn was pretty close by pointing out that default clustering OTUs only checks for overlapping sections. but when using -iddef 1. the problem is sloved
Glad you found the issue!
For others users that might find this issue, here's the documentation.
--iddef 0|1|2|3|4
Change the pairwise identity definition used in --id. Values accepted are:
0. CD-HIT definition: (matching columns) / (shortest sequence length).
1. edit distance: (matching columns) / (alignment length).
2. edit distance excluding terminal gaps (same as --id).
3. Marine Biological Lab definition counting each gap opening (internal or
terminal) as a single mismatch, whether or not the gap was extended: 1.0
- [(mismatches + gap openings)/(longest sequence length)]
4. BLAST definition, equivalent to --iddef 1 in a context of global pairwise
alignment.
Hi, this might be related to my previous post. I clustered OTUs at 99% ID and removed chimeras. then when mapping OTU list back to original sequences to create an OTU table, the table is missing one OTU. this OTU also does not appear in the .uc life. OTU list has 1601 OTUs, but table has only 1600.
I am using the following command: vsearch -usearch_global original.sequences.fasta -db OTU.list.fasta -strand plus --id 0.99 --maxaccepts 1 --otutabout otu.99.table.txt -uc otu.99.uc