torognes / vsearch

Versatile open-source tool for microbiome analysis
Other
643 stars 123 forks source link

usearch_global shows different results for the same seq in different db files #477

Closed ClinicalAI closed 1 year ago

ClinicalAI commented 2 years ago

I have a FASTA read file (around 1million reads, each with length ~150bp) and two database (one with 40 sequences and another with 80 sequences with length ~300-1000). Some sequences are same in the db1 and db2, when I run the following command:

for db1: vsearch --usearch_global read_file.fasta --db db1.fasta --strand both --otutabout out_db1.tsv

for db2: vsearch --usearch_global read_file.fasta --db db2.fasta --strand both --otutabout out_db2.tsv

for example the following seq is available in both db1 and db2:

>Case37
AGGGATGAACTCTTGCATGCCACAAAAGTATTGGTAATAAGCGTTCTCACTCCATTGCTCCACAACGGATTCATCCGAGAGATTACGAAGGTGTTTGAGAATCAACAAGCCACACATCAAACGAATCGGCTTACTCGGGCGACCATTGTGTTGACAGTACAAAGGCCGGAAAGCCATATCGAATTTCTCCCAATCTATTTTGTCGGCCAATTTGTAGAGTGGGTGCGTTTGGTTCAGCATGTCTGATAGACTGCTGAATAATGAAGGACTATGATGGGACTTTTTAATCA

when I checked the results, I see very different result for case37 in out_db1.tsv and out_db2.tsv. Is that because of heuristic nature of the usrerach_global? or any thing else?

Actually, I am looking for a way to find all the hit from the read_file.fasta that I have in my these db1.fasta and db2.fasta. I am not sure using usearch_global is a good option and it returns all the hit or I should use any other option of vsearch. Any help will be appreciated.

Thank you,

frederic-mahe commented 2 years ago

hi @ClinicalAI could you please try again with the --maxaccept 0 --maxreject 0 options?

That would eliminate situations were comparisons for a particular query stopped before reaching Case37.

ClinicalAI commented 2 years ago

@frederic-mahe Thanks, yes. I used --maxaccept 0 --maxreject 0 and still the results are different. You can see the results in the following files. Some sequences are repeated in the both files but their hit number is very different like Case29. CRR224758_db_1.csv CRR224758_db_2.csv

You may make a toy problem and try it by yourself. You need two db files while some of sequences are same in these two.

ClinicalAI commented 2 years ago

Hello. Thanks for the time you put to reply the questions. I am wondering if there is any reply for this?

frederic-mahe commented 2 years ago

Hello,

for db1: vsearch --usearch_global read_file.fasta --db db1.fasta --strand both --otutabout out_db1.tsv

That command is missing an --id threshold.

Some sequences are same in the db1 and db2

Are these db sequences strictly identical? or similar with a small percentage of divergence?

You may make a toy problem and try it by yourself.

The information you gave us is not sufficient to identify the issue and try to replicate it.

for example the seq Case37 is available in both db1 and db2: when I checked the results, I see very different result for case37 in out_db1.tsv and out_db2.tsv.

Maybe there is a sequence in db2.fasta that is a better match for your reads than Case37? That would explain the difference between the two runs.

Actually, I am looking for a way to find all the hit from the read_file.fasta that I have in my these db1.fasta and db2.fasta. I am not sure using usearch_global is a good option and it returns all the hit or I should use any other option of vsearch.

Do you want to find identical matches (vsearch --search_exact)? or matches with at least a certain percentage of similarity (vsearch --usearch_global)?