torognes / vsearch

Versatile open-source tool for microbiome analysis

Sintax sometimes only outputs the ID with no further columns #511

Closed Shellfishgene closed 1 year ago

Shellfishgene commented 1 year ago

Hi!

The sintax command sometimes outputs only the ID column for some of my sequences.

$ usearch -sintax error_seq.fq -db SILVA138_RESCRIPt.udb -sintax_cutoff 0.6 -strand both -tabbedout usearch.txt
$ cat usearch.txt 
A00808:1162:HGN3HDRX2:1:2268:21649:2973 1:N:0:AGACCTTG+GATGCTAC d:Eukaryota(0.6300),p:Arthropoda(0.1008),c:Insecta(0.0141),o:Diptera(0.0017),f:Diptera(0.0002),g:Diptera(0.0000),s:Sarcophaga_shirakii(0.0000)      +       d:Eukaryota

$ vsearch -sintax error_seq.fq -db SILVA138_RESCRIPt.udb -sintax_cutoff 0.6 -strand both -tabbedout vsearch.txt
$ cat vsearch.txt 
A00808:1162:HGN3HDRX2:1:2268:21649:2973 1:N:0:AGACCTTG+GATGCTAC

The minimal example given in #493 also produces only the ID in v2.22.1_linux_x86_64. However, that example is a case with no match, whereas in the example above usearch produces a match, at least for Eukaryota.

torognes commented 1 year ago

Hi, thanks for pointing out this issue. It's difficult to say what is happening here without the actual sequences. Could you send me the error_seq.fq file? I've got the SILVA file.

Since the probability for Eukaryota in the example is just 0.63 and the cutoff is 0.60, I suspect that there is some random variation in the results because the SINTAX algorithm has some built-in randomness. That could cause a match in some runs and no match in other runs. You might get different results if you run it again.
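To see which ranks sit close enough to the cutoff for run-to-run variation to flip the result, the prediction string can be scanned for confidences near the threshold. The following is only a minimal sketch: the function name `borderline_ranks` and the `margin` of 0.05 are illustrative assumptions, not part of vsearch or usearch, and the regular expression assumes the `rank:Name(confidence)` layout seen in the usearch output above.

```python
import re

# Matches one "rank:Name(confidence)" entry, e.g. "d:Eukaryota(0.6300)".
# Assumes names contain no commas or parentheses (true for the example above).
RANK_RE = re.compile(r"([a-z]):([^,()]+)\(([0-9.]+)\)")


def borderline_ranks(prediction, cutoff, margin=0.05):
    """Return (rank, name, confidence) for entries within `margin` of `cutoff`.

    `margin` is a hypothetical tolerance chosen for illustration only.
    """
    hits = []
    for rank, name, conf in RANK_RE.findall(prediction):
        conf = float(conf)
        if abs(conf - cutoff) <= margin:
            hits.append((rank, name, conf))
    return hits


# Prediction string copied from the usearch output in this thread.
prediction = (
    "d:Eukaryota(0.6300),p:Arthropoda(0.1008),c:Insecta(0.0141),"
    "o:Diptera(0.0017),f:Diptera(0.0002),g:Diptera(0.0000),"
    "s:Sarcophaga_shirakii(0.0000)"
)
print(borderline_ranks(prediction, cutoff=0.6))
```

With the cutoff of 0.6 used here, only the domain-level entry falls inside the margin, which is consistent with the suspicion that this sequence could land on either side of the cutoff.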

Shellfishgene commented 1 year ago

See the sequence below. Independently of the cutoff, is it expected behaviour that vsearch outputs only the ID when there is no match? I noticed this because usearch's sintax_summary command can't deal with that.

>A00808:1162:HGN3HDRX2:1:2268:21649:2973 1:N:0:AGACCTTG+GATGCTAC
AGCCAATTAAGATCCCAACTGGTTCACGTGGCTCACACTCCTACAACATGTTCTGTTCAGAGTATTTCAAGTCAGGTGAGAACCCTGATAATGTTTTCAAACACTATAAGGACAGATTATTTACATGCATTATAACTATTATAGACCATGGCTAAAATATAGGGTAACATT

torognes commented 1 year ago

Yes, that is expected behaviour: the ID followed by two empty columns, separated by tabs. It seems like usearch may always or more often give a match, no matter the confidence.

Shellfishgene commented 1 year ago

I see, then I was just confused by #493 stating that there should always be four columns, even with no match. If this is the expected behaviour, I'll close this. Thanks!

torognes commented 1 year ago

Sorry, when there is no match, you'll have the ID in the first column followed by two or three empty columns. There will be a total of four columns if the --sintax_cutoff option was used, otherwise a total of three columns.
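For downstream scripts that trip over the empty columns, a tolerant parser can pad each line to four fields before use. This is a minimal sketch rather than anything shipped with vsearch: the function name `parse_sintax_line` and the dictionary keys are hypothetical, and the column order (label, prediction, strand, cutoff-filtered prediction) is an assumption inferred from the usearch output shown earlier in the thread.

```python
def parse_sintax_line(line):
    """Split one tabbedout line into a dict, padding missing columns with ''.

    Assumed column order (inferred from the example in this thread):
    label, prediction with confidences, strand, cutoff-filtered prediction.
    """
    # Split on tabs only, so labels containing spaces stay intact.
    fields = line.rstrip("\r\n").split("\t")
    # Pad to four fields: covers ID-only lines and three-column output.
    fields += [""] * (4 - len(fields))
    label, prediction, strand, filtered = fields[:4]
    return {
        "label": label,
        "prediction": prediction,
        "strand": strand,
        "filtered": filtered,
    }


def is_unclassified(record):
    """True when the line carries no prediction at all."""
    return record["prediction"] == ""


# A no-match line as described above: the ID followed by three empty columns.
no_match = "A00808:1162:HGN3HDRX2:1:2268:21649:2973 1:N:0:AGACCTTG+GATGCTAC\t\t\t\n"
record = parse_sintax_line(no_match)
print(record["label"], is_unclassified(record))
```

Padding instead of rejecting short lines means the same function handles output produced with or without the cutoff option.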

frederic-mahe commented 1 year ago

Tests covering this issue: https://github.com/frederic-mahe/vsearch-tests/commit/9465796f7151b65c8b41e8f13d8064c90ecaa396