--makeudb_usearch truncates fasta headers

dleopold commented 7 months ago

When a udb file is pre-generated, the target ids returned by --usearch_global are the sequence headers truncated at the first space. In contrast, when a fasta file is used directly as the reference database, the sequence headers are not truncated in the returned targets / hits. Either approach is probably fine, but it seems like search results should be consistent regardless of whether the --db file is .fasta or .udb.

torognes commented 7 months ago

Hi, thanks for the feedback.

I think this depends on whether the --notrunclabels option was specified when the UDB file was created with the --makeudb_usearch command. The default is to truncate the headers at the first space. If --notrunclabels is given, the full header is included.

If the UDB file is created without --notrunclabels, only the initial part (the id) will be kept in the UDB file. If the UDB file is created with the --notrunclabels option, the full header is included and will also be shown in the results when searching this database, even if --notrunclabels is not included in the search command.
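To make the two cases concrete, here is a small Python model of the behaviour described above. It is not vsearch code, only an illustration: the function names and the example header are made up, and truncation is modelled as "cut at the first space", as stated in this thread.

```python
# Illustrative model only: this is not vsearch code, just the behaviour
# described in this thread. All names here are hypothetical.

def stored_label(header: str, notrunclabels: bool = False) -> str:
    """Label kept in the UDB file for one FASTA header (without the leading '>')."""
    if notrunclabels:
        # UDB built with --notrunclabels: the full header is kept.
        return header
    # Default: keep only the id, i.e. everything before the first space.
    return header.split(" ", 1)[0]


def reported_target(label_in_udb: str, search_notrunclabels: bool = False) -> str:
    """Target label reported when searching a UDB file.

    The search-time option is ignored in this model: whatever was stored
    in the UDB file is what gets reported.
    """
    return label_in_udb


header = "seq1 Fungi;Ascomycota"  # made-up example header

default_udb = stored_label(header)                    # built without --notrunclabels
full_udb = stored_label(header, notrunclabels=True)   # built with --notrunclabels

print(reported_target(default_udb, search_notrunclabels=True))   # seq1
print(reported_target(full_udb, search_notrunclabels=False))     # seq1 Fungi;Ascomycota
```

In this model the decision is made once, when the database is built. A truncated UDB cannot give back the full header later, and a full-header UDB reports it regardless of the search options.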

One could argue that when searching without the --notrunclabels option, the headers should always be truncated, even if they come from a UDB file with the full headers included. One could also argue that the --makeudb_usearch command should always include the full headers; however, some users may want to save some space by truncating long headers.

I hope this clarifies the issue.

I do not think the following is correct, unless you also specify the --notrunclabels option:

"In contrast, when a fasta file is used directly as the reference database the sequence headers are not truncated in the returned targets / hits."

In which situation did you experience this?

dleopold commented 7 months ago

You are correct, I just needed to pass --notrunclabels to --makeudb_usearch when pre-generating the database. Thank you!
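For anyone landing here later, the fix can be sketched as the pair of invocations below, assembled in Python. The helper names, file names, and the 0.97 identity threshold are placeholders and not taken from this thread. The helpers only build argument lists and do not run vsearch.

```python
from typing import List

# Hypothetical helpers: they only assemble argument lists, nothing is executed.
# File names and the identity threshold are placeholders.

def makeudb_command(fasta: str, udb: str, notrunclabels: bool = True) -> List[str]:
    """Arguments for building a UDB file, keeping full headers by default."""
    cmd = ["vsearch", "--makeudb_usearch", fasta, "--output", udb]
    if notrunclabels:
        cmd.append("--notrunclabels")
    return cmd


def search_command(queries: str, udb: str, hits: str, identity: float = 0.97) -> List[str]:
    """Arguments for searching the pre-built UDB file."""
    return ["vsearch", "--usearch_global", queries, "--db", udb,
            "--id", str(identity), "--blast6out", hits]


build = makeudb_command("reference.fasta", "reference.udb")
search = search_command("queries.fasta", "reference.udb", "hits.b6")

print(" ".join(build))
print(" ".join(search))
```

If vsearch is installed, each list can be passed to `subprocess.run(cmd, check=True)`.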

frederic-mahe commented 7 months ago

Tests added to our test suite: https://github.com/frederic-mahe/vsearch-tests/commit/85162a264febba1e4d259c77e6513800241fcd48

torognes / vsearch

--makeudb_usearch truncates fasta headers #543