Closed dleopold closed 7 months ago
Hi, thanks for the feedback.
I think this depends on whether the --notrunclabels option was specified when the UDB file was created with the makeudb_usearch command. The default is to truncate the headers at the first space. If --notrunclabels is given, the full header is included.
If the UDB file is created without --notrunclabels, only the initial part (the id) will be kept in the UDB file. If the UDB file is created with the --notrunclabels option, the full header is included and will also be shown in the results when searching this database, even if --notrunclabels is not included in the search command.
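To make the behaviour described above concrete, here is a minimal Python sketch. It is only a model of the description in this comment, not vsearch's actual code: the function names `stored_label` and `make_udb` and the example headers are hypothetical, and truncation is simplified to "cut at the first space".

```python
def stored_label(header: str, notrunclabels: bool = False) -> str:
    """Return the label kept for one FASTA header (without the leading '>').

    Hypothetical model: by default the header is cut at the first space;
    with notrunclabels=True the full header is kept.
    """
    if notrunclabels:
        return header
    return header.split(" ", 1)[0]


def make_udb(headers, notrunclabels: bool = False):
    """Model of database creation: the labels are fixed at build time."""
    return [stored_label(h, notrunclabels) for h in headers]


# Made-up example headers, for illustration only.
headers = ["seq1 Fungi;Ascomycota", "seq2 Fungi;Basidiomycota"]

udb_default = make_udb(headers)                   # built without --notrunclabels
udb_full = make_udb(headers, notrunclabels=True)  # built with --notrunclabels

print(udb_default)  # ['seq1', 'seq2']
print(udb_full)     # ['seq1 Fungi;Ascomycota', 'seq2 Fungi;Basidiomycota']
```

In this model, a later search can only report what was stored in the database, which is why the option has to be given when the UDB file is created.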
One could argue that when searching without the --notrunclabels option, the headers should always be truncated, even if they come from a UDB file with the full headers included. One could also argue that the makeudb_usearch command should always include the full headers; however, some users may want to save some space by truncating long headers.
I hope this clarifies the issue.
I do not think the following is correct, unless you also specify the --notrunclabels option:
In contrast, when a fasta file is used directly as the reference database the sequence headers are not truncated in the returned targets / hits.
In which situation did you experience this?
You are correct, I just needed to pass --notrunclabels to --makeudb_usearch when pre-generating the database. Thank you!
Tests added to our test suite: https://github.com/frederic-mahe/vsearch-tests/commit/85162a264febba1e4d259c77e6513800241fcd48
When a udb file is pre-generated, the ids of the targets returned by --usearch_global are identified by the sequence headers truncated at the first space. In contrast, when a fasta file is used directly as the reference database the sequence headers are not truncated in the returned targets / hits. Either approach is probably fine, but it seems like search results should be consistent regardless of whether the --db file is .fasta or .udb.