Unranked targets when building custom DB

donovan-h-parks commented 1 year ago

Hi,

I'm building a custom DB from a large set of genome files. I'm indicating the TaxonId of each sequence using the NCBI-style accession2taxid tab-separated files. When I build the DB, it appears some sequences are not being given a rank. Specifically, the output of metacache build indicates 262383 targets remain unranked. Does this mean that MetaCache identified and placed sequenced in the DB that do not have a taxonomic assignment (i.e. associated TaxonId)? Is there any way to determine which sequences remain unranked?

Thanks, Donovan

>metacache build gtdb_r207_db ./gtdb_reps/genomes/ -taxonomy ./gtdb_taxonomy/R207 -reset-taxa -taxpostmap accn_to_taxid.tsv

Building new database 'gtdb_r207_db' from reference sequences.
Max locations per feature set to 254
Reading taxon names ... done.
Reading taxonomic node mergers ... done.
Reading taxonomic tree ... 401816 taxa read.
Taxonomy applied to database.
------------------------------------------------
MetaCache version    2.2.3 (20220708)
database version     20200820
------------------------------------------------
sequence type        mc::char_sequence
target id type       unsigned int 32 bits
target limit         4294967295
------------------------------------------------
window id type       unsigned int 32 bits
window limit         4294967295
window length        127
window stride        112
------------------------------------------------
sketcher type        mc::single_function_unique_min_hasher<unsigned int, mc::same_size_hash<unsigned int> >
feature type         unsigned int 32 bits
feature hash         mc::same_size_hash<unsigned int>
kmer size            16
kmer limit           16
sketch size          16
------------------------------------------------
bucket size type     unsigned char 8 bits
max. locations       254
location limit       254
------------------------------------------------
Processing reference sequences.
Added 8302685 reference sequences in 14348.7 s
targets              8302685
ranked targets       0
taxa in tree         401816
------------------------------------------------
buckets              964032481
bucket size          max: 254 mean: 46.7784 +/- 61.2667 <> 2.0026
features             518555067
dead features        0
locations            24257165919
------------------------------------------------
8302685 targets are unranked.
Try to map sequences to taxa using 'accn_to_taxid.tsv' (381 MB)
262383 targets remain unranked.
Writing database to file ... Writing database metadata to file 'gtdb_r207_db.meta' ... done.
Writing database part to file 'gtdb_r207_db.cache0' ... done.
done.

Funatiq commented 1 year ago

Does this mean that MetaCache identified and placed sequenced in the DB that do not have a taxonomic assignment (i.e. associated TaxonId)?

Yes, exactly.

Is there any way to determine which sequences remain unranked?

Unfortunately, there is no direct way. I think the best way would be to generate a list of all targets using

matacache info <database> lin

and check if the columns of the lineages are 0 (which means no valid TaxonId).

donovan-h-parks commented 1 year ago

Hi,

Thanks for suggesting metacache info <database> lin. I've been able to use this to identify that the issue relates to FASTA files from NCBI that have accessions of a specific form, e.g. NZ_CAJRAF010000001.1. It appears MetaCache munges this name and records it as CAJRAF010000001.1. This appears to be related to the length of the accession as NZ_AUDU01000044.1 is retained as NZ_AUDU01000044.1.

Is this the expected operation of MetaCache and I should take care to remove any characters before an initial underscore from accessions >18 characters when providing a mapping file to -taxpostmap? This seems like a brittle rule for me to implement so was hoping you could provide some details on how MetaCache modifies accessions.

Thanks, Donovan

muellan commented 1 year ago

Hi,

accession numbers are a total mess. We use a regex to identify NCBI-style accession or accession.version sequence identifiers. For some reason that I don't remember we only allow the letter part to be 7 characters long (including the underscore).

If you want a super quick fix, go to the file "src/sequence_io.cpp" line 471 and replace the regex "(^|[^[:alnum:]])(([A-Z][_A-Z]{1,6}[0-9]{5,})(\\.[0-9]+)?)" with "(^|[^[:alnum:]])(([A-Z][_A-Z]{1,9}[0-9]{5,})(\\.[0-9]+)?)" (notice that the 6 got replaced with 9 which allows up to 10 letter characters at the start of the accession id) and recompile metacache. That should solve your problem. I think we'll include such a change in the next release if it doesn't break anything.

André

donovan-h-parks commented 1 year ago

Hi,

Thanks. I can look to implement the same regex expression to ensure consistency. Why is modifying accessions necessary? I'd like to add in my own genomes that don't necessarily have NCBI-style accessions. This is made a bit more complicated if I have to account for changes that might be made by MetaCache.

Thanks, Donovan

muellan commented 4 months ago

This should be solved by the changes in version v.2.4.2.

muellan / metacache

Unranked targets when building custom DB #35