The RefSeq DB was created from signatures calculated with --name-from-first, and that name is then used to look up the taxid for each signature. As a result some signatures end up without an assignment, because the name of the first record is sometimes not in the accession2taxid file provided by CAMI.
Possible solutions:
avoid using --name-from-first and calculate a 'majority name' from the signature.
similar: check when building sigs that the name assignment will be possible later. This would mean loading the accession2taxid file during signature calculation and making sure the name is present in it.
use --name with the GCF accession. This is not really a solution for the data provided for CAMI (because it doesn't have an assembly -> taxid assignment), but it would be a good suggestion to add to the provided CAMI data. (See how https://github.com/pirovc/genome_updater works: it saves the assembly_summary.txt, which contains the taxid for each assembly.)
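A minimal sketch of the 'majority name' idea, resolved directly to a taxid: instead of trusting only the first record, every record accession in a genome votes, and records missing from the mapping are skipped. The function names, the example accessions, and the assumption that the mapping is a plain dict keyed by the first token of each FASTA header are all hypothetical; the file provided by CAMI may use a different key format.

```python
from collections import Counter


def record_accession(header):
    """Return the first whitespace-delimited token of a FASTA header, without '>'."""
    tokens = header.lstrip(">").split()
    return tokens[0] if tokens else ""


def majority_taxid(headers, acc2taxid):
    """Let every record of one genome vote for a taxid.

    `acc2taxid` is assumed to be a dict {accession: taxid} loaded beforehand
    from the accession2taxid file (hypothetical layout).
    Returns (taxid, n_mapped, n_total); taxid is None if no record is mapped.
    """
    votes = Counter()
    total = 0
    for header in headers:
        total += 1
        taxid = acc2taxid.get(record_accession(header))
        if taxid is not None:
            votes[taxid] += 1
    if not votes:
        return None, 0, total
    taxid, _ = votes.most_common(1)[0]
    return taxid, sum(votes.values()), total


# Made-up accessions: the first record is missing from the mapping,
# which is the case where --name-from-first gives no assignment.
headers = [
    ">NZ_XXXX01000001.1 hypothetical contig 1",
    ">NZ_XXXX01000002.1 hypothetical contig 2",
    ">NZ_XXXX01000003.1 hypothetical contig 3",
]
acc2taxid = {"NZ_XXXX01000002.1": 562, "NZ_XXXX01000003.1": 562}

taxid, n_mapped, n_total = majority_taxid(headers, acc2taxid)
print(taxid, n_mapped, n_total)
```

The same function covers the second option as well: running it while building sigs and flagging any genome where the returned taxid is None (or where n_mapped is 0) would show up front which signatures cannot be assigned later.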