Open donovan-h-parks opened 1 year ago
Does this mean that MetaCache identified and placed sequences in the DB that do not have a taxonomic assignment (i.e. associated TaxonId)?
Yes, exactly.
Is there any way to determine which sequences remain unranked?
Unfortunately, there is no direct way. I think the best way would be to generate a list of all targets using
metacache info <database> lin
and check if the columns of the lineages are 0 (which means no valid TaxonId).
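The check described above could be scripted roughly as follows. This is only a sketch: the exact layout of the `metacache info <database> lin` output is not shown in this thread, so the code assumes one target per line, tab-separated, with the target name in the first column and the lineage TaxonIds in the remaining columns. The function name `unranked_targets` is hypothetical.

```python
def unranked_targets(lines, sep="\t"):
    """Return names of targets whose lineage columns are all 0.

    Assumed (not confirmed) layout: name<TAB>taxid<TAB>taxid...
    """
    unranked = []
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment-style lines, if any
        fields = line.rstrip("\n").split(sep)
        if len(fields) < 2:
            continue  # nothing that looks like a lineage
        name, lineage = fields[0], fields[1:]
        # A lineage made only of 0 entries means no valid TaxonId was assigned.
        if all(col.strip() == "0" for col in lineage):
            unranked.append(name)
    return unranked


if __name__ == "__main__":
    import sys
    # Usage (assumed): metacache info <database> lin | python find_unranked.py
    for target in unranked_targets(sys.stdin):
        print(target)
```

If the real output uses a different separator or column order, only the `split` and column selection would need adjusting.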
Hi,
Thanks for suggesting metacache info <database> lin. I've been able to use this to identify that the issue relates to FASTA files from NCBI that have accessions of a specific form, e.g. NZ_CAJRAF010000001.1. It appears MetaCache munges this name and records it as CAJRAF010000001.1. This appears to be related to the length of the accession, as NZ_AUDU01000044.1 is retained as NZ_AUDU01000044.1.
Is this the expected behavior of MetaCache, and should I take care to remove any characters before an initial underscore from accessions >18 characters when providing a mapping file to -taxpostmap? This seems like a brittle rule for me to implement, so I was hoping you could provide some details on how MetaCache modifies accessions.
Thanks, Donovan
Hi,
Accession numbers are a total mess. We use a regex to identify NCBI-style accession or accession.version sequence identifiers. For some reason that I don't remember, we only allow the letter part to be at most 7 characters long (including the underscore).
If you want a super quick fix, go to the file "src/sequence_io.cpp" line 471 and replace the regex
"(^|[^[:alnum:]])(([A-Z][_A-Z]{1,6}[0-9]{5,})(\\.[0-9]+)?)"
with
"(^|[^[:alnum:]])(([A-Z][_A-Z]{1,9}[0-9]{5,})(\\.[0-9]+)?)"
(notice that the 6 got replaced with 9, which allows up to 10 letter characters at the start of the accession id) and recompile metacache.
That should solve your problem. I think we'll include such a change in the next release if it doesn't break anything.
André
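To see why the prefix is dropped, the two patterns can be tried on the accessions from this thread. The sketch below is a Python translation, not MetaCache's own code: `[[:alnum:]]` is written as `[A-Za-z0-9]` because Python's `re` module has no POSIX character classes, and it assumes the identifier is taken from the second capture group of a leftmost search, which the thread does not confirm. The helper name `extract_accession` is hypothetical.

```python
import re

# Original pattern: at most 7 letter/underscore characters before the digits.
OLD = re.compile(r"(^|[^A-Za-z0-9])(([A-Z][_A-Z]{1,6}[0-9]{5,})(\.[0-9]+)?)")
# Suggested pattern: at most 10 letter/underscore characters before the digits.
NEW = re.compile(r"(^|[^A-Za-z0-9])(([A-Z][_A-Z]{1,9}[0-9]{5,})(\.[0-9]+)?)")


def extract_accession(pattern, header):
    """Return the accession[.version] found by a leftmost search, or None."""
    m = pattern.search(header)
    return m.group(2) if m else None


for acc in ("NZ_CAJRAF010000001.1", "NZ_AUDU01000044.1"):
    print(acc, extract_accession(OLD, acc), extract_accession(NEW, acc))
```

With the original pattern, "NZ_CAJRAF" is 9 characters and cannot match from the start of the string, so the search restarts at the underscore (a non-alphanumeric character) and captures only CAJRAF010000001.1. "NZ_AUDU" is 7 characters and fits, so it is kept whole.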
Hi,
Thanks. I can look to implement the same regular expression to ensure consistency. Why is modifying accessions necessary? I'd like to add in my own genomes that don't necessarily have NCBI-style accessions. This is made a bit more complicated if I have to account for changes that might be made by MetaCache.
Thanks, Donovan
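One way to apply the same regex for consistency is to audit the accessions in a mapping file before building, listing those that would be recorded under a different name. This is a sketch under assumptions: it uses a Python translation of the original (unpatched) pattern quoted earlier, assumes the second capture group of a leftmost search is what gets recorded, and says nothing about identifiers that do not match at all, since the thread does not describe how those are handled. The name `altered_accessions` is hypothetical.

```python
import re

# Python translation of the original pattern ([[:alnum:]] -> [A-Za-z0-9]).
ACCESSION = re.compile(r"(^|[^A-Za-z0-9])(([A-Z][_A-Z]{1,6}[0-9]{5,})(\.[0-9]+)?)")


def altered_accessions(ids):
    """Map each id to the shorter form the pattern would capture.

    Only ids that match AND come out different are reported; ids that do
    not match the pattern are skipped, because their handling is unknown.
    """
    changed = {}
    for seq_id in ids:
        m = ACCESSION.search(seq_id)
        if m and m.group(2) != seq_id:
            changed[seq_id] = m.group(2)
    return changed


ids = ["NZ_CAJRAF010000001.1", "NZ_AUDU01000044.1", "my_genome_1"]
for original, recorded in altered_accessions(ids).items():
    print(original, "->", recorded)
```

Any id reported here would need its key in the mapping file adjusted to the captured form, or the regex changed as suggested above.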
Hi,
I'm building a custom DB from a large set of genome files. I'm indicating the TaxonId of each sequence using the NCBI-style accession2taxid tab-separated files. When I build the DB, it appears some sequences are not being given a rank. Specifically, the output of metacache build indicates 262383 targets remain unranked. Does this mean that MetaCache identified and placed sequences in the DB that do not have a taxonomic assignment (i.e. associated TaxonId)? Is there any way to determine which sequences remain unranked?
Thanks, Donovan