sherrillmix / taxonomizr

Parse NCBI taxonomy and accessions to find taxonomic assignments
GNU General Public License v2.0

matching old accession number #6

Closed ngeraldi closed 6 years ago

ngeraldi commented 6 years ago

Hi, great package. I am using it to fix/define taxonomic assignments from a Silva 18S database. The dmp files that I just downloaded seem to have slightly different accession numbers than the Silva database, so I am getting NAs after running the accessionToTaxa function. See below for an example. Is there a quick way to remove everything from the first "." onward, perhaps within the SQL? Removing everything after the first "." in both databases should result in the correct taxonomy, as far as I can see. I am new to GitHub and to SQL, so I apologize if this is not a good place for this question. Thanks.

Silva accession numbers

"AC090637.149908.151196","AC091599.220.1669","AC091632.4938.6802","AC207586.19448.20862", "JQ776649.1.1382","JQ781512.1.1275"

Accession numbers that work with the recently downloaded and processed SQL database (built using your package) and that match the current NCBI version numbers (double-checked on the web):

"AC090637.2","AC091599.1","AC091632.1","AC207586.3","JQ776649.2","JQ781512.1"

sherrillmix commented 6 years ago

Hmm. Those are some weird numbers from Silva. Do you have any idea what they're doing? They seem a bit unusual since, as far as I know, the format is usually ID#.version#, so AC090637.149908.151196 would mean the 149908th version of sequence AC090637 (plus the additional 151196). I didn't see anything on a quick Google search.

But in any case, you should probably manipulate the numbers before passing to taxonomizr. For example, something like:

silvaIDs<-c("AC090637.149908.151196","AC091599.220.1669","AC091632.4938.6802","AC207586.19448.20862", "JQ776649.1.1382","JQ781512.1.1275")
baseIDs<-sub('\\..*','',silvaIDs)

would probably get you the base accession without the extra Silva stuff. Unfortunately, taxonomizr currently wants versioned accession numbers. I've been meaning to add an option to allow unversioned accession numbers so this is good motivation. I'll add that in the next day or two and get back to you.
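In case it helps to keep track of which Silva ID each base accession came from, here is a minimal base-R sketch. It splits on the literal dots instead of using the regular expression, which should give the same first field. The name `idMap` is only illustrative and is not part of taxonomizr.

```r
# Example Silva-style IDs taken from the thread above
silvaIDs <- c("AC090637.149908.151196", "AC091599.220.1669", "JQ776649.1.1382")

# Split on literal dots; fixed = TRUE stops "." being treated as a regex wildcard
parts <- strsplit(silvaIDs, ".", fixed = TRUE)

# The first field is the base accession, the same result as sub('\\..*', '', silvaIDs)
baseIDs <- vapply(parts, `[`, character(1), 1)

# Illustrative lookup table pairing each original ID with its base accession
idMap <- data.frame(silva = silvaIDs, base = baseIDs, stringsAsFactors = FALSE)
print(idMap)
```

Keeping the original IDs alongside the stripped ones makes it easier to join any taxonomy results back to the Silva records later.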

ngeraldi commented 6 years ago

Thanks for the quick response. I spent about 30 minutes and could not determine how the Silva accession numbers are made, only that they are based on the EMBL numbers, but those are the same as NCBI's and not like Silva's. Thanks for the suggestions; I will stand by for the taxonomizr upgrade.

sherrillmix commented 6 years ago

Sorry this took me a while to get into the package, but I think the current GitHub version should work with versionless accession numbers. You'll need to delete the accession database and rebuild it with read.accession2taxid. Then you should be able to get the taxonomy with:

silvaIDs<-c("AC090637.149908.151196","AC091599.220.1669","AC091632.4938.6802","AC207586.19448.20862", "JQ776649.1.1382","JQ781512.1.1275")
baseIDs<-sub('\\..*','',silvaIDs)
accessionToTaxa(baseIDs,"accessionTaxa.sql",version='base')
[1]    9606    9606    9606    9606 1177578     471
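
To line the returned taxonomic IDs back up with the original Silva IDs and spot anything that still comes back as NA, something like the following might work. This is only a sketch: the `taxaIDs` values are copied from the output above rather than recomputed, and `results` and `unmatched` are illustrative names.

```r
silvaIDs <- c("AC090637.149908.151196", "AC091599.220.1669", "AC091632.4938.6802",
              "AC207586.19448.20862", "JQ776649.1.1382", "JQ781512.1.1275")
baseIDs <- sub("\\..*", "", silvaIDs)

# Taxonomic IDs copied from the accessionToTaxa output shown above (not recomputed here)
taxaIDs <- c(9606, 9606, 9606, 9606, 1177578, 471)

# One row per Silva ID so results stay aligned with the input order
results <- data.frame(silva = silvaIDs, base = baseIDs, taxa = taxaIDs,
                      stringsAsFactors = FALSE)

# Accessions with no match would appear as NA in the taxa column
unmatched <- results$silva[is.na(results$taxa)]

print(results)
print(unmatched)
```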