ropensci / taxizedb

Tools for Working with Taxonomic SQL Databases
Other
30 stars 7 forks source link

Add `children`, `taxid2name`, `name2taxid` and `name2taxid_map` functions #20

Closed arendsee closed 6 years ago

arendsee commented 6 years ago

Description

I've added four new functions:

  1. children - this exactly replicates the taxize::children function
  2. taxid2name - convert a vector of taxon IDs to a vector of scientific names
  3. name2taxid - convert a vector of names to a vector of taxon IDs (die on ambiguous names)
  4. name2taxid_map - convert a vector of names to a data.frame, with columns for taxon ids and scientific names (this allows ambiguity)

All of these currently only work for the NCBI database.

Example

children(3701)

taxid2name(3701, 99999999, 9606)

# unicorns are real but dragons aren't
name2taxid('pig', 'Homo sapiens', 'unicorn', 'dragon', 'horse')

name2taxid_map(c('pig', 'Homo sapiens', 'unicorn', 'dragon', 'horse'))

There are a couple wrinkles that to be ironed out:

# 'dragon' should be named
name2taxid_map(c('pig', 'Homo sapiens', 'unicorn', 'dragon', 'horse'))

# this should fail
name2taxid('Bacteria')

# this should return the walking sticks as well
name2taxid_map('Bacteria')

I'll fix these problems in a bit.

codecov-io commented 6 years ago

Codecov Report

Merging #20 into master will increase coverage by 6.19%. The diff coverage is 92.48%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #20      +/-   ##
==========================================
+ Coverage   62.76%   68.95%   +6.19%     
==========================================
  Files           7       11       +4     
  Lines         384      496     +112     
==========================================
+ Hits          241      342     +101     
- Misses        143      154      +11
Impacted Files Coverage Δ
R/ap_taxid2name.R 83.87% <83.87%> (ø)
R/ap_children.R 90.9% <90.9%> (ø)
R/ap_classification.R 92.45% <92.45%> (ø)
R/ap_name2taxid.R 94.28% <94.28%> (ø)
R/ap_core.R 98% <98%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update c0e77ed...b1da38b. Read the comment docs.

arendsee commented 6 years ago

Actually, having looked more closely at taxize::children, I realize taxizedb::ncbi_children is missing a few parameters. I'll review the new functions and port over the relevant parameters.

I think I can remove the name2taxid_map function, since taxize::children instead sets out_type='summary' to get the same result.

Sorry, I didn't quite realize how incomplete this was ...

sckott commented 6 years ago

okay, will wait

arendsee commented 6 years ago

OK, now name2taxid_map is working for ambiguous cases:

taxizedb::name2taxid_map('Bacteria')
# A tibble: 2 x 2
  name_txt tax_id
     <chr>  <int>
1 Bacteria      2
2 Bacteria 629395
arendsee commented 6 years ago

OK, I killed the name2taxid_map function, now:

> name2taxid('Bacteria')
[1] "2"
> name2taxid('Bacteria', out_type='summary')
Auto-disconnecting SQLiteConnection
Database already exists, returning old file
# A tibble: 2 x 2
  name_txt tax_id
     <chr>  <int>
1 Bacteria      2
2 Bacteria 629395

There is still the matter of bad handling of ambiguous cases, I'll get that next.

Perhaps I should have the output of name2taxid as an integer? Problem with that, though, is the IDs are not really numeric (e.g. it doesn't make sense to add them to one another).

sckott commented 6 years ago

i think character makes sense, as you say doesn't make sense to add them together

sckott commented 6 years ago

thinking about fxn names (let me know if I misunderstand the goal of the functions):

arendsee commented 6 years ago

Hmm, hard decision. I like the symmetry of name2taxid and taxid2name. Maybe we could add get_* functions as database specific wrappers around the name2taxid function? They can be designed to behave the same as their taxize counterparts.

arendsee commented 6 years ago

With the new commit, name2taxid dies with clear message on ambiguous names when the output format is 'uid'. Also, name2taxid(x, out_type='summary') now casts ids as characters, following our "ID as character" convention.

It definitely would be nice to add some of the ambiguity resolving arguments used in get_uid (division_filter and rank_filter). However, I may need to leave this for later.

sckott commented 6 years ago

i think we can leave fxn names as you have for now.

It definitely would be nice to add some of the ambiguity resolving arguments used in get_uid (division_filter and rank_filter). However, I may need to leave this for later.

right, makes sense to think about later.