Closed arendsee closed 6 years ago
Merging #20 into master will increase coverage by
6.19%
. The diff coverage is92.48%
.
@@ Coverage Diff @@
## master #20 +/- ##
==========================================
+ Coverage 62.76% 68.95% +6.19%
==========================================
Files 7 11 +4
Lines 384 496 +112
==========================================
+ Hits 241 342 +101
- Misses 143 154 +11
Impacted Files | Coverage Δ | |
---|---|---|
R/ap_taxid2name.R | 83.87% <83.87%> (ø) |
|
R/ap_children.R | 90.9% <90.9%> (ø) |
|
R/ap_classification.R | 92.45% <92.45%> (ø) |
|
R/ap_name2taxid.R | 94.28% <94.28%> (ø) |
|
R/ap_core.R | 98% <98%> (ø) |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update c0e77ed...b1da38b. Read the comment docs.
Actually, having looked more closely at taxize::children
, I realize taxizedb::ncbi_children
is missing a few parameters. I'll review the new functions and port over the relevant parameters.
I think I can remove the name2taxid_map
function, since taxize::children
instead sets out_type='summary'
to get the same result.
Sorry, I didn't quite realize how incomplete this was ...
okay, will wait
OK, now name2taxid_map is working for ambiguous cases:
taxizedb::name2taxid_map('Bacteria')
# A tibble: 2 x 2
name_txt tax_id
<chr> <int>
1 Bacteria 2
2 Bacteria 629395
OK, I killed the name2taxid_map function, now:
> name2taxid('Bacteria')
[1] "2"
> name2taxid('Bacteria', out_type='summary')
Auto-disconnecting SQLiteConnection
Database already exists, returning old file
# A tibble: 2 x 2
name_txt tax_id
<chr> <int>
1 Bacteria 2
2 Bacteria 629395
There is still the matter of bad handling of ambiguous cases, I'll get that next.
Perhaps I should have the output of name2taxid
as an integer? Problem with that, though, is the IDs are not really numeric (e.g. it doesn't make sense to add them to one another).
i think character makes sense, as you say doesn't make sense to add them together
thinking about fxn names (let me know if I misunderstand the goal of the functions):
name2taxid
is similar to the get_*()
family of functions in taxize
. I guess this function has less metadata around it (the attributes returned in get_*
functions), but essentially doing the same thing. the get_*
fxns are named poorly i think, but it's too late now to change them 😄 - anyway, we could keep consistency with taxize here, or maybe it's not worth it. thoughts?taxid2name
- i don't think there's an equiv. in taxize
, so name is good i thinkHmm, hard decision. I like the symmetry of name2taxid
and taxid2name
. Maybe we could add get_*
functions as database specific wrappers around the name2taxid
function? They can be designed to behave the same as their taxize counterparts.
With the new commit, name2taxid
dies with clear message on ambiguous names when the output format is 'uid'. Also, name2taxid(x, out_type='summary')
now casts ids as characters, following our "ID as character" convention.
It definitely would be nice to add some of the ambiguity resolving arguments used in get_uid
(division_filter
and rank_filter
). However, I may need to leave this for later.
i think we can leave fxn names as you have for now.
It definitely would be nice to add some of the ambiguity resolving arguments used in get_uid (division_filter and rank_filter). However, I may need to leave this for later.
right, makes sense to think about later.
Description
I've added four new functions:
children
- this exactly replicates thetaxize::children
functiontaxid2name
- convert a vector of taxon IDs to a vector of scientific namesname2taxid
- convert a vector of names to a vector of taxon IDs (die on ambiguous names)name2taxid_map
- convert a vector of names to a data.frame, with columns for taxon ids and scientific names (this allows ambiguity)All of these currently only work for the NCBI database.
Example
There are a couple wrinkles that to be ironed out:
I'll fix these problems in a bit.