Open zachary-foster opened 7 years ago
It makes sense to validate ranks when a database is specified, and for the validation to be specific to that database. ✔️
For order checking, if the ordering is wrong, do we error with message giving the correct order, or correct the order with message?
what is going on with the replication in ranks_ref
I believe I combined all the different rank names from different data providers and just left the duplicates in; they can def. be taken out.
Do you think this is important to get done before the CRAN push?
For order checking, if the ordering is wrong, do we error with message giving the correct order, or correct the order with message?
I think correcting the order with a message makes more sense.
Thinking about this more, there might be some valid ranks that do not have an order (e.g. "unranked") and can therefore be put in any order relative to ordered ranks (e.g. "species").
This makes correcting for order and encoding it more difficult.
I don't think an ordered factor would work since, AFAIK, you can't make an ordered factor with some values ordered and others not. Below are the options I came up with for encoding the valid ranks in the database class so that they handle a mixture of ordered and unordered ranks.
vector
> c("1" = "root", "2" = "domain", "3" = "kingdom", "NA" = "unranked")
          1           2           3          NA
     "root"    "domain"   "kingdom"  "unranked"
list
> list("1" = c("root", "unranked"), "2" = c("domain", "unranked"), "3" = c("kingdom", "unranked"))
$`1`
[1] "root" "unranked"
$`2`
[1] "domain" "unranked"
$`3`
[1] "kingdom" "unranked"
data.frame
> data.frame(order = c(1:3, NA), rank = c("root", "domain", "kingdom", "unranked"))
  order     rank
1     1     root
2     2   domain
3     3  kingdom
4    NA unranked
Instead of the NA, we could replicate "unranked" for each level, as was done for the list above.
two categories
ordered_ranks <- c("root", "domain", "kingdom")
unordered_ranks <- c("unranked")
Do you like any of those in particular or have another solution?
they can def. be taken out.
Cool
Do you think this is important to get done before the CRAN push?
It's not essential. If it turns out to be an easy change then I think it would be nice, but leaving it out would not pose any backward compatibility issues since we can use database-less ranks for now.
Of the options you gave, I like dealing with either lists or data.frames.
I lean towards putting this off until the next CRAN push
Of the options you gave, I like dealing with either lists or data.frames.
Cool. I like the data.frame the most since it avoids numbers/NA as names.
I lean towards putting this off until the next CRAN push
fine with me
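For reference, here is a minimal sketch of how the data.frame encoding could drive the "correct the order with a message" behaviour discussed above. It is only an illustration under assumptions: `valid_ranks` and `correct_rank_order()` are hypothetical names, not existing package code, and ranks with an `NA` order are simply left in whatever position they were supplied.

```r
# Hypothetical rank table for one database: `order` is NA for ranks that
# have no fixed position (e.g. "unranked").
valid_ranks <- data.frame(
  order = c(1:3, NA),
  rank = c("root", "domain", "kingdom", "unranked"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sort the ordered ranks in `x` among themselves while
# leaving unordered ranks in the positions where they were supplied.
correct_rank_order <- function(x, ref = valid_ranks) {
  unknown <- setdiff(x, ref$rank)
  if (length(unknown) > 0) {
    stop("Invalid rank(s): ", paste(unknown, collapse = ", "))
  }
  pos <- ref$order[match(x, ref$rank)]
  is_ordered <- !is.na(pos)
  corrected <- x
  corrected[is_ordered] <- x[is_ordered][order(pos[is_ordered])]
  if (!identical(corrected, x)) {
    message("Ranks were out of order; corrected to: ",
            paste(corrected, collapse = ", "))
  }
  corrected
}

# "kingdom" and "root" swap places; "unranked" stays in the middle slot.
correct_rank_order(c("kingdom", "unranked", "root"))
```

Whether unordered ranks should keep their position or be moved (e.g. to the end) is a design choice this sketch does not settle.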
Hi @sckott,
I am working on the vignette and I started thinking about ranks. I remember that the ranks used to have to match something in /data/ranks_ref.rda, but I suggested removing that validation since there is too much diversity in rank names to encode easily. Now I am thinking that we can do something in between.
What if the valid ranks were associated with the database class, like the id_regex? We could add a rank_regex option that takes one or more regexes that ranks have to match if a database is defined. Alternatively, if we want to encode rank order as well, then maybe an ordered factor (not of regex) of possible ranks called valid_ranks? In both cases, if a database is defined, then the rank names must be valid (rank constructor) and in a logical order (hierarchy and taxonomy constructors) or an error is thrown; if the database is not defined, then anything goes.
In this design /data/ranks_ref.rda would be removed and perhaps replaced with a list of database objects included with the package.
Thoughts?
Also, what is going on with the replication in ranks_ref?
Thanks
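To make the rank_regex idea concrete, here is a rough sketch of what the check might look like. Everything in it is an assumption for illustration: `example_db` is a plain list standing in for a database object, the patterns are invented, and `validate_ranks()` is a hypothetical helper rather than an existing function.

```r
# Hypothetical stand-in for a database object. Only `rank_regex` corresponds
# to the option proposed above; the rest is invented for illustration.
example_db <- list(
  name = "example",
  rank_regex = c(
    "^(root|domain|kingdom|phylum|class|order|family|genus|species)$",
    "^unranked$"
  )
)

# Hypothetical check: when a database is given, every rank must match at
# least one of its patterns; when it is NULL, anything goes.
validate_ranks <- function(ranks, database = NULL) {
  if (is.null(database)) {
    return(invisible(ranks))
  }
  matches <- vapply(
    ranks,
    function(r) any(vapply(database$rank_regex, grepl, logical(1), x = r)),
    logical(1)
  )
  if (!all(matches)) {
    stop("Rank(s) not valid for database '", database$name, "': ",
         paste(ranks[!matches], collapse = ", "))
  }
  invisible(ranks)
}

validate_ranks(c("genus", "unranked"), example_db)  # passes silently
validate_ranks("anything")                          # no database, so no check
```

A regex-only check like this validates names but says nothing about order, which is the gap the valid_ranks alternative is meant to cover.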