How are we handling rank validation?

zachary-foster commented 7 years ago

Hi @sckott,

I am working on the vignette and I started thinking about ranks. I remember that the ranks used to have to match something in /data/ranks_ref.rda, but I suggested removing that validation since there is too much diversity in rank names to encode easily. Now I am thinking that we can do something in between.

What if the valid ranks were associated with the database class, like the id_regex. We could add a rank_regex option that takes one or more regexs that ranks have to match if a database is defined? Alternatively, if we want to encode rank order as well, then maybe an ordered factor (not of regex) of possible ranks called valid_ranks? In both cases, if a database is defined, then the rank names must be valid (rank constructor) and in a logical order (hierarchy and taxonomy constructors) or an error is thrown; if the database is not defined, then anything goes.

In this design /data/ranks_ref.rda would be removed and perhaps replaced with a list of database objects included with the package.

Thoughts?

Also, what is going on with the replication in ranks_ref?

   rankid                                                             ranks
1      05                                                      superkingdom
2      10                   kingdom,kingdom,kingdom,kingdom,kingdom,kingdom
3      20 subkingdom,subkingdom,subkingdom,subkingdom,subkingdom,subkingdom
4      25                                                      infrakingdom
5      30                   phylum,phylum,division,division,phylum,division
6      40 subphylum,subphylum,subdivision,subdivision,subphylum,subdivision
7      45                                                     infradivision
8      50            superclass,superclass,superclass,superclass,superclass
9      60                               class,class,class,class,class,class
10     70             subclass,subclass,subclass,subclass,subclass,subclass ...

Thanks

sckott commented 7 years ago

Makes sense to do validation of ranks when database is specified and be specific to a database. ✔️

For order checking, if the ordering is wrong, do we error with message giving the correct order, or correct the order with message?

what is going on with the replication in ranks_ref

I believe i combined all the diff. rank names from diff. data providers, and just left duplicates in, they can def. be taken out.

Do you think this is important to get done before cran push?

zachary-foster commented 7 years ago

For order checking, if the ordering is wrong, do we error with message giving the correct order, or correct the order with message?

I think correcting the order with a message makes more sense.

Thinking about this more, there might be some valid ranks that do no have an order (e.g. "unranked") and can therefore be put in any order relative to ordered ranks (e.g. "species"). This makes correcting for order and encoding it more difficult. I dont think an ordered factor would work since AFAIK you cant make a ordered factor with some values ordered and other not. Below are the options for encoding the valid ranks in the database class I came up with that handle a mixture of ordered and unordered ranks.

vector

> c("1" = "root", "2" = "domain", "3" = "kingdom", "NA" = "unkranked")
          1           2           3          NA 
     "root"    "domain"   "kingdom" "unkranked"

list

> list("1" = c("root", "unkranked"), "2" = c("domain", "unkranked"), "3" = c("kingdom", "unkranked"))
$`1`
[1] "root"      "unkranked"

$`2`
[1] "domain"    "unkranked"

$`3`
[1] "kingdom"   "unkranked"

data.frame

> data.frame(order = c(1:3, NA), rank = c("root", "domain", "kingdom", "unkranked"))
  order      rank
1     1      root
2     2    domain
3     3   kingdom
4    NA unkranked

Instead of the NA, we could replicate "unranked" for each level as was done for the list above.

two categories

ordered_ranks  <- c("root", "domain", "kingdom")
unordered_ranks <-  c("unkranked")

Do you like any of those in particular or have another solution?

, they can def. be taken out.

Cool

Do you think this is important to get done before cran push?

Its not essential. If it turns out to be an easy change then I think it would be nice, but leaving it out would not pose any backward compatibility issues since we can use database-less ranks for now.

sckott commented 7 years ago

I like dealing with either lists or data.frame's of the options you gave.

I lean towards putting this off until next cran push

zachary-foster commented 7 years ago

I like dealing with either lists or data.frame's of the options you gave.

Cool. I like the data.frame the most since it avoids numbers/NA as names.

I lean towards putting this off until next cran push

fine with me

ropensci / taxa

How are we handling rank validation? #40