ropensci / taxa

taxonomic classes for R
https://docs.ropensci.org/taxa
Other
48 stars 12 forks source link

Vectorizing some classes and reorganizing inheritance hierarchy #197

Open zachary-foster opened 5 years ago

zachary-foster commented 5 years ago

I have been thinking about how to make all of the taxa classes most useful and I think I have come up with a way to reorganize things that will make things more natural and R-like, while increasing flexibility.

Vectorizing the TaxonName, TaxonRank, TaxonId, and TaxonDatabase classes

Right now these store individual values and an associated TaxonDatabase object. However, if they were like other R objects, they would allow multiple values. Currently, their primary use is to store information in Taxon objects, but if they could hold multiple values, then they could be used on their own more usefully. They would also be useful as column in tibbles if this PR gets accepted.

Question: do we allow multiple TaxonDatabase objects in a single TaxonName, TaxonRank, or TaxonId object with multiple values, or do all values have to come from the same database?

Question: What happens when we combine objects from different database with c.TaxonName?

We could allow to hold multiple values and behave like a factor.

Reorganizing inheritance

Until a recent changes in the eval branch, the class inheritance looked like this:

TaxonName
TaxonRank
TaxonId
TaxonDatabase
Taxa
Taxonomy
└─Taxmap
Hierarchy
Hierarchies

I think it would be much more elegant code-wise, to do this:

TaxonName
TaxonRank
TaxonId
TaxonDatabase
Taxa
├─Taxonomy
│ ├─Taxmap
│ └─Hierarchies
└─Hierarchy

Taxonomy has most of the important/complicated methods and Taxa has most of the important getters/setters. In this setup, Hierarchies would be the same as Taxonomy internally, but have different getters/setters and print method to make it appear 1-dimentional. This would be easier for users to understand, but also preserve the performance improvements of Taxonomy. Also things like filter_taxa(taxon_ranks != "family") would work on Hierarchies objects without modification to the code for filter_taxa.

Hierarchies would be particularly useful as a column in a tibble if this PR gets accepted.

Removing/hiding non-vetorized classes

We could remove/hide the Hierarchy and Taxon class in favor of the Taxa and Hierarchies classes, since the plural forms can do everything the singular forms can. We can also rename the plural form to the singular to be more in R style (character is not characters), so the inheritance hierarchy would become:

TaxonName
TaxonRank
TaxonId
TaxonDatabase
Taxon
└─Taxonomy
  ├─Taxmap
  └─Hierarchy

Multiple names/ranks/ids per taxon

@kamapu pointed out that taxa does not support synonyms. The Taxmap is flexible enough that it can encode synonyms, but the other classes cannot. If we allow multiple TaxonName, TaxonRank, and TaxonId objects, in Taxon objects (or each item in a Taxa object), then that would allow for disagreements among multiple sources of taxonomic information (same taxon from multiple databases). We would then need to define methods like ==.Taxon/==.Taxa and as.character.Taxon to account for this, but that is doable.

Whew, that was a lot of stuff. I put it all in one issues since many of the changes are related and dependent on eachother, but if we decide to go this route, then I will split it up into multiple issues and make a milestone.

I would appreciate feedback from anyone who sees this, especially @sckott.

sckott commented 5 years ago

thanks @zachary-foster - it's complex so I'll take a bit to make sure I understand all of this. For my immediate concern about looking at the work on the eval branch, will the changes proposed above conflict with the eval branch work? If so, i guess I'd hold off on incorporating taxa in taxize

zachary-foster commented 5 years ago

it's complex so I'll take a bit to make sure I understand all of this.

Yea, its a big change conceptually, but I think the amount code change needed will not be too big though. The summary is: every class would hold multiple items and all complex classes (Taxonomy, Hierarchy, Taxmap) would inherit Taxon (which is currently Taxa) to reduce code redundancy.

will the changes proposed above conflict with the eval branch work

The changes would build on the eval branch work. For example, in eval, Taxonomy already inherits Taxa, but that had no effect on how the class is used, just reduced code redundancy for the getters/setters. From a user perspective, the only big changes that would be the removal of the Taxon and Hierarchy classes as they are now (Taxa would be renamed Taxon, etc).

If so, i guess I'd hold off on incorporating taxa in taxize

Perhaps, it depends what parts of taxa you would be using most. Do you want to do a conference call to discuss the needs of taxize and so I can explain the changes with more context?

sckott commented 5 years ago

Yes, a chat would be good. i'll email you to schedule that

kamapu commented 5 years ago

Dear people, sorry for absence: I'm just back from Kenya. Since I'm not into R6, I doubt, I can make any relevant contribution on this topic. But maybe discuss about the issue of synonyms:

I think, synonyms have been the mid-way solution to harmonize nomenclatures in vegetation-plot databases, and maybe also in taxonomy. Perhaps you are right, a taxon concept applied according to a taxon view have only one way to be called, namely using its accepted name. Nobody referring to the same taxon concept after the same taxon view will use a synonym for calling the taxon. On the other side, two taxon concepts from different taxon views may be equivalent, independent whether they have the same usage name or different ones.

Thus using a taxon view will not really require to apply synonyms but to establish relations between different taxon concepts that may be non-related at all, partially related or equivalents. On this regard, a taxon concept included in another "bigger" taxon may share the same taxonomic rank but be defined by different taxon views. This is tricky, because the same relation can occur between parent and child. Moreover, one taxon concept may share parts with more than one concept in another taxon view. And so forth.

Anyway, this is how I understand nomenclature and taxonomy following these references:

Now may questions: 1) Is taxa considering taxon views? May (or may not) be the database the taxon view? 2) I know, it sounds crazy but may nomenclatorial relationships (accepted name vs. synonyms) be treated the same as taxonomical relationships (parent vs. children) in the context of biodiversity information?

I was serously thinking to attempt a sort of taxlist2 object for enabling "multiple taxon views" but I'm facing those concerns. Of course, I don't have any urgency for a solution but it seems to be that the topic had been discussed for around 25 years without having a proper answer.

zachary-foster commented 5 years ago

Thanks for the thoughts @kamapu!

I'm just back from Kenya.

Cool! Sounds likes an interesting place to do field work in.

a taxon concept applied according to a taxon view have only one way to be called, namely using its accepted name. Nobody referring to the same taxon concept after the same taxon view will use a synonym for calling the taxon. On the other side, two taxon concepts from different taxon views may be equivalent, independent whether they have the same usage name or different ones.

I think I understand. Synonyms are easy enough to deal with, but homonyms and taxa that overlap in what they include are difficult to encode. After thinking about this a bit, I am leaning toward thinking of a "taxon" in the general sense as an unknown, and unknowable idealized grouping. In order to encode how conflicting taxon concepts compare, we would need some kind of unique ID for the actual individual organisms we are trying to organize, such as an herbarium voucher. Such information is not usually available and would change over time due to evolution of organisms and changes in our understanding of biology. Therefore, rather than trying to defined exactly what the taxon is and how it related to other taxa, I am leaning towards a flexible data model that will give the user the freedom to apply it as they like.

Anyway, this is how I understand nomenclature and taxonomy following these references:

Thanks! I read both these and now I have a better idea of what you mean by taxon view. Its a useful concept.

Is taxa considering taxon views?

I am after reading those papers. I dont know scott's opinion on it. I don't think most users will need that level of detail, but it would be invaluable to a select few and might facilitate some interesting research and tools.

May (or may not) be the database the taxon view?

Its kindof functions as a taxon view, in the sense that it is an authority to reference, however, its pretty specific to online databases at the moment and since these are not the ultimate source of taxonomic authority, the concept is different from a taxon view.

I know, it sounds crazy but may nomenclatorial relationships (accepted name vs. synonyms) be treated the same as taxonomical relationships (parent vs. children) in the context of biodiversity information?

hmm, I am not sure what you mean by that. Like, taxon views that reference and adapt other taxon views moving forward in time can be thought of as a tree of relationships? A taxon view family tree? Similar to the "Operational trees" in Zhong et al. (1996)?

Of course, I don't have any urgency for a solution but it seems to be that the topic had been discussed for around 25 years without having a proper answer.

Haha! Yea, its a tough problem and not really interesting to most people, even though it has big implications for interpreting historical information digitally.

I wrote a bunch of confusing stuff below, mostly for my own notes, so don't feel obliged to read it or respond to it unless you want to. I would of course welcome feedback if you are so inclined : )

Here is the best model I have so far if we really want to try to encode all this confusion:

Below is how such a model might be implemented. However, it is excessively complicated for 95% of what people typically need to do with taxonomic data, so I am not proposing it for taxa at this time. Something to think about perhaps though. I would consider it for taxa if it could appear simple to the user who does not need to think about conflicting taxonomies. All of the objects below could be singular or plural (i.e., "vectorized"). The descriptions below are worded as if they are singular.

A set of interlinked taxon objects would then be a taxonomy. There would be multiple paths from a root to a tip in such a taxonomy, representing conflicting classifications. The tips would represent whatever we are trying to group with the taxonomy. Alternatively, the taxon_link objects could be replaced with an edge list with an extra column of taxon_view, which would be similar to how the current taxonomy class is set up.

kamapu commented 5 years ago

I did not like to insert it here, but there are some brief discussions here.

zachary-foster commented 5 years ago

Thats fine @kamapu, I will take a look. thanks

zachary-foster commented 5 years ago

Hi @sckott, want to take a look at the vectorize branch and see what you think about the new taxon_db, taxon_id, taxon_name, taxon_rank, and taxon vectors? I think they are working for the most part. The tests and vignettes have not been updated, but the man pages have examples that should give you an idea of how they work. The exciting part for me is that they can now be used as columns in tables and work with most base R functions.

sckott commented 5 years ago

@zachary-foster yep, i'll take a look ..

sckott commented 5 years ago

sorry for the delay! just back from vacation.

Trying to install vectorize branch and getting errors related to vctrs - same problem with the dev and cran version of vctrs. Any thoughts?

zachary-foster commented 5 years ago

Hi Scott, no problem, I hope the vacation went well!

It should be working now. There was a function renamed in vctrs since I last installed. I fixed it and now I can install using the github version of vctrs

sckott commented 5 years ago

Looks good for the most part.

is this intended behavior?

x <- taxon(name = c('Homo sapiens', 'Bacillus', 'Ascomycota', 'Ericaceae'),
   rank = c('species', 'genus', 'phylum', 'family'),
   id = taxon_id(c('9606', '1386', '4890', '4345'), db = 'ncbi'),
   auth = c('Linnaeus, 1758', 'Cohn 1872', NA, 'Juss., 1789'),
   info = list(list(n = 1), list(n = 3), list(n = 2), list(n = 9)))
x[taxon_rank(x) > 'family']
#> <taxon[2]>
#> [1] 9606|Homo sapiens Linnaeus, 1758|species 1386|Bacillus Cohn 1872|genus
#> Rank levels: phylum < family < genus < species
#> Databases: ncbi(id)
#> Info keys: n(2)

Looks like the eg returns ranks less than family, not greater than. Or am I interpreting wrong what the example does?

zachary-foster commented 5 years ago

Nice

Its intended, but we could do it the other way if that is more intuitive. The way I set it up, high numbers indicate more fine scale ranks so species is greater than genus. I think I did that so that it looks like a factor when printing while displaying the ranks phylum -> species vs species -> phylum:

#> Rank levels: phylum < family < genus < species

vs

#> Rank levels:  species < genus < family < phylum

But if think the oppisit ordering convertion is more intuitive, we could print it like:

#> Rank levels: phylum > family > genus > species

Then specifying the ranks manually would look like:

taxon_rank(c('A', 'B', 'C', 'D', 'A', 'D', 'D'),
           levels = c(D = NA, A = 30, B = 20, C = 10))

instead of (how it is currently):

taxon_rank(c('A', 'B', 'C', 'D', 'A', 'D', 'D'),
           levels = c(D = NA, A = 10, B = 20, C = 30))

Here are the numbers corresponding to the ranks currently (if not specified by the user):

> taxa:::rank_ref
          domain     superkingdom          kingdom       subkingdom      superphylum     infrakingdom 
              10               20               30               40               50               50 
          phylum         division        subphylum      subdivision    infradivision       superclass 
              60               60               70               70               80               90 
           class         subclass       infraclass       megacohort      supercohort           cohort 
             100              110              120              130              140              150 
       subcohort      infracohort       superorder            order         suborder       infraorder 
             160              170              180              190              200              210 
       parvorder      superfamily           family        subfamily            tribe         subtribe 
             220              230              240              250              260              270 
           genus         subgenus          section       subsection    species group species subgroup 
             280              290              300              310              320              330 
         species     infraspecies       subspecies          variety         varietas       subvariety 
             340              350              360              370              370              380 
            race            stirp             form            forma            morph       aberration 
             380              390              400              400              400              410 
         subform      unspecified          no rank            clade 
             420               NA               NA               NA 

We could make the number decrease going from phylum to species if that is more intuitive

sckott commented 5 years ago

Ah okay, right we also have the same ordering in the rank_ref dataset in taxize. That makes sense then

zachary-foster commented 5 years ago

Ok, cool. That list is mostly derived from taxize::rank_ref, mostly just reformatted

zachary-foster commented 5 years ago

What do you think of adding support for synonyms to the taxon class? It could be a list of taxon_name vectors. There could either be a separate list for synonyms or we could replace the taxon_name vector in the taxon class with a list of taxon_name vectors and the first in the list could be the "accepted" name.

Also should the authorities be associated with the taxon or taxon_name class? If there were synonyms, would it be better to move it to the taxon_name class so that each synonym could have a different authority?

It would be nice to modify %in% and == so that:

'Bacillus' %in% x

would return TRUE when any synonym is matched.

sckott commented 5 years ago

makes sense to add synonyms. isn't it possible that a user would creat a taxon class that is a name or set of names that are not accepted names?

I think each synonym should be able to have a different authority for sure. - they're just like any other name i guess, with their own authority, id, etc.

I can't quite visualize how this will work since taxon class can have an arbitrary set of taxa in it, so where would you organize a set of names that are related (in this case synonyms of one another)

zachary-foster commented 5 years ago

isn't it possible that a user would create a taxon class that is a name or set of names that are not accepted names?

Yea, I guess i was thinking more about how they were stored in the object. So this:

name = c('A', 'B', 'C')
synonym = list(c('D', 'E'), character(0), character(0))

vs

name = list(c('A', 'D', 'E'), 'B', 'C')

After thinking about it, either would work. When I said "accepted" I was thinking the one that would appear in the print out and the output of as.character and such. So c('A', 'B', 'C') in the above example.

I think each synonym should be able to have a different authority for sure. - they're just like any other name i guess, with their own authority, id, etc.

Hmm, thats a good point. I had thought each should have their own authority, but yea, they could also have different ranks, which would make things more complicated. Do synonyms ever have different IDs in databases? If so, then synonym would have to be defined using taxon objects rather than taxon_name.

I can't quite visualize how this will work since taxon class can have an arbitrary set of taxa in it, so where would you organize a set of names that are related (in this case synonyms of one another)

I was thinking of just having multiple names by adding a list of taxon_name vectors of equal length to the number of taxa. So for each taxon, in addition to its name, there would be a vector of synonym names. but if each synonym can have a rank/id/authority, then that wont work.

It seems like there are two different things that we are trying to encode here:

  1. A taxon name/id/rank that appears in a database or dataset. This just a piece of data, separate from any particular taxonomic concept. It can have an ID, authority, and rank. This is what is what is usually used in analyses where taxonomy info is not the focus of the research, but just another way to annotate data
  2. The idea of a particular taxon, which might have synonyms and appear in multiple databases in different ways. Each synonym might have its own rank, authority, id, and database.This is the thing people usually mean when they talk about a particular taxon.

What do you think about these set of changes to support these two concepts:

kamapu commented 5 years ago

but if each synonym can have a rank/id/authority

Perhaps I'm not that in into technical properties of taxa to deliver a proper opinion, but I'm wondering why a synonym or name have to be associated to a taxonomic rank. In taxlist, the taxonomic rank (level) is a property of the taxon concept (slot taxonRelations) and not of the taxon usage names (slot taxonNames).

In my opinion, there are two solutions to handle taxonomic lists, enabling their connection to observations and/or specimens:

  1. A single opinion/taxon concept approach as used in the taxlist. This is the most pragmatical approach but less flexible one. We assume, there is always one, most accepted opinion for the circumscription of a taxon. Further opinions will be then considered by using synonyms. The big and many times hidden problem on this approach is that the reasons how a name becomes synonym are diverse and in some cases two taxa may share a synonym.
  2. A multiple taxon concept approach were every opinion have the same weight. In this case, I assume synonyms does not make sense because every name will be associated to one taxon concept according to a taxon view. Do not forget that the same name (same name combination with same authority) may be associated to different taxon concepts by different taxon views. Anyway, in this case the taxon concepts will be related to each other in terms of shared space between circumscriptions.

Therefore:

zachary-foster commented 5 years ago

Thanks for the input @kamapu!

I'm wondering why a synonym or name have to be associated to a taxonomic rank.

I think it makes sense that a synonym can have a different rank, although I have not thought about it much before. You see it often below/at the species rank:

taxize::synonyms('homo sapiens', db = 'col')[[1]][1:10, 1:3]
#> ══  1 queries  ═══════════════
#> 
#> Retrieving data for taxon 'homo sapiens'
#> ✔  Found:  homo sapiens
#> ══  Results  ═════════════════
#> 
#> ● Total: 1 
#> ● Found: 1 
#> ● Not Found: 0
#>                                  id                            name
#> 1  27ef25a84b47c43bdf06e027d05e1e06                Homo aethiopicus
#> 2  475b397e5b049d4d6510e289036b7dcb                 Homo americanus
#> 3  c209900e7e04b5a67b8e3246e5215c66                   Homo arabicus
#> 4  5d326b11b8bb53e1394c8c40cef4aa97              Homo australasicus
#> 5  a1723fb1a1b2d6d3e09f1ec6bae7fe9b                      Homo cafer
#> 6  1d92d3b9a7c4c85333d330a5d404e401                   Homo capensis
#> 7  895d6bd6e6ef0e5bf2269f8df8cf05cb                 Homo columbicus
#> 8  2cb9d2a4281f3d88733d0e79940f01c3                   Homo drennani
#> 9  00d994548f2f2e12980d4843d55d74e0 Homo fossilis proto-aethiopicus
#> 10 f85224c601bb5d16719b1c909d198531  Homo fossilis protoaethiopicus
#>            rank
#> 1       species
#> 2       species
#> 3       species
#> 4       species
#> 5       species
#> 6       species
#> 7       species
#> 8       species
#> 9  infraspecies
#> 10 infraspecies

Created on 2019-07-22 by the reprex package (v0.3.0)

I dont think there is a problem with synonyms having ranks, as long as the all the taxon ranks in a taxon concept are more fine-scale than all the ranks in the parent taxon concept.

A single opinion/taxon concept approach

Since much of the information that will be used with taxa will come for databases with diverse assumptions, I agree this approach might not fit all types of data well.

A multiple taxon concept approach

This is what I am going for, but there will be a kind of "first among equals" behavior when functions need a single taxon name (e.g. labeling a taxonomic tree). I was thinking the first synonym is the "representative" taxon name, but does not have more weight than the others in most operations.

I assume synonyms does not make sense because every name will be associated to one taxon concept according to a taxon view.

The way I was thinking of it, there could still be multiple synonyms in a single taxon concept. Basically, if they have the same child taxa, they are synonyms. In the way I would encode it, F. ovina L. in Figure 1 would all be different taxon concepts (homonyms) since they have different child taxa. The two F. ovina agg. would also be different taxon concepts, but they would be the same concept (synonyms) if the placement of F. guestfalica did not differ between them.

Anyway, in this case the taxon concepts will be related to each other in terms of shared space between circumscriptions.

Is this the same as saying "if they have the same child taxa, they are synonyms"?

taxonomic rank almost irrelevant for defining relationships between taxon concepts.

Yea, I agree, its just an extra piece of information. taxa will not require taxon ranks to work. They are there to make subsetting easier for the most part.

For the second approach, we can use Fig. 1 in Jansen & Dengler (2010). There the circumscription of Festuca ovina L. sec. Jäger & Werner (2005) is equivalent to the circumscription of Festuca ovina ssp. vulgaris var. vulgaris sec. Ascherson (1864). Thus, the taxonomic relationship between the second taxon concept and F. ovina L. sec. Ascherson (1864) due to taxonomic ranks is the same relationship as with the first taxon concept due to different views.

I am not sure I understand this. "Festuca ovina L. sec. Jäger & Werner (2005)" = "Festuca ovina ssp. vulgaris var. vulgaris sec. Ascherson (1864)"?

I know, the second approach is the most realistic and far more flexible one but it will make taxonomic data handle a real inferno.

It will get more complex, since taxon concepts can then have multiple parents. I think its doable. The hard part will be to make it easy to understand for the user when there are no synonyms, since most users will not need to consider synonyms. Basically, I want new users to be unaware synonyms are even supported until they need that functionality.

sckott commented 4 years ago

@zachary-foster looking at this in taxize:

get_ fxns

I'm not sure the vectorize branch as is will work. I can't feasibly attach attributes on the output of get_ fxns and not lose them. e.g. I can just add attributes on a taxon_name object, whic works. It's great that taxon_name objects can be combined with c(), but then the attributes are lost. That was the case before too, but trying to avoid attributes being lost now. And I'd need to create a new version of as_tibble.taxa_taxon_name to be able to add attributes to the output data.frame, which doesn't seem like a great idea to override the taxa pkg method. I know you said you didn't want to deal with attributes though, so I'm not sure where that leaves us?

I might play around with creating a class on top of taxa::taxon_name that will be able to hold attributes, be coerced to a data.frame, and be passed through to other fxns easily

taxa::classification()

taxa::tax_name()/taxa::tax_rank()

Same concern about classifcation applies here wrt overriding taxize fxn names if taxa is loaded after taxize.

zachary-foster commented 4 years ago

Hi @sckott,

That was the case before too, but trying to avoid attributes being lost now.

The vctrs package should allow this. I must have overridden some method that breaks this for c.

x <- vctrs::new_vctr(1:10, my_attr = 'result', class = 'test_class')
y <- vctrs::new_vctr(1:10, my_attr = 'different result', class = 'test_class')
attributes(x)
#> $my_attr
#> [1] "result"
#> 
#> $class
#> [1] "test_class" "vctrs_vctr"
attributes(x[1])
#> $my_attr
#> [1] "result"
#> 
#> $class
#> [1] "test_class" "vctrs_vctr"
attributes(c(x, x))
#> $my_attr
#> [1] "result"
#> 
#> $class
#> [1] "test_class" "vctrs_vctr"
attributes(c(x, y))
#> No common type for `..1` <test_class> and `..2` <test_class>.

Created on 2019-10-24 by the reprex package (v0.3.0)

Note how this does not work if the attributes have different values. For example, I had to overwrite c.taxa_taxon_name to make the taxon ranks be combined correctly, since they are stored in an attribute.

If we can come up with a general rule for combining attributes with different values, I can probably make it work. I can add a ... to define attributes in the taxa constructors using the vctrs functionality and add a c method to preserve the attributes how we choose.

I know you said you didn't want to deal with attributes though, so I'm not sure where that leaves us?

I was talking about the per-value attributes (e.g. some info on each taxon). I had not considered per-vector/object attributes before. I am fine with adding either, I was just saying that per-value attributes might be better handled by columns in a table containing a taxa vector.

I might play around with creating a class on top of taxa::taxon_name that will be able to hold attributes

That was the strategy I was going to suggest when a specific attribute needs a specific way of being combined. Either way, making the taxa vector preserve attributes as much as possible would be a good idea.

One problem here is that this fxn is the same name as taxize has.

Good point, I should have noticed that. Perhaps we should change its name back to hierarchy.

what are instances and are they needed?

The instances are the index of a taxon in the taxonomy. In the classification print out, they are the most fine-scale taxa (i.e. the leafs).

how would you propose to construct a taxa::classification object from the current objects we return for ...

We would use the algorithm from parse_tax_data in the current taxa. I haven't adapted it yet for classification, but I imagine some alternative constructor built on it that takes a list of anything that can be coreiced into a list of taxon vectors, where each vector is the classification for a single taxon.

library(taxa)
#> 
#> Attaching package: 'taxa'
#> The following object is masked from 'package:base':
#> 
#>     %in%
x <- taxize::classification(c(129313, 129310), db = 'itis')
x
#> $`129313`
#>                   name         rank     id
#> 1             Animalia      kingdom 202423
#> 2            Bilateria   subkingdom 914154
#> 3          Protostomia infrakingdom 914155
#> 4            Ecdysozoa  superphylum 914158
#> 5           Arthropoda       phylum  82696
#> 6             Hexapoda    subphylum 563886
#> 7              Insecta        class  99208
#> 8            Pterygota     subclass 100500
#> 9             Neoptera   infraclass 563890
#> 10        Holometabola   superorder 914213
#> 11             Diptera        order 118831
#> 12          Nematocera     suborder 118832
#> 13        Culicomorpha   infraorder 125808
#> 14        Chironomidae       family 127917
#> 15        Chironominae    subfamily 129228
#> 16         Chironomini        tribe 129229
#> 17          Chironomus        genus 129254
#> 18 Chironomus riparius      species 129313
#> 
#> $`129310`
#>                name         rank     id
#> 1          Animalia      kingdom 202423
#> 2         Bilateria   subkingdom 914154
#> 3       Protostomia infrakingdom 914155
#> 4         Ecdysozoa  superphylum 914158
#> 5        Arthropoda       phylum  82696
#> 6          Hexapoda    subphylum 563886
#> 7           Insecta        class  99208
#> 8         Pterygota     subclass 100500
#> 9          Neoptera   infraclass 563890
#> 10     Holometabola   superorder 914213
#> 11          Diptera        order 118831
#> 12       Nematocera     suborder 118832
#> 13     Culicomorpha   infraorder 125808
#> 14     Chironomidae       family 127917
#> 15     Chironominae    subfamily 129228
#> 16      Chironomini        tribe 129229
#> 17       Chironomus        genus 129254
#> 18 Chironomus prior      species 129310
#> 
#> attr(,"class")
#> [1] "classification"
#> attr(,"db")
#> [1] "itis"
y <- lapply(x, function(y) taxon_name(name = y$name, rank = y$rank, id = y$id))
y
#> $`129313`
#> <taxon_name[18]>
#>  [1] 202423|Animalia|kingdom            914154|Bilateria|subkingdom       
#>  [3] 914155|Protostomia|infrakingdom    914158|Ecdysozoa|superphylum      
#>  [5] 82696|Arthropoda|phylum            563886|Hexapoda|subphylum         
#>  [7] 99208|Insecta|class                100500|Pterygota|subclass         
#>  [9] 563890|Neoptera|infraclass         914213|Holometabola|superorder    
#> [11] 118831|Diptera|order               118832|Nematocera|suborder        
#> [13] 125808|Culicomorpha|infraorder     127917|Chironomidae|family        
#> [15] 129228|Chironominae|subfamily      129229|Chironomini|tribe          
#> [17] 129254|Chironomus|genus            129313|Chironomus riparius|species
#> Rank levels: kingdom < subkingdom < infrakingdom = superphylum < phylum < subphylum < class < subclass < infraclass < superorder < order < suborder < infraorder < family < subfamily < tribe < genus < species
#> 
#> $`129310`
#> <taxon_name[18]>
#>  [1] 202423|Animalia|kingdom         914154|Bilateria|subkingdom    
#>  [3] 914155|Protostomia|infrakingdom 914158|Ecdysozoa|superphylum   
#>  [5] 82696|Arthropoda|phylum         563886|Hexapoda|subphylum      
#>  [7] 99208|Insecta|class             100500|Pterygota|subclass      
#>  [9] 563890|Neoptera|infraclass      914213|Holometabola|superorder 
#> [11] 118831|Diptera|order            118832|Nematocera|suborder     
#> [13] 125808|Culicomorpha|infraorder  127917|Chironomidae|family     
#> [15] 129228|Chironominae|subfamily   129229|Chironomini|tribe       
#> [17] 129254|Chironomus|genus         129310|Chironomus prior|species
#> Rank levels: kingdom < subkingdom < infrakingdom = superphylum < phylum < subphylum < class < subclass < infraclass < superorder < order < suborder < infraorder < family < subfamily < tribe < genus < species
my_classification <- alternate_constructor(y)
#> Error in alternate_constructor(y): could not find function "alternate_constructor"

Created on 2019-10-24 by the reprex package (v0.3.0)

Same concern about classifcation applies here wrt overriding taxize fxn names

Yea, will have to do something about that. Either change the prefix for all the taxa getters/setters (perhaps tx_*) or change taxize functions to lookup_name and lookup_rank or something.

zachary-foster commented 4 years ago

Just looked at the get_ fxns again and I noticed that all the attributes are per-taxon values. I can see three ways of dealing with this, from easiest to hardest:

1) What about returning tables instead? Like

library(taxa)
#> 
#> Attaching package: 'taxa'
#> The following object is masked from 'package:base':
#> 
#>     %in%
library(taxize)
#> 
#> Attaching package: 'taxize'
#> The following objects are masked from 'package:taxa':
#> 
#>     classification, tax_name, tax_rank
library(tibble)
x = get_tsn(c("Chironomus riparius","Quercus douglasii"))
#> ══  2 queries  ═══════════════
#> 
#> Retrieving data for taxon 'Chironomus riparius'
#> ✔  Found:  Chironomus riparius
#> 
#> Retrieving data for taxon 'Quercus douglasii'
#> ✔  Found:  Quercus douglasii
#> ══  Results  ═════════════════
#> 
#> ● Total: 2 
#> ● Found: 2 
#> ● Not Found: 0
tibble(id = taxon_id(as.character(x), db = 'itis'), !!! attributes(x))
#> # A tibble: 2 x 6
#>   id            class match multiple_matches pattern_match uri             
#>   <tax_id>      <chr> <chr> <lgl>            <lgl>         <chr>           
#> 1 129313 (itis) tsn   found FALSE            FALSE         https://www.iti…
#> 2 19322 (itis)  tsn   found FALSE            FALSE         https://www.iti…

Created on 2019-10-24 by the reprex package (v0.3.0)

2) build a vctrs record class that with the same fields as taxon_id, plus some extra ones. You would have to make methods for some of the coercion functions and perhaps others.

library(taxa)
#> 
#> Attaching package: 'taxa'
#> The following object is masked from 'package:base':
#> 
#>     %in%
library(taxize)
#> 
#> Attaching package: 'taxize'
#> The following objects are masked from 'package:taxa':
#> 
#>     classification, tax_name, tax_rank
library(tibble)
library(vctrs)

# Constructors
new_my_class <- function(id = character(), db = taxon_db(), match = match, .names = NULL) {
  vctrs::new_rcrd(list(.names = .names, id = id, db = db, match = match), 
                  .names_set = FALSE, # needed for taxon_id methods to work
                  class = c("taxize_my_class", 'taxa_taxon_id'))
}

my_class <- function(id = character(), db = NA, match = NA, .names = NULL) {
  if (is.null(.names)) {
    .names <- NA_character_
  }
  .names <- vctrs::vec_cast(.names, character())
  id <- vctrs::vec_cast(id, character())
  db <- vctrs::vec_cast(db, taxon_db())
  match <- vctrs::vec_cast(match, character())
  c(id, db, match, .names) %<-% vctrs::vec_recycle_common(id, db, match, .names)
  new_my_class(.names = .names, id = id, db = db, match = match)
}

# S3 coercion functions
vec_ptype2.taxize_my_class <- function(x, y, ...) UseMethod("vec_ptype2.taxize_my_class", y)
vec_ptype2.taxize_my_class.default <- function(x, y, ..., x_arg = "", y_arg = "") {
  vctrs::stop_incompatible_type(x, y, x_arg = x_arg, y_arg = y_arg)
}
vec_ptype2.taxize_my_class.vctrs_unspecified <- function(x, y, ...) x
vec_ptype2.taxize_my_class.taxize_my_class <- function(x, y, ...) my_class()

x = get_tsn(c("Chironomus riparius","Quercus douglasii"))
#> ══  2 queries  ═══════════════
#> 
#> Retrieving data for taxon 'Chironomus riparius'
#> ✔  Found:  Chironomus riparius
#> 
#> Retrieving data for taxon 'Quercus douglasii'
#> ✔  Found:  Quercus douglasii
#> ══  Results  ═════════════════
#> 
#> ● Total: 2 
#> ● Found: 2 
#> ● Not Found: 0
x
#> [1] "129313" "19322" 
#> attr(,"class")
#> [1] "tsn"
#> attr(,"match")
#> [1] "found" "found"
#> attr(,"multiple_matches")
#> [1] FALSE FALSE
#> attr(,"pattern_match")
#> [1] FALSE FALSE
#> attr(,"uri")
#> [1] "https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=129313"
#> [2] "https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=19322"
y = my_class(as.character(x), db = 'itis', match = attr(x, 'match'))
y
#> <taxon_id[2]>
#> [1] 129313 (itis) 19322 (itis)
class(y)
#> [1] "taxize_my_class" "taxa_taxon_id"   "vctrs_rcrd"      "vctrs_vctr"
is_taxon_id(y)
#> [1] TRUE
vctrs::fields(y)
#> [1] ".names" "id"     "db"     "match"
vctrs::field(y, 'match')
#> [1] "found" "found"
vctrs::field(y[1], 'match')
#> [1] "found"
vctrs::field(c(y, y), 'match')
#> [1] "found" "found" "found" "found"

Created on 2019-10-24 by the reprex package (v0.3.0)

3) Allow arbitrary per-value attributes as a list field in taxa classes. I did experiment with have per-taxon attributes stored in a named list for each taxon, but eventually thought just including that info in table would be more straight-forward. It also might add a lot of overhead; i am not sure. We can try that rout again though

sckott commented 4 years ago

Sorry about the long delay on this @zachary-foster - had a bunch of new versions of other pkgs to get out.

Thanks for the good brainstorming. I'll play around with these and get back to you. My gut reaction is that I don't want to return tables from get_* fxns. It just doesn't seem like the right data structure, but maybe after trying it i'll change my mind.

zachary-foster commented 4 years ago

No problem, I have been busy with traveling anyway.

I played around with this some more, and I think option 2 would probably be the most similar to how taxize already works. That way you could have per-value attributes (as fields in rcrd-style vectors) that are preserved when subsetting and combining as well as per-vector attributes that are preserved when subsetting (vctrs does both automatically). If you want users to see the attributes in the print output, you would need to add print methods. Either way, users could see and manipulate them with vctrs::field.

sckott commented 4 years ago

sounds good - i'll have a look at option 2

sckott commented 4 years ago

Sorry about another long delay on this @zachary-foster - Are you still planning to move to using vctrs?

zachary-foster commented 4 years ago

Sorry for the delayed response. I have been on vacation after defending my PhD and was not checking my email much. I am back to work now.

Yea, I am still trying to get vctrs to work. It was working nicely for the most part, although If I remember correctly a few obscure bugs and missing features in vctrs was causing some issues. Those might be fixed by now

zachary-foster commented 4 years ago

Hi @sckott, I have made some good progress of the vectorize branch. If you have the time, I would like to go over it with you to get your opinions before I continue. Here is a summary of what I did since we last discussed it:

All the classes besides classification are working as far as I can tell. taxonomy needs more testing, since it is more complicated, but seems to work. Once we decide on how taxonomy should work ideally, I can work on finishing classification and adding taxmap, which would be the last class needed.

sckott commented 4 years ago

Sounds good - did you want to do a call?

zachary-foster commented 4 years ago

Yea, that would be good. Does zoom work for you?

sckott commented 4 years ago

Yeah, what day/time?

sckott commented 4 years ago

@zachary-foster Tried using taxa::taxon in taxize. Works nicely so thats cool. But main problem is the attributes aren't subsetted along with id/name/rank/auth (and when coerced to data.frame the attributes are lost; would be nice to have the attributes I add go in to the data.frame, especially the URI for the taxon). Maybe instead of changing taxa to meet taxize's needs, I could extend taxon in taxize?

zachary-foster commented 4 years ago

Yea, I think you could extend taxon in taxize. I did some quick tests, and it seems possible, although you might have to add a few custom methods as problems arise. Here is a proof of concept:


new_taxize_taxon <- function(.names = NULL, name = character(), rank = taxon_rank(), id = taxon_id(), auth = taxon_authority(), new_col = character(), ...) {

  # Set names to NA if not set
  if (is.null(names) || all(is.na(.names))) {
    .names_set <- FALSE
    .names <- vctrs::vec_recycle(NA_character_, length(name))
  } else {
    .names_set <- TRUE
    vctrs::vec_assert(.names, ptype = character())
  }

  # Check that values are the correct type
  vctrs::vec_assert(name, ptype = character())
  # vctrs::vec_assert(rank, ptype = taxon_rank())
  vctrs::vec_assert(id, ptype = taxon_id())
  vctrs::vec_assert(auth, ptype = taxon_authority())
  vctrs::vec_assert(new_col, ptype = character())

  # Create new object
  vctrs::new_rcrd(list(.names = .names, name = name, rank = rank, id = id, auth = auth, new_col = new_col),
                  .names_set = .names_set,
                  ...,
                  class = c("taxize_test", "taxa_taxon"))
}

taxize_taxon <- function(name = character(0), rank = NA, id = NA, auth = NA, .names = NA, new_col = NA, ...) {
  # Cast inputs to correct values
  name <- vctrs::vec_cast(name, character())
  rank <- vctrs::vec_cast(rank, taxon_rank())
  id <- vctrs::vec_cast(id, taxon_id())
  auth <- vctrs::vec_cast(auth, taxon_authority())
  new_col <- vctrs::vec_cast(new_col, character())
  .names <- vctrs::vec_cast(.names, character())

  # Recycle ranks and databases to common length
  recycled <- vctrs::vec_recycle_common(name, rank, id, auth, new_col, .names)
  name <- recycled[[1]]
  rank <- recycled[[2]]
  id <- recycled[[3]]
  auth <- recycled[[4]]
  new_col <- recycled[[5]]
  .names <- recycled[[6]]

  # Create taxon object
  new_taxize_taxon(.names = .names, name = name, rank = rank, id = id, auth = auth,  new_col = new_col, ...)
}

vec_cast.taxize_test <- function(x, to, ..., x_arg, to_arg) UseMethod("vec_cast.taxize_test")

vec_cast.taxize_test.default <- function(x, to, ..., x_arg, to_arg) vctrs::vec_default_cast(x, to, x_arg, to_arg)

vec_cast.taxize_test.taxize_test <- function(x, to, ..., x_arg, to_arg) x

vec_cast.taxize_test.character <- function(x, to, ..., x_arg, to_arg) taxon(x)

vec_cast.character.taxize_test <- function(x, to, ..., x_arg, to_arg) as.character(vctrs::field(x, "name"))
x <- taxize_test(name = c('Homo sapiens', 'Bacillus', 'Ascomycota', 'Ericaceae'),
                 rank = c('species', 'genus', 'phylum', 'family'),
                 id = taxon_id(c('9606', '1386', '4890', '4345'), db = 'ncbi'),
                 auth = c('Linnaeus, 1758', 'Cohn 1872', NA, 'Juss., 1789'),
                 new_col = c('A', 'B', 'C', 'D'))
names(x) <- c('a', 'b', 'c', 'd')
vctrs::field(x[2:3], 'new_col')
sckott commented 4 years ago

thanks @zachary-foster - i;ll try that

sckott commented 4 years ago

I've used your example code above and built on it. work on this branch https://github.com/ropensci/taxize/compare/taxa-work Some thoughts:

zachary-foster commented 4 years ago

right now, the db checker in taxon_id() doesn't account for when there's no ID.

Good find. I fixed it so now NA is a valid ID regardless of database.

but got an error using the setter, opened an issue #207

That should be fixed now

i tried settings names via .names but it's a bit tricky done programmatically b/c there can be any number of taxon ids not found, so i can document for users how to set names themselves if they want names on the output

I am not sure I understand. Even if a taxon ID is not found is NA returned? Is the input length/order and output length/order different if there are unknown taxon IDs?

sckott commented 4 years ago

thanks for the fixes.

for the names applied via .names - yes, NA is returned when there is no ID found in the taxize get_* functions. But if there's NA's I think it may not make sense to give the output taxon object vector names since you can't subset a vector AFAIK with NA's as the indices, and if there's more than one NA, then how do you index to the different NA's. So I think for taxize I lean towards not supplying names to the output vectors, but show the user how to do it if they want to

There seems to be an error in the print method when there is an NA, e.g., (or maybe that is correct, expected behavior)

x <- taxon(c('A', 'B', 'C'), .names = c('d', 'e', NA_character_))
x
#> <taxon[3]>
#> Error: If any elements are named, all elements must be named.
rlang::last_error()
<error/rlang_error>
If any elements are named, all elements must be named.
Backtrace:
  1. (function (x, ...) ...
  2. vctrs:::print.vctrs_vctr(x)
  3. vctrs::obj_print(x, ...)
  5. taxa:::obj_print_data.taxa_taxon(x, ...)
  6. taxa:::printed_taxon(x, color = TRUE)
  7. taxa:::font_tax_name(x)
 12. taxa:::tax_rank.taxa_taxon(text)
 13. taxa:::named_field(x, "rank")
 15. vctrs:::`names<-.vctrs_vctr`(`*tmp*`, value = names(x))
Run `rlang::last_trace()` to see the full context.
zachary-foster commented 4 years ago

How about naming the output by the input, assuming the input is one-to-one with the output and is a character vector?

That error seems to be due to vctrs not allowing vctr objects to be named with NA. It was actually taxon_rank causing the problem. taxon_rank is a vcrt object whereas most the others are rcrd, for which I have custom naming code that allows NA. From what I gather reading issues on the vctrs github, the vctrs developers seem undecided about allowing NA or "" to be valid names

I fixed the issue for taxon in the mean time

sckott commented 4 years ago

Thanks for the fix.

Could name by the inputs, might try that

sckott commented 4 years ago

@zachary-foster do you need taxize anymore in taxa? its' in Imports, but I don't see any use of taxize in taxa vectorize branch. I need to have taxa in Imports in taxize now - and we can't have each others pkgs both in Imports. thoughts?

sckott commented 4 years ago

weird, even after removing taxize from Imports in taxa, i keep getting a circular dependency error when i run check on taxize (with taxa in Imports) - i can't find taxize in taxa anywhere, not sure what's going on. it's clearly taxa since if I remove taxa from Imports, then check runs fine

zachary-foster commented 4 years ago

Yea, we should remove the taxize dependency. I will try to remove the dependency and see if I get the same problem.

sckott commented 4 years ago

thanks for having a look - relevant taxize branch i'm working on is taxa-work

zachary-foster commented 4 years ago

I got the same problem. It looks like R CMD check is looking up the dependencies from CRAN oddly enough. I tried the check with my internet turned off and got this error:

─  checking package dependencies ...Warning: unable to access index for repository https://CRAN.R-project.org/src/contrib:
     cannot open URL 'https://CRAN.R-project.org/src/contrib/PACKAGES'
   Warning: unable to access index for repository https://bioconductor.org/packages/3.11/bioc/src/contrib:
     cannot open URL 'https://bioconductor.org/packages/3.11/bioc/src/contrib/PACKAGES'
   Warning: unable to access index for repository https://bioconductor.org/packages/3.11/data/annotation/src/contrib:
     cannot open URL 'https://bioconductor.org/packages/3.11/data/annotation/src/contrib/PACKAGES'
   Warning: unable to access index for repository https://bioconductor.org/packages/3.11/data/experiment/src/contrib:
     cannot open URL 'https://bioconductor.org/packages/3.11/data/experiment/src/contrib/PACKAGES'
   Warning in url(sprintf("%s/%s", cran, path), open = "rb") :
     URL 'https://CRAN.R-project.org/web/packages/packages.rds': status was 'Couldn't resolve host name'
   Error in url(sprintf("%s/%s", cran, path), open = "rb") : 
     cannot open the connection to 'https://CRAN.R-project.org/web/packages/packages.rds'
   Execution halted

So I guess its only checking the version on CRAN?

I saw you added:

Remotes: ropensci/taxa@vectorize-no-taxize

So Im not sure. Could it be a bug in devtools or R CMD check?

sckott commented 4 years ago

Weird, good catch. I am using R CMD CHECK on the command line so there's no devtools or remotes pkgs involved AFAIK.

Tried it with internet off too and now it doesn't throw that circular dependency error - internet back on and it does.

sckott commented 4 years ago

vectorize-no-taxize just removes taxize from imports in taxa. and it doesn't seem to help compared to the vectorize branch

zachary-foster commented 4 years ago

Ok, so it must be relying on the CRAN version to check dependencies. Thats annoying, but I suppose the problem will go away once taxa on CRAN is updated.

Another option would be to rename taxa to taxa2 or something. I was wondering whether that would be a good idea to do anyway, so that after taxa is updated, published code that uses taxa (mostly papers using metacoder) will still work. About 100 papers have cited metacoder; I am not sure how many of those have published code though.

sckott commented 4 years ago

The changes in taxa are quite large, so it might warrant a new name, or at least a major version bump if the same name. I'll leave it up to you. I guess folks using the current taxa or older versions would just need to avoid updating to this new version if you keep the same name.