Open zachary-foster opened 5 years ago
thanks @zachary-foster - it's complex so I'll take a bit to make sure I understand all of this. For my immediate concern about looking at the work on the eval
branch, will the changes proposed above conflict with the eval branch work? If so, i guess I'd hold off on incorporating taxa in taxize
it's complex so I'll take a bit to make sure I understand all of this.
Yea, its a big change conceptually, but I think the amount code change needed will not be too big though. The summary is: every class would hold multiple items and all complex classes (Taxonomy
, Hierarchy
, Taxmap
) would inherit Taxon
(which is currently Taxa
) to reduce code redundancy.
will the changes proposed above conflict with the eval branch work
The changes would build on the eval
branch work. For example, in eval
, Taxonomy
already inherits Taxa
, but that had no effect on how the class is used, just reduced code redundancy for the getters/setters. From a user perspective, the only big changes that would be the removal of the Taxon
and Hierarchy
classes as they are now (Taxa
would be renamed Taxon
, etc).
If so, i guess I'd hold off on incorporating taxa in taxize
Perhaps, it depends what parts of taxa you would be using most. Do you want to do a conference call to discuss the needs of taxize and so I can explain the changes with more context?
Yes, a chat would be good. i'll email you to schedule that
Dear people, sorry for absence: I'm just back from Kenya. Since I'm not into R6, I doubt, I can make any relevant contribution on this topic. But maybe discuss about the issue of synonyms:
I think, synonyms have been the mid-way solution to harmonize nomenclatures in vegetation-plot databases, and maybe also in taxonomy. Perhaps you are right, a taxon concept applied according to a taxon view have only one way to be called, namely using its accepted name. Nobody referring to the same taxon concept after the same taxon view will use a synonym for calling the taxon. On the other side, two taxon concepts from different taxon views may be equivalent, independent whether they have the same usage name or different ones.
Thus using a taxon view will not really require to apply synonyms but to establish relations between different taxon concepts that may be non-related at all, partially related or equivalents. On this regard, a taxon concept included in another "bigger" taxon may share the same taxonomic rank but be defined by different taxon views. This is tricky, because the same relation can occur between parent and child. Moreover, one taxon concept may share parts with more than one concept in another taxon view. And so forth.
Anyway, this is how I understand nomenclature and taxonomy following these references:
Now may questions:
1) Is taxa
considering taxon views? May (or may not) be the database the taxon view?
2) I know, it sounds crazy but may nomenclatorial relationships (accepted name vs. synonyms) be treated the same as taxonomical relationships (parent vs. children) in the context of biodiversity information?
I was serously thinking to attempt a sort of taxlist2
object for enabling "multiple taxon views" but I'm facing those concerns.
Of course, I don't have any urgency for a solution but it seems to be that the topic had been discussed for around 25 years without having a proper answer.
Thanks for the thoughts @kamapu!
I'm just back from Kenya.
Cool! Sounds likes an interesting place to do field work in.
a taxon concept applied according to a taxon view have only one way to be called, namely using its accepted name. Nobody referring to the same taxon concept after the same taxon view will use a synonym for calling the taxon. On the other side, two taxon concepts from different taxon views may be equivalent, independent whether they have the same usage name or different ones.
I think I understand. Synonyms are easy enough to deal with, but homonyms and taxa that overlap in what they include are difficult to encode. After thinking about this a bit, I am leaning toward thinking of a "taxon" in the general sense as an unknown, and unknowable idealized grouping. In order to encode how conflicting taxon concepts compare, we would need some kind of unique ID for the actual individual organisms we are trying to organize, such as an herbarium voucher. Such information is not usually available and would change over time due to evolution of organisms and changes in our understanding of biology. Therefore, rather than trying to defined exactly what the taxon is and how it related to other taxa, I am leaning towards a flexible data model that will give the user the freedom to apply it as they like.
Anyway, this is how I understand nomenclature and taxonomy following these references:
Thanks! I read both these and now I have a better idea of what you mean by taxon view. Its a useful concept.
Is taxa considering taxon views?
I am after reading those papers. I dont know scott's opinion on it. I don't think most users will need that level of detail, but it would be invaluable to a select few and might facilitate some interesting research and tools.
May (or may not) be the database the taxon view?
Its kindof functions as a taxon view, in the sense that it is an authority to reference, however, its pretty specific to online databases at the moment and since these are not the ultimate source of taxonomic authority, the concept is different from a taxon view.
I know, it sounds crazy but may nomenclatorial relationships (accepted name vs. synonyms) be treated the same as taxonomical relationships (parent vs. children) in the context of biodiversity information?
hmm, I am not sure what you mean by that. Like, taxon views that reference and adapt other taxon views moving forward in time can be thought of as a tree of relationships? A taxon view family tree? Similar to the "Operational trees" in Zhong et al. (1996)?
Of course, I don't have any urgency for a solution but it seems to be that the topic had been discussed for around 25 years without having a proper answer.
Haha! Yea, its a tough problem and not really interesting to most people, even though it has big implications for interpreting historical information digitally.
I wrote a bunch of confusing stuff below, mostly for my own notes, so don't feel obliged to read it or respond to it unless you want to. I would of course welcome feedback if you are so inclined : )
Here is the best model I have so far if we really want to try to encode all this confusion:
Below is how such a model might be implemented. However, it is excessively complicated for 95% of what people typically need to do with taxonomic data, so I am not proposing it for taxa
at this time. Something to think about perhaps though. I would consider it for taxa
if it could appear simple to the user who does not need to think about conflicting taxonomies. All of the objects below could be singular or plural (i.e., "vectorized"). The descriptions below are worded as if they are singular.
taxon_view
: This contains the publication bibliographic information for someone's claim about taxonomic relationships. What the actual claim is is not encoded in the taxon_view
itself, but rather the existence of a taxon_view
in an object such as a taxon name or rank encodes the claim. taxon_database
: This indicates a database which some taxonomic information is available. taxon_id
: This contains 0-1 taxon_database
objects and stores the unique identifier of a taxon in a database.taxon_rank
: This contains 0-1 taxon_database
objects and 0-1 taxon_view
objectstaxon_name
: This contains 0-1 taxon_database
objects and 0-1 taxon_view
objects. It might also contain other information like the authority.taxon_link
: This contains 0-1 taxon_view
objects and points to another taxon, probably via taxon ID, although I think R6 classes would allow for actual C-style pointers to objects perhaps. This also could be replaced with an edge list with an extra column of taxon_view
.taxon
: This contains 0+ taxon_name
, taxon_id
, taxon_rank
, and taxon_link
objects for parents and children. The taxon_view
in the taxon_link
object for children would be the taxon view for the taxon as a whole.A set of interlinked taxon
objects would then be a taxonomy. There would be multiple paths from a root to a tip in such a taxonomy, representing conflicting classifications. The tips would represent whatever we are trying to group with the taxonomy. Alternatively, the taxon_link
objects could be replaced with an edge list with an extra column of taxon_view
, which would be similar to how the current taxonomy
class is set up.
I did not like to insert it here, but there are some brief discussions here.
Thats fine @kamapu, I will take a look. thanks
Hi @sckott, want to take a look at the vectorize
branch and see what you think about the new taxon_db
, taxon_id
, taxon_name
, taxon_rank
, and taxon
vectors? I think they are working for the most part. The tests and vignettes have not been updated, but the man pages have examples that should give you an idea of how they work. The exciting part for me is that they can now be used as columns in tables and work with most base R functions.
@zachary-foster yep, i'll take a look ..
sorry for the delay! just back from vacation.
Trying to install vectorize branch and getting errors related to vctrs - same problem with the dev and cran version of vctrs. Any thoughts?
Hi Scott, no problem, I hope the vacation went well!
It should be working now. There was a function renamed in vctrs since I last installed. I fixed it and now I can install using the github version of vctrs
Looks good for the most part.
is this intended behavior?
x <- taxon(name = c('Homo sapiens', 'Bacillus', 'Ascomycota', 'Ericaceae'),
rank = c('species', 'genus', 'phylum', 'family'),
id = taxon_id(c('9606', '1386', '4890', '4345'), db = 'ncbi'),
auth = c('Linnaeus, 1758', 'Cohn 1872', NA, 'Juss., 1789'),
info = list(list(n = 1), list(n = 3), list(n = 2), list(n = 9)))
x[taxon_rank(x) > 'family']
#> <taxon[2]>
#> [1] 9606|Homo sapiens Linnaeus, 1758|species 1386|Bacillus Cohn 1872|genus
#> Rank levels: phylum < family < genus < species
#> Databases: ncbi(id)
#> Info keys: n(2)
Looks like the eg returns ranks less than family, not greater than. Or am I interpreting wrong what the example does?
Nice
Its intended, but we could do it the other way if that is more intuitive. The way I set it up, high numbers indicate more fine scale ranks so species is greater than genus. I think I did that so that it looks like a factor when printing while displaying the ranks phylum -> species vs species -> phylum:
#> Rank levels: phylum < family < genus < species
vs
#> Rank levels: species < genus < family < phylum
But if think the oppisit ordering convertion is more intuitive, we could print it like:
#> Rank levels: phylum > family > genus > species
Then specifying the ranks manually would look like:
taxon_rank(c('A', 'B', 'C', 'D', 'A', 'D', 'D'),
levels = c(D = NA, A = 30, B = 20, C = 10))
instead of (how it is currently):
taxon_rank(c('A', 'B', 'C', 'D', 'A', 'D', 'D'),
levels = c(D = NA, A = 10, B = 20, C = 30))
Here are the numbers corresponding to the ranks currently (if not specified by the user):
> taxa:::rank_ref
domain superkingdom kingdom subkingdom superphylum infrakingdom
10 20 30 40 50 50
phylum division subphylum subdivision infradivision superclass
60 60 70 70 80 90
class subclass infraclass megacohort supercohort cohort
100 110 120 130 140 150
subcohort infracohort superorder order suborder infraorder
160 170 180 190 200 210
parvorder superfamily family subfamily tribe subtribe
220 230 240 250 260 270
genus subgenus section subsection species group species subgroup
280 290 300 310 320 330
species infraspecies subspecies variety varietas subvariety
340 350 360 370 370 380
race stirp form forma morph aberration
380 390 400 400 400 410
subform unspecified no rank clade
420 NA NA NA
We could make the number decrease going from phylum to species if that is more intuitive
Ah okay, right we also have the same ordering in the rank_ref dataset in taxize. That makes sense then
Ok, cool. That list is mostly derived from taxize::rank_ref
, mostly just reformatted
What do you think of adding support for synonyms to the taxon
class? It could be a list of taxon_name
vectors. There could either be a separate list for synonyms or we could replace the taxon_name
vector in the taxon
class with a list of taxon_name
vectors and the first in the list could be the "accepted" name.
Also should the authorities be associated with the taxon
or taxon_name
class? If there were synonyms, would it be better to move it to the taxon_name
class so that each synonym could have a different authority?
It would be nice to modify %in%
and ==
so that:
'Bacillus' %in% x
would return TRUE
when any synonym is matched.
makes sense to add synonyms. isn't it possible that a user would creat a taxon
class that is a name or set of names that are not accepted names?
I think each synonym should be able to have a different authority for sure. - they're just like any other name i guess, with their own authority, id, etc.
I can't quite visualize how this will work since taxon
class can have an arbitrary set of taxa in it, so where would you organize a set of names that are related (in this case synonyms of one another)
isn't it possible that a user would create a taxon class that is a name or set of names that are not accepted names?
Yea, I guess i was thinking more about how they were stored in the object. So this:
name = c('A', 'B', 'C')
synonym = list(c('D', 'E'), character(0), character(0))
vs
name = list(c('A', 'D', 'E'), 'B', 'C')
After thinking about it, either would work. When I said "accepted" I was thinking the one that would appear in the print out and the output of as.character
and such. So c('A', 'B', 'C')
in the above example.
I think each synonym should be able to have a different authority for sure. - they're just like any other name i guess, with their own authority, id, etc.
Hmm, thats a good point. I had thought each should have their own authority, but yea, they could also have different ranks, which would make things more complicated. Do synonyms ever have different IDs in databases? If so, then synonym would have to be defined using taxon
objects rather than taxon_name
.
I can't quite visualize how this will work since taxon class can have an arbitrary set of taxa in it, so where would you organize a set of names that are related (in this case synonyms of one another)
I was thinking of just having multiple names by adding a list of taxon_name
vectors of equal length to the number of taxa. So for each taxon, in addition to its name, there would be a vector of synonym names. but if each synonym can have a rank/id/authority, then that wont work.
It seems like there are two different things that we are trying to encode here:
What do you think about these set of changes to support these two concepts:
taxon
class to taxon_name
. It could also be called taxon_view
to mirror the concept in taxlist
. This would encode # 1 of the two concepts above.taxon_name
class and replace it with a character vector where it is currently used.taxon_rank
class. That way the database of the taxon_name
class (currently the taxon
class) would be defined by the taxon_db
object in its taxon_id
field, which would simplify things.taxon
vector class that is a list of taxon_name
vectors (currently called taxon
). This would encode concept # 2 in the above list. When there are no synonyms, this taxon
class would behave almost the same as the current taxon
class.but if each synonym can have a rank/id/authority
Perhaps I'm not that in into technical properties of taxa
to deliver a proper opinion, but I'm wondering why a synonym or name have to be associated to a taxonomic rank. In taxlist
, the taxonomic rank (level) is a property of the taxon concept (slot taxonRelations) and not of the taxon usage names (slot taxonNames).
In my opinion, there are two solutions to handle taxonomic lists, enabling their connection to observations and/or specimens:
taxlist
. This is the most pragmatical approach but less flexible one. We assume, there is always one, most accepted opinion for the circumscription of a taxon. Further opinions will be then considered by using synonyms. The big and many times hidden problem on this approach is that the reasons how a name becomes synonym are diverse and in some cases two taxa may share a synonym.Therefore:
taxlist2
object as "not-a-good-idea". Not yet... at least... ...or maybe now or never...Thanks for the input @kamapu!
I'm wondering why a synonym or name have to be associated to a taxonomic rank.
I think it makes sense that a synonym can have a different rank, although I have not thought about it much before. You see it often below/at the species rank:
taxize::synonyms('homo sapiens', db = 'col')[[1]][1:10, 1:3]
#> ══ 1 queries ═══════════════
#>
#> Retrieving data for taxon 'homo sapiens'
#> ✔ Found: homo sapiens
#> ══ Results ═════════════════
#>
#> ● Total: 1
#> ● Found: 1
#> ● Not Found: 0
#> id name
#> 1 27ef25a84b47c43bdf06e027d05e1e06 Homo aethiopicus
#> 2 475b397e5b049d4d6510e289036b7dcb Homo americanus
#> 3 c209900e7e04b5a67b8e3246e5215c66 Homo arabicus
#> 4 5d326b11b8bb53e1394c8c40cef4aa97 Homo australasicus
#> 5 a1723fb1a1b2d6d3e09f1ec6bae7fe9b Homo cafer
#> 6 1d92d3b9a7c4c85333d330a5d404e401 Homo capensis
#> 7 895d6bd6e6ef0e5bf2269f8df8cf05cb Homo columbicus
#> 8 2cb9d2a4281f3d88733d0e79940f01c3 Homo drennani
#> 9 00d994548f2f2e12980d4843d55d74e0 Homo fossilis proto-aethiopicus
#> 10 f85224c601bb5d16719b1c909d198531 Homo fossilis protoaethiopicus
#> rank
#> 1 species
#> 2 species
#> 3 species
#> 4 species
#> 5 species
#> 6 species
#> 7 species
#> 8 species
#> 9 infraspecies
#> 10 infraspecies
Created on 2019-07-22 by the reprex package (v0.3.0)
I dont think there is a problem with synonyms having ranks, as long as the all the taxon ranks in a taxon concept are more fine-scale than all the ranks in the parent taxon concept.
A single opinion/taxon concept approach
Since much of the information that will be used with taxa
will come for databases with diverse assumptions, I agree this approach might not fit all types of data well.
A multiple taxon concept approach
This is what I am going for, but there will be a kind of "first among equals" behavior when functions need a single taxon name (e.g. labeling a taxonomic tree). I was thinking the first synonym is the "representative" taxon name, but does not have more weight than the others in most operations.
I assume synonyms does not make sense because every name will be associated to one taxon concept according to a taxon view.
The way I was thinking of it, there could still be multiple synonyms in a single taxon concept. Basically, if they have the same child taxa, they are synonyms. In the way I would encode it, F. ovina L. in Figure 1 would all be different taxon concepts (homonyms) since they have different child taxa. The two F. ovina agg. would also be different taxon concepts, but they would be the same concept (synonyms) if the placement of F. guestfalica did not differ between them.
Anyway, in this case the taxon concepts will be related to each other in terms of shared space between circumscriptions.
Is this the same as saying "if they have the same child taxa, they are synonyms"?
taxonomic rank almost irrelevant for defining relationships between taxon concepts.
Yea, I agree, its just an extra piece of information. taxa
will not require taxon ranks to work. They are there to make subsetting easier for the most part.
For the second approach, we can use Fig. 1 in Jansen & Dengler (2010). There the circumscription of Festuca ovina L. sec. Jäger & Werner (2005) is equivalent to the circumscription of Festuca ovina ssp. vulgaris var. vulgaris sec. Ascherson (1864). Thus, the taxonomic relationship between the second taxon concept and F. ovina L. sec. Ascherson (1864) due to taxonomic ranks is the same relationship as with the first taxon concept due to different views.
I am not sure I understand this. "Festuca ovina L. sec. Jäger & Werner (2005)" = "Festuca ovina ssp. vulgaris var. vulgaris sec. Ascherson (1864)"?
I know, the second approach is the most realistic and far more flexible one but it will make taxonomic data handle a real inferno.
It will get more complex, since taxon concepts can then have multiple parents. I think its doable. The hard part will be to make it easy to understand for the user when there are no synonyms, since most users will not need to consider synonyms. Basically, I want new users to be unaware synonyms are even supported until they need that functionality.
@zachary-foster looking at this in taxize:
get_ fxns
I'm not sure the vectorize
branch as is will work. I can't feasibly attach attributes on the output of get_ fxns and not lose them. e.g. I can just add attributes on a taxon_name
object, whic works. It's great that taxon_name
objects can be combined with c()
, but then the attributes are lost. That was the case before too, but trying to avoid attributes being lost now. And I'd need to create a new version of as_tibble.taxa_taxon_name
to be able to add attributes to the output data.frame, which doesn't seem like a great idea to override the taxa pkg method. I know you said you didn't want to deal with attributes though, so I'm not sure where that leaves us?
I might play around with creating a class on top of taxa::taxon_name
that will be able to hold attributes, be coerced to a data.frame, and be passed through to other fxns easily
taxa::classification()
taxize::classification(129313, db = 'itis')
- not sure how supertaxa and instances are supposed to worktaxa::tax_name()
/taxa::tax_rank()
Same concern about classifcation applies here wrt overriding taxize fxn names if taxa is loaded after taxize.
Hi @sckott,
That was the case before too, but trying to avoid attributes being lost now.
The vctrs
package should allow this. I must have overridden some method that breaks this for c
.
x <- vctrs::new_vctr(1:10, my_attr = 'result', class = 'test_class')
y <- vctrs::new_vctr(1:10, my_attr = 'different result', class = 'test_class')
attributes(x)
#> $my_attr
#> [1] "result"
#>
#> $class
#> [1] "test_class" "vctrs_vctr"
attributes(x[1])
#> $my_attr
#> [1] "result"
#>
#> $class
#> [1] "test_class" "vctrs_vctr"
attributes(c(x, x))
#> $my_attr
#> [1] "result"
#>
#> $class
#> [1] "test_class" "vctrs_vctr"
attributes(c(x, y))
#> No common type for `..1` <test_class> and `..2` <test_class>.
Created on 2019-10-24 by the reprex package (v0.3.0)
Note how this does not work if the attributes have different values. For example, I had to overwrite c.taxa_taxon_name
to make the taxon ranks be combined correctly, since they are stored in an attribute.
If we can come up with a general rule for combining attributes with different values, I can probably make it work. I can add a ...
to define attributes in the taxa constructors using the vctrs
functionality and add a c
method to preserve the attributes how we choose.
I know you said you didn't want to deal with attributes though, so I'm not sure where that leaves us?
I was talking about the per-value attributes (e.g. some info on each taxon). I had not considered per-vector/object attributes before. I am fine with adding either, I was just saying that per-value attributes might be better handled by columns in a table containing a taxa
vector.
I might play around with creating a class on top of taxa::taxon_name that will be able to hold attributes
That was the strategy I was going to suggest when a specific attribute needs a specific way of being combined. Either way, making the taxa
vector preserve attributes as much as possible would be a good idea.
One problem here is that this fxn is the same name as taxize has.
Good point, I should have noticed that. Perhaps we should change its name back to hierarchy
.
what are instances and are they needed?
The instances are the index of a taxon in the taxonomy. In the classification
print out, they are the most fine-scale taxa (i.e. the leafs).
how would you propose to construct a taxa::classification object from the current objects we return for ...
We would use the algorithm from parse_tax_data
in the current taxa
. I haven't adapted it yet for classification
, but I imagine some alternative constructor built on it that takes a list of anything that can be coreiced into a list of taxon
vectors, where each vector is the classification for a single taxon.
library(taxa)
#>
#> Attaching package: 'taxa'
#> The following object is masked from 'package:base':
#>
#> %in%
x <- taxize::classification(c(129313, 129310), db = 'itis')
x
#> $`129313`
#> name rank id
#> 1 Animalia kingdom 202423
#> 2 Bilateria subkingdom 914154
#> 3 Protostomia infrakingdom 914155
#> 4 Ecdysozoa superphylum 914158
#> 5 Arthropoda phylum 82696
#> 6 Hexapoda subphylum 563886
#> 7 Insecta class 99208
#> 8 Pterygota subclass 100500
#> 9 Neoptera infraclass 563890
#> 10 Holometabola superorder 914213
#> 11 Diptera order 118831
#> 12 Nematocera suborder 118832
#> 13 Culicomorpha infraorder 125808
#> 14 Chironomidae family 127917
#> 15 Chironominae subfamily 129228
#> 16 Chironomini tribe 129229
#> 17 Chironomus genus 129254
#> 18 Chironomus riparius species 129313
#>
#> $`129310`
#> name rank id
#> 1 Animalia kingdom 202423
#> 2 Bilateria subkingdom 914154
#> 3 Protostomia infrakingdom 914155
#> 4 Ecdysozoa superphylum 914158
#> 5 Arthropoda phylum 82696
#> 6 Hexapoda subphylum 563886
#> 7 Insecta class 99208
#> 8 Pterygota subclass 100500
#> 9 Neoptera infraclass 563890
#> 10 Holometabola superorder 914213
#> 11 Diptera order 118831
#> 12 Nematocera suborder 118832
#> 13 Culicomorpha infraorder 125808
#> 14 Chironomidae family 127917
#> 15 Chironominae subfamily 129228
#> 16 Chironomini tribe 129229
#> 17 Chironomus genus 129254
#> 18 Chironomus prior species 129310
#>
#> attr(,"class")
#> [1] "classification"
#> attr(,"db")
#> [1] "itis"
y <- lapply(x, function(y) taxon_name(name = y$name, rank = y$rank, id = y$id))
y
#> $`129313`
#> <taxon_name[18]>
#> [1] 202423|Animalia|kingdom 914154|Bilateria|subkingdom
#> [3] 914155|Protostomia|infrakingdom 914158|Ecdysozoa|superphylum
#> [5] 82696|Arthropoda|phylum 563886|Hexapoda|subphylum
#> [7] 99208|Insecta|class 100500|Pterygota|subclass
#> [9] 563890|Neoptera|infraclass 914213|Holometabola|superorder
#> [11] 118831|Diptera|order 118832|Nematocera|suborder
#> [13] 125808|Culicomorpha|infraorder 127917|Chironomidae|family
#> [15] 129228|Chironominae|subfamily 129229|Chironomini|tribe
#> [17] 129254|Chironomus|genus 129313|Chironomus riparius|species
#> Rank levels: kingdom < subkingdom < infrakingdom = superphylum < phylum < subphylum < class < subclass < infraclass < superorder < order < suborder < infraorder < family < subfamily < tribe < genus < species
#>
#> $`129310`
#> <taxon_name[18]>
#> [1] 202423|Animalia|kingdom 914154|Bilateria|subkingdom
#> [3] 914155|Protostomia|infrakingdom 914158|Ecdysozoa|superphylum
#> [5] 82696|Arthropoda|phylum 563886|Hexapoda|subphylum
#> [7] 99208|Insecta|class 100500|Pterygota|subclass
#> [9] 563890|Neoptera|infraclass 914213|Holometabola|superorder
#> [11] 118831|Diptera|order 118832|Nematocera|suborder
#> [13] 125808|Culicomorpha|infraorder 127917|Chironomidae|family
#> [15] 129228|Chironominae|subfamily 129229|Chironomini|tribe
#> [17] 129254|Chironomus|genus 129310|Chironomus prior|species
#> Rank levels: kingdom < subkingdom < infrakingdom = superphylum < phylum < subphylum < class < subclass < infraclass < superorder < order < suborder < infraorder < family < subfamily < tribe < genus < species
my_classification <- alternate_constructor(y)
#> Error in alternate_constructor(y): could not find function "alternate_constructor"
Created on 2019-10-24 by the reprex package (v0.3.0)
Same concern about classifcation applies here wrt overriding taxize fxn names
Yea, will have to do something about that. Either change the prefix for all the taxa
getters/setters (perhaps tx_*
) or change taxize
functions to lookup_name
and lookup_rank
or something.
Just looked at the get_
fxns again and I noticed that all the attributes are per-taxon values. I can see three ways of dealing with this, from easiest to hardest:
1) What about returning tables instead? Like
library(taxa)
#>
#> Attaching package: 'taxa'
#> The following object is masked from 'package:base':
#>
#> %in%
library(taxize)
#>
#> Attaching package: 'taxize'
#> The following objects are masked from 'package:taxa':
#>
#> classification, tax_name, tax_rank
library(tibble)
x = get_tsn(c("Chironomus riparius","Quercus douglasii"))
#> ══ 2 queries ═══════════════
#>
#> Retrieving data for taxon 'Chironomus riparius'
#> ✔ Found: Chironomus riparius
#>
#> Retrieving data for taxon 'Quercus douglasii'
#> ✔ Found: Quercus douglasii
#> ══ Results ═════════════════
#>
#> ● Total: 2
#> ● Found: 2
#> ● Not Found: 0
tibble(id = taxon_id(as.character(x), db = 'itis'), !!! attributes(x))
#> # A tibble: 2 x 6
#> id class match multiple_matches pattern_match uri
#> <tax_id> <chr> <chr> <lgl> <lgl> <chr>
#> 1 129313 (itis) tsn found FALSE FALSE https://www.iti…
#> 2 19322 (itis) tsn found FALSE FALSE https://www.iti…
Created on 2019-10-24 by the reprex package (v0.3.0)
2) build a vctrs
record class that with the same fields as taxon_id
, plus some extra ones. You would have to make methods for some of the coercion functions and perhaps others.
library(taxa)
#>
#> Attaching package: 'taxa'
#> The following object is masked from 'package:base':
#>
#> %in%
library(taxize)
#>
#> Attaching package: 'taxize'
#> The following objects are masked from 'package:taxa':
#>
#> classification, tax_name, tax_rank
library(tibble)
library(vctrs)
# Constructors
new_my_class <- function(id = character(), db = taxon_db(), match = match, .names = NULL) {
vctrs::new_rcrd(list(.names = .names, id = id, db = db, match = match),
.names_set = FALSE, # needed for taxon_id methods to work
class = c("taxize_my_class", 'taxa_taxon_id'))
}
my_class <- function(id = character(), db = NA, match = NA, .names = NULL) {
if (is.null(.names)) {
.names <- NA_character_
}
.names <- vctrs::vec_cast(.names, character())
id <- vctrs::vec_cast(id, character())
db <- vctrs::vec_cast(db, taxon_db())
match <- vctrs::vec_cast(match, character())
c(id, db, match, .names) %<-% vctrs::vec_recycle_common(id, db, match, .names)
new_my_class(.names = .names, id = id, db = db, match = match)
}
# S3 coercion functions
vec_ptype2.taxize_my_class <- function(x, y, ...) UseMethod("vec_ptype2.taxize_my_class", y)
vec_ptype2.taxize_my_class.default <- function(x, y, ..., x_arg = "", y_arg = "") {
vctrs::stop_incompatible_type(x, y, x_arg = x_arg, y_arg = y_arg)
}
vec_ptype2.taxize_my_class.vctrs_unspecified <- function(x, y, ...) x
vec_ptype2.taxize_my_class.taxize_my_class <- function(x, y, ...) my_class()
x = get_tsn(c("Chironomus riparius","Quercus douglasii"))
#> ══ 2 queries ═══════════════
#>
#> Retrieving data for taxon 'Chironomus riparius'
#> ✔ Found: Chironomus riparius
#>
#> Retrieving data for taxon 'Quercus douglasii'
#> ✔ Found: Quercus douglasii
#> ══ Results ═════════════════
#>
#> ● Total: 2
#> ● Found: 2
#> ● Not Found: 0
x
#> [1] "129313" "19322"
#> attr(,"class")
#> [1] "tsn"
#> attr(,"match")
#> [1] "found" "found"
#> attr(,"multiple_matches")
#> [1] FALSE FALSE
#> attr(,"pattern_match")
#> [1] FALSE FALSE
#> attr(,"uri")
#> [1] "https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=129313"
#> [2] "https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=19322"
y = my_class(as.character(x), db = 'itis', match = attr(x, 'match'))
y
#> <taxon_id[2]>
#> [1] 129313 (itis) 19322 (itis)
class(y)
#> [1] "taxize_my_class" "taxa_taxon_id" "vctrs_rcrd" "vctrs_vctr"
is_taxon_id(y)
#> [1] TRUE
vctrs::fields(y)
#> [1] ".names" "id" "db" "match"
vctrs::field(y, 'match')
#> [1] "found" "found"
vctrs::field(y[1], 'match')
#> [1] "found"
vctrs::field(c(y, y), 'match')
#> [1] "found" "found" "found" "found"
Created on 2019-10-24 by the reprex package (v0.3.0)
3) Allow arbitrary per-value attributes as a list field in taxa
classes. I did experiment with have per-taxon attributes stored in a named list for each taxon, but eventually thought just including that info in table would be more straight-forward. It also might add a lot of overhead; i am not sure. We can try that rout again though
Sorry about the long delay on this @zachary-foster - had a bunch of new versions of other pkgs to get out.
Thanks for the good brainstorming. I'll play around with these and get back to you. My gut reaction is that I don't want to return tables from get_* fxns. It just doesn't seem like the right data structure, but maybe after trying it i'll change my mind.
No problem, I have been busy with traveling anyway.
I played around with this some more, and I think option 2 would probably be the most similar to how taxize
already works. That way you could have per-value attributes (as fields in rcrd-style vectors) that are preserved when subsetting and combining as well as per-vector attributes that are preserved when subsetting (vctrs
does both automatically). If you want users to see the attributes in the print output, you would need to add print methods. Either way, users could see and manipulate them with vctrs::field
.
sounds good - i'll have a look at option 2
Sorry about another long delay on this @zachary-foster - Are you still planning to move to using vctrs
?
Sorry for the delayed response. I have been on vacation after defending my PhD and was not checking my email much. I am back to work now.
Yea, I am still trying to get vctrs
to work. It was working nicely for the most part, although If I remember correctly a few obscure bugs and missing features in vctrs
was causing some issues. Those might be fixed by now
Hi @sckott, I have made some good progress of the vectorize
branch. If you have the time, I would like to go over it with you to get your opinions before I continue. Here is a summary of what I did since we last discussed it:
vctrs
updates casedvctrs
limitations and incomplete features (e.g. robust vector names for rcrd objects) taxon
class that was meant to support synonyms and renamed taxon_name
to taxon
. The old taxon
class was basically a list of data.frames (taxon_name
objects) under the hood, which had pretty bad memory management and performance. In my tests taxon_name
(single data.frame) was 100x faster and 100x smaller in memory for the same data, but cant support synonyms right now. I have some ideas how synonyms could be supported while still getting good performance. All the classes besides classification
are working as far as I can tell. taxonomy
needs more testing, since it is more complicated, but seems to work. Once we decide on how taxonomy
should work ideally, I can work on finishing classification
and adding taxmap
, which would be the last class needed.
Sounds good - did you want to do a call?
Yea, that would be good. Does zoom work for you?
Yeah, what day/time?
@zachary-foster Tried using taxa::taxon
in taxize. Works nicely so thats cool. But main problem is the attributes aren't subsetted along with id/name/rank/auth (and when coerced to data.frame the attributes are lost; would be nice to have the attributes I add go in to the data.frame, especially the URI for the taxon). Maybe instead of changing taxa to meet taxize's needs, I could extend taxon
in taxize?
Yea, I think you could extend taxon
in taxize. I did some quick tests, and it seems possible, although you might have to add a few custom methods as problems arise. Here is a proof of concept:
new_taxize_taxon <- function(.names = NULL, name = character(), rank = taxon_rank(), id = taxon_id(), auth = taxon_authority(), new_col = character(), ...) {
# Set names to NA if not set
if (is.null(names) || all(is.na(.names))) {
.names_set <- FALSE
.names <- vctrs::vec_recycle(NA_character_, length(name))
} else {
.names_set <- TRUE
vctrs::vec_assert(.names, ptype = character())
}
# Check that values are the correct type
vctrs::vec_assert(name, ptype = character())
# vctrs::vec_assert(rank, ptype = taxon_rank())
vctrs::vec_assert(id, ptype = taxon_id())
vctrs::vec_assert(auth, ptype = taxon_authority())
vctrs::vec_assert(new_col, ptype = character())
# Create new object
vctrs::new_rcrd(list(.names = .names, name = name, rank = rank, id = id, auth = auth, new_col = new_col),
.names_set = .names_set,
...,
class = c("taxize_test", "taxa_taxon"))
}
taxize_taxon <- function(name = character(0), rank = NA, id = NA, auth = NA, .names = NA, new_col = NA, ...) {
# Cast inputs to correct values
name <- vctrs::vec_cast(name, character())
rank <- vctrs::vec_cast(rank, taxon_rank())
id <- vctrs::vec_cast(id, taxon_id())
auth <- vctrs::vec_cast(auth, taxon_authority())
new_col <- vctrs::vec_cast(new_col, character())
.names <- vctrs::vec_cast(.names, character())
# Recycle ranks and databases to common length
recycled <- vctrs::vec_recycle_common(name, rank, id, auth, new_col, .names)
name <- recycled[[1]]
rank <- recycled[[2]]
id <- recycled[[3]]
auth <- recycled[[4]]
new_col <- recycled[[5]]
.names <- recycled[[6]]
# Create taxon object
new_taxize_taxon(.names = .names, name = name, rank = rank, id = id, auth = auth, new_col = new_col, ...)
}
vec_cast.taxize_test <- function(x, to, ..., x_arg, to_arg) UseMethod("vec_cast.taxize_test")
vec_cast.taxize_test.default <- function(x, to, ..., x_arg, to_arg) vctrs::vec_default_cast(x, to, x_arg, to_arg)
vec_cast.taxize_test.taxize_test <- function(x, to, ..., x_arg, to_arg) x
vec_cast.taxize_test.character <- function(x, to, ..., x_arg, to_arg) taxon(x)
vec_cast.character.taxize_test <- function(x, to, ..., x_arg, to_arg) as.character(vctrs::field(x, "name"))
x <- taxize_test(name = c('Homo sapiens', 'Bacillus', 'Ascomycota', 'Ericaceae'),
rank = c('species', 'genus', 'phylum', 'family'),
id = taxon_id(c('9606', '1386', '4890', '4345'), db = 'ncbi'),
auth = c('Linnaeus, 1758', 'Cohn 1872', NA, 'Juss., 1789'),
new_col = c('A', 'B', 'C', 'D'))
names(x) <- c('a', 'b', 'c', 'd')
vctrs::field(x[2:3], 'new_col')
thanks @zachary-foster - i;ll try that
I've used your example code above and built on it. work on this branch https://github.com/ropensci/taxize/compare/taxa-work Some thoughts:
taxon_id()
doesn't account for when there's no ID. I tried to change the regex to accept, e.g,. [0-9]+|(NA)
, but got an error using the setter, opened an issue https://github.com/ropensci/taxa/issues/207 .names
but it's a bit tricky done programmatically b/c there can be any number of taxon ids not found, so i can document for users how to set names themselves if they want names on the outputright now, the db checker in taxon_id() doesn't account for when there's no ID.
Good find. I fixed it so now NA
is a valid ID regardless of database.
but got an error using the setter, opened an issue #207
That should be fixed now
i tried settings names via .names but it's a bit tricky done programmatically b/c there can be any number of taxon ids not found, so i can document for users how to set names themselves if they want names on the output
I am not sure I understand. Even if a taxon ID is not found is NA
returned? Is the input length/order and output length/order different if there are unknown taxon IDs?
thanks for the fixes.
for the names applied via .names
- yes, NA is returned when there is no ID found in the taxize get_* functions. But if there's NA's I think it may not make sense to give the output taxon object vector names since you can't subset a vector AFAIK with NA's as the indices, and if there's more than one NA, then how do you index to the different NA's. So I think for taxize I lean towards not supplying names to the output vectors, but show the user how to do it if they want to
There seems to be an error in the print method when there is an NA, e.g., (or maybe that is correct, expected behavior)
x <- taxon(c('A', 'B', 'C'), .names = c('d', 'e', NA_character_))
x
#> <taxon[3]>
#> Error: If any elements are named, all elements must be named.
rlang::last_error()
<error/rlang_error>
If any elements are named, all elements must be named.
Backtrace:
1. (function (x, ...) ...
2. vctrs:::print.vctrs_vctr(x)
3. vctrs::obj_print(x, ...)
5. taxa:::obj_print_data.taxa_taxon(x, ...)
6. taxa:::printed_taxon(x, color = TRUE)
7. taxa:::font_tax_name(x)
12. taxa:::tax_rank.taxa_taxon(text)
13. taxa:::named_field(x, "rank")
15. vctrs:::`names<-.vctrs_vctr`(`*tmp*`, value = names(x))
Run `rlang::last_trace()` to see the full context.
How about naming the output by the input, assuming the input is one-to-one with the output and is a character vector?
That error seems to be due to vctrs
not allowing vctr
objects to be named with NA
. It was actually taxon_rank
causing the problem. taxon_rank
is a vcrt
object whereas most the others are rcrd
, for which I have custom naming code that allows NA
. From what I gather reading issues on the vctrs github, the vctrs developers seem undecided about allowing NA or "" to be valid names
I fixed the issue for taxon
in the mean time
Thanks for the fix.
Could name by the inputs, might try that
@zachary-foster do you need taxize anymore in taxa? its' in Imports, but I don't see any use of taxize in taxa vectorize branch. I need to have taxa in Imports in taxize now - and we can't have each others pkgs both in Imports. thoughts?
weird, even after removing taxize from Imports in taxa, i keep getting a circular dependency error when i run check on taxize (with taxa in Imports) - i can't find taxize in taxa anywhere, not sure what's going on. it's clearly taxa since if I remove taxa from Imports, then check runs fine
Yea, we should remove the taxize dependency. I will try to remove the dependency and see if I get the same problem.
thanks for having a look - relevant taxize branch i'm working on is taxa-work
I got the same problem. It looks like R CMD check is looking up the dependencies from CRAN oddly enough. I tried the check with my internet turned off and got this error:
─ checking package dependencies ...Warning: unable to access index for repository https://CRAN.R-project.org/src/contrib:
cannot open URL 'https://CRAN.R-project.org/src/contrib/PACKAGES'
Warning: unable to access index for repository https://bioconductor.org/packages/3.11/bioc/src/contrib:
cannot open URL 'https://bioconductor.org/packages/3.11/bioc/src/contrib/PACKAGES'
Warning: unable to access index for repository https://bioconductor.org/packages/3.11/data/annotation/src/contrib:
cannot open URL 'https://bioconductor.org/packages/3.11/data/annotation/src/contrib/PACKAGES'
Warning: unable to access index for repository https://bioconductor.org/packages/3.11/data/experiment/src/contrib:
cannot open URL 'https://bioconductor.org/packages/3.11/data/experiment/src/contrib/PACKAGES'
Warning in url(sprintf("%s/%s", cran, path), open = "rb") :
URL 'https://CRAN.R-project.org/web/packages/packages.rds': status was 'Couldn't resolve host name'
Error in url(sprintf("%s/%s", cran, path), open = "rb") :
cannot open the connection to 'https://CRAN.R-project.org/web/packages/packages.rds'
Execution halted
So I guess its only checking the version on CRAN?
I saw you added:
Remotes: ropensci/taxa@vectorize-no-taxize
So Im not sure. Could it be a bug in devtools or R CMD check?
Weird, good catch. I am using R CMD CHECK on the command line so there's no devtools or remotes pkgs involved AFAIK.
Tried it with internet off too and now it doesn't throw that circular dependency error - internet back on and it does.
vectorize-no-taxize
just removes taxize from imports in taxa. and it doesn't seem to help compared to the vectorize
branch
Ok, so it must be relying on the CRAN version to check dependencies. Thats annoying, but I suppose the problem will go away once taxa on CRAN is updated.
Another option would be to rename taxa to taxa2 or something. I was wondering whether that would be a good idea to do anyway, so that after taxa is updated, published code that uses taxa (mostly papers using metacoder) will still work. About 100 papers have cited metacoder; I am not sure how many of those have published code though.
The changes in taxa are quite large, so it might warrant a new name, or at least a major version bump if the same name. I'll leave it up to you. I guess folks using the current taxa or older versions would just need to avoid updating to this new version if you keep the same name.
I have been thinking about how to make all of the
taxa
classes most useful and I think I have come up with a way to reorganize things that will make things more natural and R-like, while increasing flexibility.Vectorizing the
TaxonName
,TaxonRank
,TaxonId
, andTaxonDatabase
classesRight now these store individual values and an associated
TaxonDatabase
object. However, if they were like other R objects, they would allow multiple values. Currently, their primary use is to store information inTaxon
objects, but if they could hold multiple values, then they could be used on their own more usefully. They would also be useful as column in tibbles if this PR gets accepted.Question: do we allow multiple
TaxonDatabase
objects in a singleTaxonName
,TaxonRank
, orTaxonId
object with multiple values, or do all values have to come from the same database?Question: What happens when we combine objects from different database with
c.TaxonName
?We could allow to hold multiple values and behave like a factor.
Reorganizing inheritance
Until a recent changes in the eval branch, the class inheritance looked like this:
I think it would be much more elegant code-wise, to do this:
Taxonomy
has most of the important/complicated methods andTaxa
has most of the important getters/setters. In this setup,Hierarchies
would be the same asTaxonomy
internally, but have different getters/setters and print method to make it appear 1-dimentional. This would be easier for users to understand, but also preserve the performance improvements ofTaxonomy
. Also things likefilter_taxa(taxon_ranks != "family")
would work onHierarchies
objects without modification to the code forfilter_taxa
.Hierarchies
would be particularly useful as a column in atibble
if this PR gets accepted.Removing/hiding non-vetorized classes
We could remove/hide the
Hierarchy
andTaxon
class in favor of theTaxa
andHierarchies
classes, since the plural forms can do everything the singular forms can. We can also rename the plural form to the singular to be more in R style (character
is notcharacters
), so the inheritance hierarchy would become:Multiple names/ranks/ids per taxon
@kamapu pointed out that
taxa
does not support synonyms. TheTaxmap
is flexible enough that it can encode synonyms, but the other classes cannot. If we allow multipleTaxonName
,TaxonRank
, andTaxonId
objects, inTaxon
objects (or each item in aTaxa
object), then that would allow for disagreements among multiple sources of taxonomic information (same taxon from multiple databases). We would then need to define methods like==.Taxon
/==.Taxa
andas.character.Taxon
to account for this, but that is doable.Whew, that was a lot of stuff. I put it all in one issues since many of the changes are related and dependent on eachother, but if we decide to go this route, then I will split it up into multiple issues and make a milestone.
I would appreciate feedback from anyone who sees this, especially @sckott.