Closed sckott closed 7 years ago
Yes, I think having both types would be useful. One type of object would hold individual taxa, or perhaps a list of taxa, in which each taxon is self-contained. This would provide a good contrast with the taxmap objects, where it is assumed that there is data classified by taxa and taxa are not self-contained. Interconversion methods would be good to have. I have not thought about this much yet though.
We should think about a few use-cases for the other types of classes. taxmap is designed to be as abstract as possible, not even specific to biology really, so it would be nice to have some classes that consider things like rank, like the taxa class currently does.
How do you envision these kinds of classes will be used? In taxize?
I was thinking about this just now and I am having trouble figuring out what information should be contained within a "Taxon".
thanks for all your thoughts, will respond soon, was driving all day 🚗
No problem. sounds good
Here is another quandary:
5) What about "placeholder" "taxa", like "undefined", "incertae sedis", "spp", and "Root"? Are these taxa? When comparing taxa, an "undefined" basidiomycete is different from an "undefined" ascomycete, but if the whole classification hierarchy is not known or not contained within the taxon class then this is less clear.
How do you envision these kind of classes will be used? In taxize?
I envision something similar to sp, where other pkgs use taxa for base taxonomy classes, either building on them, reading from them, or writing to them. In taxize, we could have all the get_*() fxns coerce to taxa classes as their output (which means we need a way to have multiple taxa together of course).
- A single rank or a full classification? Is "Ascomycota" a taxon alone, or only in the context of "Eukaryota|Fungi|Ascomycota"? Should these two possibilities represent two different classes?
I think both should be allowed. And diff. classes, yes. That is, a single name alone can be class taxon, e.g., and a hierarchy of names can be class hierarchy.
- Multiple valid names? Are "Animalia" and "Metazoa" the same taxon? Should a taxon object allow for multiple names to take this into account?
We can't reasonably account for all these synonyms ourselves, so I guess we should assume the user or database has to supply the info. But if there is info on synonyms, then yeah, seems worth accounting for those.
- Multiple valid IDs? Same taxon in different databases? Arbitrary ID and a database ID?
Right, and each database can have a somewhat different hierarchy, in which each taxon has a different ID. Perhaps a class for a single database reference and its taxonomic names, then another class that combines data from two or more databases.
- User-defined data? Should this be included in the taxon object itself?
What do you mean by user defined data? like the kind of data in the taxmap examples? I think I was thinking of just taxonomy data in the taxon classes
- What about "placeholder" "taxa", like "undefined", "incertae sedis", "spp", and "Root"? Are these taxa?
I think they have to be considered/included somehow. E.g., thinking about ecologists, there are often unknown species, where the lowest known name is, e.g., a family.
When comparing taxa, an "undefined" basidiomycete is different from an "undefined" ascomycete, but if the whole classification hierarchy is not known or not contained within the taxon class then this is less clear.
Right, they are different, but if the user doesn't supply the information, then it'd be hard for us to automatically pull that out. E.g., if they give undefined basidiomycete, we could try to parse out a taxonomic name from it, but we'd need something more sophisticated than we currently have. They could give undefined basidiomycete as the name, then supply Basidiomycota as the phylum in another class.
Use cases for the taxa classes:
binomen
Right now, binomen defines taxonomic classes AND has functions for manipulating those classes (combining, separating, sorting, etc.). With classes being defined in this pkg, we can remove taxonomic classes from binomen, and it will only have the functions to manipulate taxonomic classes.
taxize
All get_*() functions that get IDs for one or more taxa currently give back a simple S3 class that's just the ID with some attributes. Instead, we could coerce to a taxa class and give that back. Then that class is a known thing that we can coerce to other things, like a taxmap class.
@zachary-foster okay, pushed up some changes to the taxa classes - reinstall and see egs.
I have more notes i wrote down on paper for use cases and classes, will put those here
also, more to do:
@sckott Nice, I will look at your changes now. In the meantime, here are my thoughts on your comments above.
We can't reasonably account for all these synonyms ourselves, so I guess we should assume the user or database has to supply the info.
Yes, I think it would be sufficient to have the name field accept multiple values and have any comparison functions take that into account. This could be useful for people in metagenomics who might start out with arbitrary names and identify them during an analysis (e.g. a taxon can be both "OTU1" and "bacillus"). It also could be used when combining different taxonomies, although automating that might not be possible.
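A rough sketch of the multi-valued name idea, written in Python purely for illustration and not as the package's R implementation. The `Taxon` class, its `names` field, and the `same_taxon()` helper are all hypothetical, and "share at least one name" is only one possible comparison rule:

```python
from dataclasses import dataclass, field


@dataclass
class Taxon:
    """Hypothetical taxon whose name field holds one or more names."""
    names: frozenset = field(default_factory=frozenset)

    def __post_init__(self):
        # Normalise case and whitespace so "bacillus" and "Bacillus" match.
        self.names = frozenset(n.strip().lower() for n in self.names)


def same_taxon(a: Taxon, b: Taxon) -> bool:
    """Assumed rule: two taxa match if they share at least one name."""
    return bool(a.names & b.names)


otu = Taxon({"OTU1", "bacillus"})
print(same_taxon(otu, Taxon({"Bacillus"})))  # True
print(same_taxon(otu, Taxon({"OTU2"})))      # False
```

A rule like this is not transitive (A can match B and B match C without A matching C), which is part of why automating the merging of whole taxonomies is harder than comparing two taxa.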
Right, and each database can have a somewhat different hierarchy, in which each taxon has a different ID. Perhaps a class for a single database reference and its taxonomic names, then another class that combines data from two or more databases
Hmm, I think that adding another class specifically for multiple databases might complicate things more than it is worth since we might end up making a whole new set of manipulation functions for it. How about having the database_id be its own simple class and be able to add multiple database_id to taxon? If we add the database_id at the taxon level instead of the hierarchy level we can avoid the differing hierarchy problem. A hierarchy's ID can be just the database_id list of the tip taxon. If we do this, it might be good to have an analogous database_name class.
I think I was thinking of just taxonomy data in the taxon classes
That sounds fine.
Right, they are different, but if the user doesn't supply the information, then it'd be hard for us to automatically pull that out....
How about having a database_id with a value of NA for unknown taxa?
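To make the database_id proposal concrete, here is a minimal sketch in Python (for illustration only, not the R classes in this package). `DatabaseId`, `Taxon`, and `Hierarchy` are hypothetical stand-ins, `None` plays the role of R's NA, and the database names and ID values are made up:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class DatabaseId:
    """Hypothetical stand-in for the proposed database_id class."""
    database: str
    id: Optional[str] = None  # None plays the role of NA for unknown taxa


@dataclass
class Taxon:
    name: str
    ids: List[DatabaseId] = field(default_factory=list)  # several IDs allowed


@dataclass
class Hierarchy:
    taxa: List[Taxon]  # ordered from root to tip

    def ids(self) -> List[DatabaseId]:
        # The hierarchy has no ID of its own; it reports the tip taxon's IDs.
        return self.taxa[-1].ids if self.taxa else []


# Made-up database names and ID values, for illustration only.
fungi = Taxon("Fungi", [DatabaseId("db_a", "A1"), DatabaseId("db_b", "B7")])
unknown = Taxon("undefined", [DatabaseId("db_a")])  # ID is None, i.e. NA

h = Hierarchy([fungi, unknown])
print(h.ids())  # [DatabaseId(database='db_a', id=None)]
```

Keeping the IDs on each taxon means two databases that disagree about the hierarchy can still both be referenced from the same taxon.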
have plural versions of each class?
Hmm, not sure. A hierarchy is pretty much an ordered plural of taxon, but that is different than a list of taxon. Can you think of any information that would apply to a list of taxon or hierarchy but not a single object? A simple list of objects might be sufficient. But then again, we would probably want a custom print method for a list of taxon or hierarchy, which would require a plural class. We could have taxa and taxonomy for the plurals of taxon and hierarchy.
when there's IDs for every rank and name pair, not just the 1 target name, how to handle that
Have the IDs associated with the individual taxa and not the hierarchy as a whole?
Hmm, I think that adding another class specifically for multiple databases might complicate things more than it is worth since we might end up making a whole new set of manipulation functions for it. How about having the database_id be its own simple class and be able to add multiple database_id to taxon? If we add the database_id at the taxon level instead of the hierarchy level we can avoid the differing hierarchy problem. A hierarchy's ID can be just the database_id list of the tip taxon. If we do this, it might be good to have an analogous database_name class.
Sounds good to have a simple class for database_id, with the ability to add multiple to a taxon. I don't think a hierarchy itself needs an ID, since all the taxa within it will have IDs. A hierarchy does need metadata about which database it came from (the database_name class/string).
How about having a database_id with a value of NA for unknown taxa?
sounds good
Hmm, not sure. A hierarchy is pretty much an ordered plural of taxon, but that is different than a list of taxon. Can you think of any information that would apply to a list of taxon or hierarchy but not a single object? A simple list of objects might be sufficient. But then again, we would probably want a custom print method for a list of taxon or hierarchy, which would require a plural class. We could have taxa and taxonomy for the plurals of taxon and hierarchy.
Right, multiples of any class could simply be a list. But as you said, we could attach an S3 class to the list of many taxon objects, or whatever the class is, so we can make it easy to know what to do downstream (whereas if it's just a list, we have to do checks to make sure it's what we expect it to be).
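The trade-off between a bare list and a classed list can be sketched as follows. This is a Python illustration of the idea only, not the S3 code discussed here; the `Taxa` wrapper, its print format, and the `describe()` helper are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Taxon:
    name: str
    rank: str


class Taxa:
    """Hypothetical plural of Taxon: a list that checks its contents once."""

    def __init__(self, items):
        items = list(items)
        bad = [x for x in items if not isinstance(x, Taxon)]
        if bad:
            raise TypeError(f"expected Taxon objects, got {type(bad[0]).__name__}")
        self.items = items

    def __len__(self):
        return len(self.items)

    def __repr__(self):
        # Custom print method, the main reason given for a plural class.
        lines = [f"<Taxa> {len(self.items)} taxa"]
        lines += [f"  {t.rank}: {t.name}" for t in self.items]
        return "\n".join(lines)


def describe(x):
    # Downstream code checks the class instead of inspecting every element.
    return repr(x) if isinstance(x, Taxa) else "not a Taxa object"


taxa = Taxa([Taxon("Fungi", "kingdom"), Taxon("Ascomycota", "phylum")])
print(describe(taxa))
```

The contents are validated once at construction, so functions further down only need a single class check.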
Have the IDs associated with the individual taxa and not the hierarchy as a whole?
sounds good
I looked through the code you put up recently and it seems to be a good fit for classical species-based taxonomic data, the type you would find in ecological studies/surveys. The only concern I have is that it assumes that the user has, or is mostly interested in, species-level information (the name class). In my work (metagenomics), we often don’t have species information, although you have more experience than I do in what people generally want.
Also, taxonomic names are present in both the grouping and name classes, which confused me at first. If there was a function that returned a supertaxon/subtaxon from a taxon object I would expect the output to be another taxon object; it seems like the output would be a character from the grouping of that taxon object in this implementation?
In my work (metagenomics), we often don’t have species information
i assume you mean you could just have an ID, and no name at all, right?
If there was a function that returned a supertaxon/subtaxon from a taxon object I would expect the output to be another taxon object
right, makes sense
it seems like the output would be a character from the grouping of that taxon object in this implementation?
right
i assume you mean you could just have an ID, and no name at all,
Partially, yes, that often happens. But what I meant was that sometimes a sequence can only be assigned to a coarse taxonomic rank (e.g., family or phylum) and the species or genus cannot be determined.
Ah right. I see what you mean. Do the changes in https://github.com/ropenscilabs/taxa/tree/taxa-class-rework account for this now?
Yes, just being able to define hierarchies without species information does the trick.
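As an illustration of the two points settled above (a hierarchy that stops at a coarse rank, and a supertaxon lookup that returns a taxon object rather than a character string), here is a small Python sketch. It is not the code on the taxa-class-rework branch; `Taxon`, `Hierarchy`, `tip()`, and `supertaxon()` are hypothetical names:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Taxon:
    name: str
    rank: str


@dataclass
class Hierarchy:
    taxa: List[Taxon]  # ordered root to tip; the tip can be any rank

    def tip(self) -> Taxon:
        return self.taxa[-1]

    def supertaxon(self, taxon: Taxon) -> Optional[Taxon]:
        """Return the parent as a Taxon object, or None for the root."""
        i = self.taxa.index(taxon)  # raises ValueError if taxon is absent
        return self.taxa[i - 1] if i > 0 else None


# A sequence assigned only down to phylum: no genus or species needed.
h = Hierarchy([
    Taxon("Eukaryota", "superkingdom"),
    Taxon("Fungi", "kingdom"),
    Taxon("Ascomycota", "phylum"),
])

print(h.tip().rank)             # phylum
print(h.supertaxon(h.tip()))    # Taxon(name='Fungi', rank='kingdom')
```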
I went through taxa-class-rework and it looks good. I like the print methods.
We can probably close this too?
sounds good,
@zachary-foster now that the last PR is merged, I assume it still makes sense to include classes and methods not based around data.frames as well? And perhaps methods to go between the two? Or are you thinking differently about this?