ropensci / taxa

taxonomic classes for R
https://docs.ropensci.org/taxa

Where to go now post merge of new data.frame based functions? #4

Closed sckott closed 7 years ago

sckott commented 8 years ago

@zachary-foster now that the last PR is merged

I assume it still makes sense to include classes and methods based not around data.frame's as well? And perhaps methods to go between the two? Or are you thinking differently about this?

zachary-foster commented 8 years ago

Yes, I think having both types would be useful. A type of object that holds individual taxa or perhaps a list of taxa, for which each taxon is self-contained. This would provide a good contrast with the taxmap objects, where it is assumed that there is data classified by taxa and taxa are not self-contained. Interconversion methods would be good to have. I have not thought about this much yet though.

zachary-foster commented 8 years ago

We should think about a few use-cases for the other types of classes. taxmap is designed to be as abstract as possible, not even specific to biology really, so it would be nice to have some classes that consider things like rank, like the taxa class currently does.

How do you envision these kinds of classes will be used? In taxize?

zachary-foster commented 8 years ago

I was thinking about this just now and I am having trouble figuring out what information should be contained within a "Taxon".

  1. A single rank or a full classification? Is "Ascomycota" a taxon alone, or only in the context of "Eukaryota|Fungi|Ascomycota"? Should these two possibilities represent two different classes?
  2. Multiple valid names? Is "Animalia" and "Metazoa" the same taxon? Should a taxon object allow for multiple names to take this into account?
  3. Multiple valid IDs? Same taxon in different databases? Arbitrary ID and a database ID?
  4. User-defined data? Should this be included in the taxon object itself?

sckott commented 8 years ago

thanks for all your thoughts, will respond soon, was driving all day 🚗

zachary-foster commented 8 years ago

No problem. sounds good

zachary-foster commented 8 years ago

Here is another quandary:

5) What about "placeholder" "taxa", like "undefined", "incertae sedis", "spp", and "Root"? Are these taxa? When comparing taxa, an "undefined" basidiomycete is different from an "undefined" ascomycete, but if the whole classification hierarchy is not known or not contained within the taxon class then this is less clear.

sckott commented 8 years ago

How do you envision these kind of classes will be used? In taxize?

I envision similar to sp, where other pkgs use taxa for base taxonomy classes, either building on, reading from, writing to. In taxize, we could have all the get_*() fxns coerce to taxa classes as their output (which means we need a way to have multiple taxa together of course)

sckott commented 8 years ago

  1. A single rank or a full classification? Is "Ascomycota" a taxon alone, or only in the context of "Eukaryota|Fungi|Ascomycota"? Should these two possibilities represent two different classes?

I think both should be allowed, and as different classes, yes. That is, a single name alone could be class taxon, and a hierarchy of names could be class hierarchy.
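The two-class idea can be sketched in base R S3. This is only an illustration under stated assumptions: the constructor names `taxon()` and `hierarchy()` follow the class names mentioned here, but the fields (`name`, `rank`, `taxa`) are hypothetical and not taken from the package.

```r
# Hypothetical constructors; field names are illustrative, not the package API.
taxon <- function(name, rank = NA_character_) {
  stopifnot(is.character(name), length(name) == 1)
  structure(list(name = name, rank = rank), class = "taxon")
}

# A hierarchy is an ordered collection of taxon objects, root first.
hierarchy <- function(...) {
  taxa <- list(...)
  stopifnot(all(vapply(taxa, inherits, logical(1), what = "taxon")))
  structure(list(taxa = taxa), class = "hierarchy")
}

# "Ascomycota" as a taxon on its own ...
asco_alone <- taxon("Ascomycota", rank = "phylum")

# ... and the same name in the context of a full classification.
asco_in_context <- hierarchy(
  taxon("Eukaryota", rank = "domain"),
  taxon("Fungi", rank = "kingdom"),
  asco_alone
)
```

Keeping the two as separate classes means a function can tell from the class alone whether it received a bare name or a name with its full context.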

  2. Multiple valid names? Is "Animalia" and "Metazoa" the same taxon? Should a taxon object allow for multiple names to take this into account?

We can't reasonably account for all these synonyms ourselves, so I guess we should assume the user or database has to supply the info. But if there is info on synonyms, then yeah, seems worth accounting for those.

  3. Multiple valid IDs? Same taxon in different databases? Arbitrary ID and a database ID?

Right, and each database can have a somewhat different hierarchy, in each of which a taxon has a different ID. Perhaps a class for a single database reference and its taxonomic names, then another class that combines data from two or more databases.

  4. User-defined data? Should this be included in the taxon object itself?

What do you mean by user defined data? like the kind of data in the taxmap examples? I think I was thinking of just taxonomy data in the taxon classes

sckott commented 8 years ago

  5. What about "placeholder" "taxa", like "undefined", "incertae sedis", "spp", and "Root"? Are these taxa?

I think they have to be considered/included somehow. E.g., thinking about ecologists, there are often unknown species, where the lowest known name is a family.

When comparing taxa, an "undefined" basidiomycete is different from an "undefined" ascomycete, but if the whole classification hierarchy is not known or not contained within the taxon class then this is less clear.

Right, they are different, but if the user doesn't supply the information, then it'd be hard for us to automatically pull that out. E.g., if they give "undefined basidiomycete", we could try to parse out a taxonomic name from it, but we'd need something more sophisticated than we currently have. They could give "undefined basidiomycete" as the name, then supply Basidiomycota as the phylum in another class.

sckott commented 8 years ago

use cases for the taxa classes:

binomen

Right now, binomen defines taxonomic classes AND has functions for manipulating those classes (combining, separating, sorting, etc.). With classes being defined in this pkg, we can remove taxonomic classes from binomen, and it will only have the functions to manipulate taxonomic classes

taxize

All get_*() functions that get IDs for one or more taxa currently give back a simple S3 class that's just the ID with some attributes. Instead, we could coerce to a taxa class and give that back. Then that class is a known thing that we can coerce to other things, like a taxmap class.
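A rough sketch of that coercion step, with loud assumptions: the `example_id` class and its `db` attribute are made-up stand-ins for "an ID with some attributes", not the actual structure returned by any taxize function, and `as_taxon()` is a hypothetical generic.

```r
# Stand-in for "a simple S3 class that's just the ID with some attributes".
# The class name and the "db" attribute are invented for this sketch.
raw_id <- structure("12345", class = "example_id", db = "db_a")

# Hypothetical generic that coerces such objects to a taxon object.
as_taxon <- function(x, ...) UseMethod("as_taxon")

as_taxon.example_id <- function(x, name = NA_character_, ...) {
  structure(
    list(
      name = name,
      id = as.character(unclass(x)),  # as.character() drops the attributes
      database = attr(x, "db")
    ),
    class = "taxon"
  )
}

tx <- as_taxon(raw_id, name = "Ascomycota")
```

Once the output is a known class, further coercions (for example to a taxmap object) only need to be written against that one class.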

sckott commented 8 years ago

@zachary-foster okay, pushed up some changes to the taxa classes - reinstall and see egs.

I have more notes i wrote down on paper for use cases and classes, will put those here

also, more to do:

zachary-foster commented 8 years ago

@sckott Nice, I will look at your changes now. In the meantime, here are my thoughts on your comments above.

We can't reasonably account for all these synonyms ourselves, so I guess we should assume the user or database has to supply the info.

Yes, I think it would be sufficient to have the name field accept multiple values and have any comparison functions take that into account. This could be useful for people in metagenomics who might start out with arbitrary names and identify them during an analysis (e.g. a taxon can be both "OTU1" and "bacillus"). It also could be used when combining different taxonomies, although automating that might not be possible.
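One way a multi-valued name field and a name-aware comparison might look, as a minimal sketch. The `names` field and `same_taxon()` helper are hypothetical, and the case-insensitive matching is an assumption of this example rather than something decided in the discussion.

```r
# Hypothetical taxon constructor whose name field accepts several values.
taxon <- function(names) {
  stopifnot(is.character(names), length(names) >= 1)
  structure(list(names = names), class = "taxon")
}

# Two taxa are treated as the same if they share at least one name.
# Ignoring case is an assumption made for this sketch.
same_taxon <- function(a, b) {
  length(intersect(tolower(a$names), tolower(b$names))) > 0
}

otu <- taxon(c("OTU1", "Bacillus"))

same_taxon(otu, taxon("bacillus"))     # TRUE
same_taxon(otu, taxon("Escherichia"))  # FALSE
```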

Right, and each database can have a somewhat different hierarchy, in each of which a taxon has a different ID. Perhaps a class for a single database reference and its taxonomic names, then another class that combines data from two or more databases.

Hmm, I think that adding another class specifically for multiple databases might complicate things more than it is worth since we might end up making a whole new set of manipulation functions for it. How about having the database_id be its own simple class and be able to add multiple database_id to taxon? If we add the database_id at the taxon level instead of the hierarchy level we can avoid the differing hierarchy problem. A hierarchy's ID can be just the database_id list of the tip taxon. If we do this, it might be good to have an analogous database_name class.
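The proposal above could be sketched as follows. The class name `database_id` comes from the discussion; everything else (field names, the `hierarchy_ids()` helper, the database labels and ID values) is made up for illustration.

```r
# A simple class pairing a database label with an ID in that database.
database_id <- function(database, id) {
  stopifnot(is.character(database), length(database) == 1, length(id) == 1)
  structure(list(database = database, id = as.character(id)),
            class = "database_id")
}

# A taxon can carry any number of database_id objects.
taxon <- function(name, ids = list()) {
  stopifnot(all(vapply(ids, inherits, logical(1), what = "database_id")))
  structure(list(name = name, ids = ids), class = "taxon")
}

hierarchy <- function(...) {
  structure(list(taxa = list(...)), class = "hierarchy")
}

# The hierarchy's IDs are taken to be those of its tip (last) taxon.
hierarchy_ids <- function(h) {
  tip <- h$taxa[[length(h$taxa)]]
  tip$ids
}

# Database labels and ID values below are placeholders, not real records.
fungi <- taxon("Fungi", list(database_id("db_a", "A-1")))
asco <- taxon("Ascomycota",
              list(database_id("db_a", "A-2"), database_id("db_b", "B-7")))
h <- hierarchy(fungi, asco)
tip_ids <- hierarchy_ids(h)
```

Because the IDs live on each taxon, two databases that disagree about the path to "Ascomycota" can still both be attached to the same taxon object.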

I think I was thinking of just taxonomy data in the taxon classes

That sounds fine.

Right, they are different, but if the user doesn't supply the information, then it'd be hard for us to automatically pull that out....

How about having a database_id with a value of NA for unknown taxa?
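A small sketch of what an NA-valued ID might imply for comparisons. The `same_id()` helper is hypothetical; the choice that two unknown IDs never match each other is an assumption of this example, made because two "undefined" taxa are not necessarily the same taxon.

```r
# database_id with a missing ID for unknown or placeholder taxa.
database_id <- function(database, id = NA_character_) {
  structure(list(database = database, id = as.character(id)),
            class = "database_id")
}

# Hypothetical comparison: a missing ID never matches anything,
# including another missing ID.
same_id <- function(a, b) {
  !is.na(a$id) && !is.na(b$id) &&
    identical(a$database, b$database) && identical(a$id, b$id)
}

unknown_a <- database_id("db_a")
unknown_b <- database_id("db_a")
known <- database_id("db_a", "A-1")  # placeholder ID value

same_id(unknown_a, unknown_b)  # FALSE
same_id(known, known)          # TRUE
```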

have plural versions of each class?

Hmm, not sure. A hierarchy is pretty much an ordered plural of taxon, but that is different than a list of taxon. Can you think of any information that would apply to a list of taxon or hierarchy but not a single object? A simple list of objects might be sufficient. But then again, we would probably want a custom print method for a list of taxon or hierarchy, which would require a plural class. We could have taxa and taxonomy for the plurals of taxon and hierarchy.

when there's IDs for every rank and name pair, not just the 1 target name, how to handle that

Have the IDs associated with the individual taxa and not the hierarchy as a whole?

sckott commented 8 years ago

Hmm, I think that adding another class specifically for multiple databases might complicate things more than it is worth since we might end up making a whole new set of manipulation functions for it. How about having the database_id be its own simple class and be able to add multiple database_id to taxon? If we add the database_id at the taxon level instead of the hierarchy level we can avoid the differing hierarchy problem. A hierarchy's ID can be just the database_id list of the tip taxon. If we do this, it might be good to have an analogous database_name class.

Sounds good to have a simple class for database_id and can add multiple to a taxon. I don't think a hierarchy itself needs an ID - all the taxa within it will have IDs - hierarchy does need metadata about which database it came from (the database_name class/string)

How about having a database_id with a value of NA for unknown taxa?

sounds good

Hmm, not sure. A hierarchy is pretty much an ordered plural of taxon, but that is different than a list of taxon. Can you think of any information that would apply to a list of taxon or hierarchy but not a single object? A simple list of objects might be sufficient. But then again, we would probably want a custom print method for a list of taxon or hierarchy, which would require a plural class. We could have taxa and taxonomy for the plurals of taxon and hierarchy.

Right, multiples of any class could simply be a list. But as you said, we could attach an S3 class to the list of many taxon objects, or whatever the class is, so we can make it easy to know what to do downstream (whereas if it's just a list, we have to do checks to make sure it's what we expect it to be).
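A minimal sketch of a plural class: a plain list of taxon objects with an S3 class attached, which is enough to get a custom print method and a cheap `inherits()` check downstream. The name `taxa` for the plural follows the suggestion in the thread; the fields and print layout are invented for illustration.

```r
taxon <- function(name, rank = NA_character_) {
  structure(list(name = name, rank = rank), class = "taxon")
}

# Plural of taxon: a list with a class attribute, validated on construction.
taxa <- function(...) {
  x <- list(...)
  stopifnot(all(vapply(x, inherits, logical(1), what = "taxon")))
  structure(x, class = "taxa")
}

# Custom print method, only possible because the list has its own class.
print.taxa <- function(x, ...) {
  cat("<taxa> with", length(x), "taxon object(s)\n")
  for (tx in unclass(x)) {
    cat("  ", tx$name, " (", tx$rank, ")\n", sep = "")
  }
  invisible(x)
}

pair <- taxa(taxon("Fungi", "kingdom"), taxon("Ascomycota", "phylum"))
print(pair)
```

Downstream code can then test `inherits(x, "taxa")` once instead of checking every element of an unclassed list.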

Have the IDs associated with the individual taxa and not the hierarchy as a whole?

sounds good

zachary-foster commented 8 years ago

I looked through the code you put up recently and it seems to be a good fit for classical species-based taxonomic data, the type you would find in ecological studies/surveys. The only concern I have is that it assumes that the user has, or is mostly interested in, species-level information (the name class). In my work (metagenomics), we often don’t have species information, although you have more experience than I do in what people generally want.

Also, taxonomic names are present in both the grouping and name classes, which confused me at first. If there was a function that returned a supertaxon/subtaxon from a taxon object I would expect the output to be another taxon object; it seems like the output would be a character from the grouping of that taxon object in this implementation?
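The expectation described here, that asking for a supertaxon yields another taxon object rather than a character string, could look roughly like this. It is a sketch only: it assumes a hierarchy that stores taxon objects root first, which differs from the grouping-based implementation under review, and `supertaxon()` is a hypothetical helper.

```r
taxon <- function(name, rank = NA_character_) {
  structure(list(name = name, rank = rank), class = "taxon")
}

hierarchy <- function(...) {
  structure(list(taxa = list(...)), class = "hierarchy")
}

# Return the taxon object one level above `name`, or NULL if `name`
# is not in the hierarchy or is already the root.
supertaxon <- function(h, name) {
  names_in_h <- vapply(h$taxa, function(tx) tx$name, character(1))
  i <- match(name, names_in_h)
  if (is.na(i) || i == 1) return(NULL)
  h$taxa[[i - 1]]
}

h <- hierarchy(
  taxon("Eukaryota", "domain"),
  taxon("Fungi", "kingdom"),
  taxon("Ascomycota", "phylum")
)

parent <- supertaxon(h, "Ascomycota")
class(parent)  # "taxon"
```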

sckott commented 8 years ago

In my work (metagenomics), we often don’t have species information

i assume you mean you could just have an ID, and no name at all, right?

sckott commented 8 years ago

If there was a function that returned a supertaxon/subtaxon from a taxon object I would expect the output to be another taxon object

right, makes sense

it seems like the output would be a character from the grouping of that taxon object in this implementation?

right

zachary-foster commented 8 years ago

i assume you mean you could just have an ID, and no name at all,

Partially, yes, that often happens. But what I meant was that sometimes a sequence can only be assigned to a coarse taxonomic rank (e.g., family or phylum) and the species or genus cannot be determined.

sckott commented 8 years ago

Ah right. I see what you mean. Do the changes in https://github.com/ropenscilabs/taxa/tree/taxa-class-rework account for this now?

zachary-foster commented 8 years ago

Yes, just being able to define hierarchies without species information does the trick.

I went through taxa-class-rework and it looks good. I like the print methods.

zachary-foster commented 7 years ago

We can probably close this too?

sckott commented 7 years ago

sounds good