Adapt `taxmap` utility functions to `taxonomy`

zachary-foster commented 8 years ago

The following functions used to get information from taxmap objects should be easily adaptable to taxonomy objects:

[x] supertaxa
[x] subtaxa
[x] roots

One question is, what should these return by default? Currently, they return either indexes or IDs, depending on the index option. Indexes are useful because they are the fastest way to access information and make a big difference for large datasets, but they don’t respond to changes to the taxonomy. IDs are useful because they are not affected by changes to the taxonomy and can be mapped to other objects/tables. Now that there are taxon objects, we have the option of returning taxon objects or a taxa object. So, I’m thinking of replacing the index option with something like return_type that takes the following values: "index", "id", "taxa", or "hierarchies". I have found being able to choose ID or indexes very useful when using these functions.

By the way @sckott, do you mind adding me as a collaborator for this repository so I can add issue tags and assign myself to things? I wanted to assign myself this issue for example and I dont think I can as it is.

sckott commented 8 years ago

do you mind adding me as a collaborator

you should be able to now

zachary-foster commented 8 years ago

Great, thanks!

zachary-foster commented 7 years ago

@sckott, I am trying to implement this and I am not sure the best way to go about it. I can think of the following options:

Make subtaxa, supertaxa etc. R6 methods of taxonomy. e.g. data$subtaxa()
Make subtaxa, supertaxa etc. S3 generics functions with a method for taxonomy. e.g. subtaxa(data). I assume this is possible?
Combine 1 and 2, by having the R6 method call the S3 method so both are available.
Combine 1 and 2, by having the S3 method call the R6 method so both are available.

Things to keep in mind:

taxmap will need the same functions. If we have taxmap inherit from taxonomy, option 1, 3, or 4 will handle this elegantly (i think?).
I am guessing that most people will feel more comfortable with subtaxa(data) than data$subtaxa()

What do you think? Should these functions even be in this package, or do you want to have them in an associated manipulation package. These seem like basic manipulations to me and I lean towards having them in this package.

sckott commented 7 years ago

So, I’m thinking of replacing the index option with something like return_type that takes the following values: "index", "id", "taxa", or "hierarchies"

I like and support this

Are you leaning one way or the other wrt S3 vs. R6?

What do you think? Should these functions even be in this package, or do you want to have them in an associated manipulation package. These seem like basic manipulations to me and I lean towards having them in this package.

I think they should be in here. remember, I want to merge this with binomen, so we have sort of equivalent functionality with taxmap and taxa classes

zachary-foster commented 7 years ago

Are you leaning one way or the other wrt S3 vs. R6?

I am thinking that option 3 is probably the best, but I don’t have much experience with R6 so it is hard for me to choose between 3 and 4. I think I will start implementing option 3 and see how that goes. In general, it might be nice to haves S3 wrappers for the most important R6 functions.

sckott commented 7 years ago

That sounds good to me. We can rework if we need to as we're still pre-CRAn

zachary-foster commented 7 years ago

@sckott, I tested option 3 by adapting supertaxa to Taxonomy on the issue-9 branch. I had problems getting the roxygen documentation to pass R CMD check. For example, I could not get the usage to show both ways of using the function. It seems to be working, but its a little awkward to document.

Im not sure that calling the S3 method from R6 would be better; in that case, we would not be able to use private methods of Taxonomy without passing the private list to the supertaxa S3 method, which would be confusing to users, since the private parameter would have to be documented (i think). I tried this as well in branch issue-9-alternate so we can compare the two.

Can you look at the issue-9 and issue-9-alternate branches and see which you like better? Once we pick a strategy, I will implement the rest of the functions.

sckott commented 7 years ago

@zachary-foster having a look. Curious, why did you remove all the tests?

zachary-foster commented 7 years ago

I did it temporarily to pass R CMD check since some of the changed I made broke quite a few of them. All the tests so far were for taxmap and will have to be mostly rewritten anyway once that class is reworked. I still have them in metacoder and the master branch and I will use them as a template when it comes time to make tests for the reworked taxmap.

And any examples of usage of supertaxa?

You can take the taxonomy example and use supertaxa like so:

...
...
x <- taxonomy(hier1, hier2, hier3)
x$supertaxa()
supertaxa(x)

I had not decided how best to provide examples yet. We probably should have an example taxonomy dataset for examples.

sckott commented 7 years ago

Okay on tests, sounds fine.

Thanks for e.g.

We probably should have an example taxonomy dataset for examples.

Yes.

sckott commented 7 years ago

I think I prefer the version on branch issue-9

zachary-foster commented 7 years ago

Thats the way I was leaning to. I will continue work on that branch then. Thanks for looking it over!

sckott commented 7 years ago

cool^3

sckott commented 7 years ago

@zachary-foster any progress on this?

zachary-foster commented 7 years ago

Not too much unfortunately. I ran into a difference in how the edge list is constructed using your algorithm vs mine, which made these functions harder to adapt that I expected and I have not had enough time to think about this enough. Basically, my functions assume there is one edge per taxon, even if there is no edge leading to it (They connect to NA in that case). This allows the edge list to be indexed the same as any attributes associated with taxa (as in the taxmap object). Its also allows for quick identification of roots by testing something like is.na(edgelist$from) rather than edgelist$to %in% edgelist$from, which is an all-vs-all comparison and slow on large trees. Finding roots is important for optimizing some of the algorithms that traverse the tree.

Sorry for the delay, I will try to get back to this soon. I have had a series of other things to work on that have been taking up all my time; I am a phd student currently and I have a lot of projects going on right now. Since we have written the publication for metacoder, my advisor no longer considers this a priority so all the work on taxa has been on my free time, which has become scarce lately.

sckott commented 7 years ago

Okay, thanks for the explanation. Let me know if I can help change anything to make this easier.

Totally understand about time constraints.

sckott commented 7 years ago

@zachary-foster if okay with you, i want to move forward with this pkg.

I can try to do the changes you were going to do, but if I can't quite understand, I'll see about putting those changes on hold until you can get around to them while still releasing a first version to CRAN that will easily allow the future changes.

zachary-foster commented 7 years ago

@sckott, sorry for the inactivity. I looked at this issue for a bit today and I think I have a solution to the edge list issue. I am trying it out now. I am ok with an adaptable CRAN release if we cant get this working soon.

sckott commented 7 years ago

okay, thanks @zachary-foster

zachary-foster commented 7 years ago

Hi @sckott, I think I have implemented a decent solution to the edge list problem. So the algorithm you had there worked in most cases, but did not differentiate taxa with the same name, but different lineages:

mammalia <- taxon(
  name = taxon_name("Mammalia"),
  rank = taxon_rank("class"),
  id = taxon_id(9681)
)
plantae <- taxon(
  name = taxon_name("Plantae"),
  rank = taxon_rank("kingdom"),
  id = taxon_id(33090)
)
unidentified <- taxon(
  name = taxon_name("unidentified"),
  rank = taxon_rank("species"),
  id = taxon_id(0)
)
unidentified_animal <- hierarchy(mammalia, unidentified)
unidentified_plant <- hierarchy(plantae, unidentified)
x <- taxonomy(unidentified_plant, unidentified_animal)

> x
<Taxonomy>
  3 taxa: 1. Plantae, 2. unidentified, 3. Mammalia
  2 edges: 1->2, 3->2

Note how the two "unidentified" taxa was combined into one. With the new version the output is:

> x
<Taxonomy>
  4 taxa: 1. Plantae, 2. Mammalia, 3. unidentified, 4. unidentified
  4 edges: NA->1, NA->2, 1->3, 2->4

Oops, posted comment before done writing... one sec...

Also, note the NA as the root. This makes it so there is one edge per unique taxon, so they can be indexed the same. I found this makes certain operations much faster. We might want to remove the NA from the print output though?

I also implemented the subtaxa function and made supertaxa and roots work right. You can test them out by using the code in "test--taxonomy.R" in the tests folder.

Next steps:

[x] make real tests (just code to play with in there now, no "expect_*" calls)
[ ] add examples in docs
[x] replace the index option with something like return_type that takes the following values: "index", "id", "taxa", or "hierarchies"

sckott commented 7 years ago

thanks for your work on this. looks good to me. I ran through the tests, looks good.

We might want to remove the NA from the print output though?

i guess we'd need something to replace that though, what would it be?

I can help with adding examples if you like

zachary-foster commented 7 years ago

i guess we'd need something to replace that though, what would it be?

Im not sure. If we put something like "(root)" it could confuse people into thinking that there is a taxon called "(root)". Maybe is makes sense how it is.

Another bit of weirdness to consider is that the "index" for the return_type is the index of the edge list, not the taxon list and they do not need to be in the same order. This is unintuitive, but is more useful since most computationally-intensive operations work with the edgelist, not the taxon list.

I can help with adding examples if you like

Sure, that would be great. I am not sure how to best handle the examples since it takes a lot of setup code.

sckott commented 7 years ago

I think sticking with NA is good (we can revisit if needed)

For examples, we can have some objects that are loaded with the package that are the things that you'd need to do examples, so that we don't have to include all the code for that setup. And we can point to man file(s) for the data object for how to create that object (data need to have man files anyway, so that works out)

zachary-foster commented 7 years ago

That sounds like a good plan to me

sckott commented 7 years ago

@zachary-foster okay if I close this? i think the only thing left here is to add examples

sckott commented 7 years ago

@zachary-foster ^^

zachary-foster commented 7 years ago

@sckott oh yea, you can close it.

sckott commented 7 years ago

thx

ropensci / taxa

Adapt `taxmap` utility functions to `taxonomy` #9