Closed zachary-foster closed 7 years ago
do you mind adding me as a collaborator
you should be able to now
Great, thanks!
@sckott, I am trying to implement this and I am not sure the best way to go about it. I can think of the following options:
subtaxa
, supertaxa
etc. R6 methods of taxonomy
. e.g. data$subtaxa()
subtaxa
, supertaxa
etc. S3 generics functions with a method for taxonomy
. e.g. subtaxa(data)
. I assume this is possible? Things to keep in mind:
taxmap
will need the same functions. If we have taxmap
inherit from taxonomy
, option 1, 3, or 4 will handle this elegantly (i think?).subtaxa(data)
than data$subtaxa()
What do you think? Should these functions even be in this package, or do you want to have them in an associated manipulation package. These seem like basic manipulations to me and I lean towards having them in this package.
So, I’m thinking of replacing the index option with something like return_type that takes the following values: "index", "id", "taxa", or "hierarchies"
I like and support this
Are you leaning one way or the other wrt S3 vs. R6?
What do you think? Should these functions even be in this package, or do you want to have them in an associated manipulation package. These seem like basic manipulations to me and I lean towards having them in this package.
I think they should be in here. remember, I want to merge this with binomen, so we have sort of equivalent functionality with taxmap and taxa classes
Are you leaning one way or the other wrt S3 vs. R6?
I am thinking that option 3 is probably the best, but I don’t have much experience with R6 so it is hard for me to choose between 3 and 4. I think I will start implementing option 3 and see how that goes. In general, it might be nice to haves S3 wrappers for the most important R6 functions.
That sounds good to me. We can rework if we need to as we're still pre-CRAn
@sckott, I tested option 3 by adapting supertaxa
to Taxonomy
on the issue-9
branch. I had problems getting the roxygen documentation to pass R CMD check. For example, I could not get the usage to show both ways of using the function. It seems to be working, but its a little awkward to document.
Im not sure that calling the S3 method from R6 would be better; in that case, we would not be able to use private methods of Taxonomy
without passing the private
list to the supertaxa
S3 method, which would be confusing to users, since the private
parameter would have to be documented (i think). I tried this as well in branch issue-9-alternate
so we can compare the two.
Can you look at the issue-9
and issue-9-alternate
branches and see which you like better? Once we pick a strategy, I will implement the rest of the functions.
@zachary-foster having a look. Curious, why did you remove all the tests?
I did it temporarily to pass R CMD check since some of the changed I made broke quite a few of them. All the tests so far were for taxmap
and will have to be mostly rewritten anyway once that class is reworked. I still have them in metacoder
and the master branch and I will use them as a template when it comes time to make tests for the reworked taxmap
.
And any examples of usage of supertaxa?
You can take the taxonomy
example and use supertaxa
like so:
...
...
x <- taxonomy(hier1, hier2, hier3)
x$supertaxa()
supertaxa(x)
I had not decided how best to provide examples yet. We probably should have an example taxonomy
dataset for examples.
Okay on tests, sounds fine.
Thanks for e.g.
We probably should have an example taxonomy dataset for examples.
Yes.
I think I prefer the version on branch issue-9
Thats the way I was leaning to. I will continue work on that branch then. Thanks for looking it over!
cool^3
@zachary-foster any progress on this?
Not too much unfortunately. I ran into a difference in how the edge list is constructed using your algorithm vs mine, which made these functions harder to adapt that I expected and I have not had enough time to think about this enough. Basically, my functions assume there is one edge per taxon, even if there is no edge leading to it (They connect to NA
in that case). This allows the edge list to be indexed the same as any attributes associated with taxa (as in the taxmap object). Its also allows for quick identification of roots by testing something like is.na(edgelist$from)
rather than edgelist$to %in% edgelist$from
, which is an all-vs-all comparison and slow on large trees. Finding roots is important for optimizing some of the algorithms that traverse the tree.
Sorry for the delay, I will try to get back to this soon. I have had a series of other things to work on that have been taking up all my time; I am a phd student currently and I have a lot of projects going on right now. Since we have written the publication for metacoder, my advisor no longer considers this a priority so all the work on taxa
has been on my free time, which has become scarce lately.
Okay, thanks for the explanation. Let me know if I can help change anything to make this easier.
Totally understand about time constraints.
@zachary-foster if okay with you, i want to move forward with this pkg.
I can try to do the changes you were going to do, but if I can't quite understand, I'll see about putting those changes on hold until you can get around to them while still releasing a first version to CRAN that will easily allow the future changes.
@sckott, sorry for the inactivity. I looked at this issue for a bit today and I think I have a solution to the edge list issue. I am trying it out now. I am ok with an adaptable CRAN release if we cant get this working soon.
okay, thanks @zachary-foster
Hi @sckott, I think I have implemented a decent solution to the edge list problem. So the algorithm you had there worked in most cases, but did not differentiate taxa with the same name, but different lineages:
mammalia <- taxon(
name = taxon_name("Mammalia"),
rank = taxon_rank("class"),
id = taxon_id(9681)
)
plantae <- taxon(
name = taxon_name("Plantae"),
rank = taxon_rank("kingdom"),
id = taxon_id(33090)
)
unidentified <- taxon(
name = taxon_name("unidentified"),
rank = taxon_rank("species"),
id = taxon_id(0)
)
unidentified_animal <- hierarchy(mammalia, unidentified)
unidentified_plant <- hierarchy(plantae, unidentified)
x <- taxonomy(unidentified_plant, unidentified_animal)
> x
<Taxonomy>
3 taxa: 1. Plantae, 2. unidentified, 3. Mammalia
2 edges: 1->2, 3->2
Note how the two "unidentified" taxa was combined into one. With the new version the output is:
> x
<Taxonomy>
4 taxa: 1. Plantae, 2. Mammalia, 3. unidentified, 4. unidentified
4 edges: NA->1, NA->2, 1->3, 2->4
Oops, posted comment before done writing... one sec...
Also, note the NA
as the root. This makes it so there is one edge per unique taxon, so they can be indexed the same. I found this makes certain operations much faster. We might want to remove the NA
from the print output though?
I also implemented the subtaxa
function and made supertaxa
and roots
work right. You can test them out by using the code in "test--taxonomy.R" in the tests folder.
Next steps:
index
option with something like return_type
that takes the following values: "index", "id", "taxa", or "hierarchies"thanks for your work on this. looks good to me. I ran through the tests, looks good.
We might want to remove the NA from the print output though?
i guess we'd need something to replace that though, what would it be?
I can help with adding examples if you like
i guess we'd need something to replace that though, what would it be?
Im not sure. If we put something like "(root)" it could confuse people into thinking that there is a taxon called "(root)". Maybe is makes sense how it is.
Another bit of weirdness to consider is that the "index" for the return_type
is the index of the edge list, not the taxon list and they do not need to be in the same order. This is unintuitive, but is more useful since most computationally-intensive operations work with the edgelist, not the taxon list.
I can help with adding examples if you like
Sure, that would be great. I am not sure how to best handle the examples since it takes a lot of setup code.
I think sticking with NA
is good (we can revisit if needed)
For examples, we can have some objects that are loaded with the package that are the things that you'd need to do examples, so that we don't have to include all the code for that setup. And we can point to man file(s) for the data object for how to create that object (data need to have man files anyway, so that works out)
That sounds like a good plan to me
@zachary-foster okay if I close this? i think the only thing left here is to add examples
@zachary-foster ^^
@sckott oh yea, you can close it.
thx
The following functions used to get information from
taxmap
objects should be easily adaptable totaxonomy
objects:One question is, what should these return by default? Currently, they return either indexes or IDs, depending on the
index
option. Indexes are useful because they are the fastest way to access information and make a big difference for large datasets, but they don’t respond to changes to the taxonomy. IDs are useful because they are not affected by changes to the taxonomy and can be mapped to other objects/tables. Now that there aretaxon
objects, we have the option of returningtaxon
objects or ataxa
object. So, I’m thinking of replacing theindex
option with something likereturn_type
that takes the following values:"index"
,"id"
,"taxa"
, or"hierarchies"
. I have found being able to choose ID or indexes very useful when using these functions.By the way @sckott, do you mind adding me as a collaborator for this repository so I can add issue tags and assign myself to things? I wanted to assign myself this issue for example and I dont think I can as it is.