Open grabear opened 5 years ago
Thanks @grabear, I will try to look at this more closely this weekend and get back to you.
I'm working on some of this in a private repository for one of my projects. Maybe when we finish the package I can present to you what I've done and we can work from there?
The package will also include a workflow that utilizes otu_ids (https://github.com/grunwaldlab/metacoder/pull/253), correlation plots and an agglomeration function from https://github.com/grunwaldlab/metacoder/issues/234, and adds other phyloseq-style data filtering functions for metacoder objects:
I've also created some functions based on my suggestions above:
format_metacoder_object: renames tables and creates the proportions and taxa tables
validate_metacoder_object: validates that the metacoder object has been formatted
phyloseq::transform_sample_counts, phyloseq::filter_taxa, and metacoder::calc_obs_props, but instead taxa::transform_obs or metacoder::calc_obs_trans that allows you to transform rowwise (OTUs/taxon_id) or columnwise (samples)
metacoder::calc_stat for rowwise/columnwise calculations.
obj$otu_id() or obj$alt_id() to taxa::all_names()?
parse_phyloseq output
"otu_table", "sample_data", "tax_table", and "phy_tree" were chosen because those are the names used in a phyloseq object. I agree that "otu_abund" is better than "otu_table", but the rest seem fine as they are. I don't use phyloseq too much, but I assumed people would be less confused if the names stayed the same?
Yea, the obj$data$my_table <- thing gets old. Sure, calc_*(..., out = "my_table") sounds good. It still should return the created table though, but with invisible(), so you don't see it printed on the screen, since I would like the return type to be consistent.
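The `out` argument discussed here could look roughly like the sketch below. It uses base R only and is not metacoder's actual implementation: `calc_props_sketch` is a hypothetical stand-in for a `calc_*` function, and a plain environment stands in for the R6 object (an assumption) so that a table added inside the function is visible to the caller.

```r
# Hypothetical stand-in for a calc_* function; not part of metacoder.
# `obj` is an environment here (an assumption) so that, like an R6 object,
# it is modified in place when a table is added.
calc_props_sketch <- function(obj, data, out = NULL) {
  counts <- obj$data[[data]]
  # Column-wise transformation: each sample's counts become proportions.
  props <- sweep(counts, 2, colSums(counts), "/")
  if (is.null(out)) {
    return(props)           # current behaviour: just return the table
  }
  obj$data[[out]] <- props  # proposed behaviour: store under the user's name
  invisible(props)          # still return it, but without printing
}

# Toy object with one abundance table (2 OTUs x 2 samples).
obj <- new.env()
obj$data <- list(
  otu_abund = matrix(c(1, 3, 2, 2), nrow = 2,
                     dimnames = list(c("otu_1", "otu_2"), c("s1", "s2")))
)

# Stores the result as obj$data$otu_props and prints nothing.
calc_props_sketch(obj, data = "otu_abund", out = "otu_props")
names(obj$data)
```

Because the table is returned in both cases, `props <- calc_props_sketch(obj, "otu_abund")` keeps working unchanged; only the printing and the side effect differ when `out` is supplied.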
This relates to 2 as well. If we added an option like that above, then there could not be a default for table name, since the default would be to not add the table, but return it, like is currently done. I kind of like forcing the user to come up with their own name in this case. If we added an R6 method, there could be a default table name:
obj$calc_*()
taxa has R6 variants of all its functions for modifying data without needing to use the returned value, but I think that might confuse the average R user. See https://adv-r.hadley.nz/r6.html#adding-methods-after-creation. If there was a default table name, I would like it to be either always the same or add a consistent suffix to the input table name. Ideally the user should be choosing the name anyway, so they make something that makes sense to them.
> I'm working on some of this in a private repository for one of my projects. Maybe when we finish the package I can present to you what I've done and we can work from there?
You have them in another R package? That sounds good. Yea, once you have them ready, let me know and I will look at them. We can then either add them to metacoder, or leave them in another package.
> I've also created some functions based on my suggestions above:
Those sound useful, but perhaps too workflow-specific. I am trying to keep metacoder like a toolkit and not hold the user's hand too much.
Perhaps those and other such functions could go in a microbiomeWorkflows package that focuses on quickly making workflows using metacoder and phyloseq for microbiome projects?
I would be interested in helping with something like that, but I probably won't make it myself any time soon, since I am more interested in making tools than workflows.
metacoder::calc_stat?
Turns out I already have a function called metacoder::calc_group_stat, which does per-row calculations with optional grouping, given a function as an argument. I do not have a function to do per-column transformations, like calc_obs_props but more abstracted and taking a function as an argument. How about this:
calc_group_stat: operates on rows, possibly grouped by column attributes. The user-supplied function takes multiple values from a single row and returns a single value. Example usage: sum OTU counts by sample type.
rbinded together with a column identifying the source OTU table (grouped by OTU table source).
You should be able to get that info from all_names already, if those are columns in a table. All column names in all tables are in all_names(). Does this work for you or is there a bug somewhere?
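The row-wise, grouped calculation described above ("sum OTU counts by sample type") can be sketched in base R as follows. `row_group_stat_sketch` and the toy data are hypothetical and only illustrate the idea; this is not the code or signature of metacoder::calc_group_stat.

```r
# Hypothetical helper illustrating a row-wise, group-aware calculation.
# It is not metacoder's calc_group_stat, only a base R sketch of the idea.
row_group_stat_sketch <- function(counts, groups, func) {
  # Column indices for each group, e.g. all samples of one sample type.
  cols_by_group <- split(seq_len(ncol(counts)), groups)
  # For every group, reduce each row's values in that group to one number.
  per_group <- lapply(cols_by_group, function(cols) {
    apply(counts[, cols, drop = FALSE], 1, func)
  })
  result <- do.call(cbind, per_group)
  rownames(result) <- rownames(counts)
  result
}

# Toy OTU table: 2 OTUs (rows) x 4 samples (columns).
counts <- matrix(1:8, nrow = 2,
                 dimnames = list(c("otu_1", "otu_2"),
                                 c("s1", "s2", "s3", "s4")))
sample_type <- c("leaf", "leaf", "root", "root")

# One row per OTU, one column per sample type.
type_sums <- row_group_stat_sketch(counts, sample_type, sum)
```

The user-supplied function receives the values of one row within one group and returns a single value, so `mean` or `max` can be passed in place of `sum` without changing the helper.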
Thanks!
Introduction
I'm not sure if this is relevant for all of your workflows, but I was thinking about the naming conventions used in metacoder and phyloseq.
For the project I'm working on, we decided to go with a different naming convention for the observation data (aka names(mc_obj$data)). Our project imports the data using phyloseq and then converts it with parse_phyloseq, so the context may be limited to this function:
(Note: The "otuannotations" table makes more sense with respect to https://github.com/grunwaldlab/metacoder/pull/253)
After calculation, I decided to use these other naming conventions to match what we had above, and they seem to be more intuitive.
Questions
1. Do you like these ideas?
2. Would you consider manipulating the way the calc_* functions work? I don't want to change the underlying functionality, I just want to add an additional way to direct the output:
3. Would you consider adding functionality to the calc_* functions so that they generate default observation table names based on verified data types (e.g. phyloseq)?
parse_phyloseq in the short term.
Final
I can work on most of these items on my fork.