grunwaldlab / metacoder

Parsing, Manipulation, and Visualization of Metabarcoding/Taxonomic data
http://grunwaldlab.github.io/metacoder_documentation

Metacoder observation table names, calculation functions, and dplyr style additions to the package? #254

Open grabear opened 5 years ago

grabear commented 5 years ago

Introduction

I'm not sure if this is relevant for all of your workflows, but I was thinking about the naming conventions used in metacoder and phyloseq.

For the project I'm working on, we decided to go with a different naming convention for the observation data (aka names(mc_obj$data)). Our project imports the data using phyloseq and then converts it with parse_phyloseq, so the context may be limited to this function:

# The metacoder_obj$data will contain the following tables.
# The keys here are the old phyloseq table names, and the values are the new table names
...
otu_table: "otu_abundance"
tax_data: "otu_annotations"
sample_data: "sample_data"
phy_tree: "phy_tree"
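A minimal sketch of how such a renaming could be applied, assuming the tables live in a plain named list (as obj$data does). The helper name rename_tables_sketch and the placeholder table contents are hypothetical, not part of metacoder:

```r
# Old phyloseq-style names mapped to the proposed names (taken from the list above).
rename_map <- c(otu_table = "otu_abundance", tax_data = "otu_annotations")

# Hypothetical helper: rename list elements that appear in `map`, keep the rest.
rename_tables_sketch <- function(tables, map) {
  hits <- names(tables) %in% names(map)
  names(tables)[hits] <- unname(map[names(tables)[hits]])
  tables
}

# Placeholder tables standing in for names(mc_obj$data).
tables <- list(otu_table = 1, tax_data = 2, sample_data = 3, phy_tree = 4)
renamed <- rename_tables_sketch(tables, rename_map)
names(renamed)
```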

(Note: The "otu_annotations" table makes more sense with respect to https://github.com/grunwaldlab/metacoder/pull/253.)

After calculation, I decided to use these other naming conventions to match what we had above, and they seem to be more intuitive.

# These new tables are created by metacoder's calculation functions
...
calc_taxon_abund(otu_abundance): "taxa_abundance"
calc_obs_props(otu_abundance): "otu_proportions"
calc_obs_props(otu_proportions): "taxa_proportions"
...

Questions

1. Do you like these ideas?

2. Would you consider manipulating the way the calc_* functions work?

3. Would you consider adding functionality to the calc_* functions so that they generate default observation table names based on verified data types (e.g. phyloseq)?

...
# if data = "otu_abund", then table_name = "taxa_abund"
calc_taxon_abund(otu_abund): "taxa_abund"
# if data = "otu_abund", then table_name = "otu_prop"
calc_obs_props(otu_abund): "otu_prop"
# if data = "otu_prop", then table_name = "taxa_prop"
calc_obs_props(otu_prop): "taxa_prop"
...
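One way this lookup could work is sketched below, assuming a simple table of known (function, input table) pairs. The names default_names and default_table_name are hypothetical; the mappings are copied from the pseudocode above:

```r
# Known input-table names per calc_* function, with the proposed default output name.
default_names <- list(
  calc_taxon_abund = c(otu_abund = "taxa_abund"),
  calc_obs_props   = c(otu_abund = "otu_prop", otu_prop = "taxa_prop")
)

# Hypothetical helper: return the default output name, or NULL when the
# input table is not a recognized name (the caller must then supply one).
default_table_name <- function(func_name, data) {
  known <- default_names[[func_name]]
  if (is.null(known) || !data %in% names(known)) {
    return(NULL)
  }
  known[[data]]
}

default_table_name("calc_taxon_abund", "otu_abund")
default_table_name("calc_obs_props", "my_custom_table")
```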

Final

I can work on most of these items on my fork.

zachary-foster commented 5 years ago

Thanks @grabear, I will try to look at this more closely this weekend and get back to you.

grabear commented 5 years ago

Follow Up

I'm working on some of this in a private repository for one of my projects. Maybe when we finish the package I can present to you what I've done and we can work from there?

The package will also include a workflow that uses otu_ids (https://github.com/grunwaldlab/metacoder/pull/253) and the correlation plots and agglomeration function from https://github.com/grunwaldlab/metacoder/issues/234, and it adds other phyloseq-style data filtering functions for metacoder objects:

I've also created some functions based on my suggestions above:

New Items

4. Would you consider letting me add some calc_*/dplyr-style functions that take a function as a parameter and allow you to transform the table based on that function?

5. Would you consider adding obj$otu_id() or obj$alt_id() to taxa::all_names()?

zachary-foster commented 5 years ago

1) renaming parse_phyloseq output

"otu_table", "sample_data", "tax_table", and "phy_tree" were chosen because those are the names used in a phyloseq object. I agree that "otu_abund" is better than "otu_table", but the rest seem fine as they are. I don't use phyloseq too much, but I assumed people would be less confused if the names stayed the same?

2) Would you consider manipulating the way the calc_* functions work?

Yea, the obj$data$my_table <- thing gets old. Sure, calc_*(..., out = "my_table") sounds good. It still should return the created table though, but with invisible(), so you don't see it printed on the screen, since I would like the return type to be consistent.
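A rough sketch of that behavior, using a plain environment as a stand-in for the object so the table can be added in place. Everything here (make_obj, calc_props_sketch, the `cols` argument) is hypothetical and is not how metacoder implements its calc_* functions:

```r
# Hypothetical stand-in for a taxmap-like object: environments have
# reference semantics, so a function can add a table without reassignment.
make_obj <- function(tables) {
  obj <- new.env()
  obj$data <- tables
  obj
}

# Hypothetical calc_*-style function: converts counts to per-column
# proportions, optionally stores the result under `out`, always returns it.
calc_props_sketch <- function(obj, data, cols, out = NULL) {
  result <- obj$data[[data]]
  result[cols] <- lapply(result[cols], function(x) x / sum(x))
  if (is.null(out)) {
    return(result)           # default: return visibly, store nothing
  }
  obj$data[[out]] <- result  # store under the user-chosen name
  invisible(result)          # same return type, but not auto-printed
}

obj <- make_obj(list(otu_abund = data.frame(
  otu_id = c("a", "b"), s1 = c(1, 3), s2 = c(2, 2)
)))
calc_props_sketch(obj, "otu_abund", cols = c("s1", "s2"), out = "otu_prop")
names(obj$data)
```

With `out` left as NULL the call behaves like the current functions: nothing is stored and the table is returned visibly.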

3)

This relates to 2 as well. If we added an option like that above, then there could not be a default for table name, since the default would be to not add the table, but return it, like is currently done. I kind of like forcing the user to come up with their own name in this case. If we added an R6 method, there could be a default table name:

obj$calc_*()

taxa has R6 variants of all its functions for modifying data without needing to use the returned value, but I think that might confuse the average R user. See https://adv-r.hadley.nz/r6.html#adding-methods-after-creation. If there was a default table name, I would like it to be either always the same or add a consistent suffix to the input table name. Ideally the user should be choosing the name anyway, so they make something that makes sense to them.
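To illustrate the "consistent suffix" idea, here is a minimal sketch of a method-style variant. It uses a plain environment with a closure rather than a real R6 class, and new_obj_sketch, the `cols` argument, and the "_props" suffix are all assumptions made for the example:

```r
# Hypothetical object with one method; not R6 and not the taxa implementation.
new_obj_sketch <- function(tables) {
  self <- new.env()
  self$data <- tables
  # Default output name = input table name plus a fixed suffix.
  self$calc_props <- function(data, cols, out = paste0(data, "_props")) {
    tab <- self$data[[data]]
    tab[cols] <- lapply(tab[cols], function(x) x / sum(x))
    self$data[[out]] <- tab
    invisible(self)  # return the object itself so calls can be chained
  }
  self
}

obj <- new_obj_sketch(list(otu_abund = data.frame(
  otu_id = c("a", "b"), s1 = c(1, 3), s2 = c(2, 2)
)))
obj$calc_props("otu_abund", cols = c("s1", "s2"))                   # default name
obj$calc_props("otu_abund", cols = c("s1", "s2"), out = "my_props") # user's name
names(obj$data)
```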

I'm working on some of this in a private repository for one of my projects. Maybe when we finish the package I can present to you what I've done and we can work from there?

You have them in another R package? That sounds good. Yea, once you have them ready, let me know and I will look at them. We can then either add them to metacoder, or leave them in another package.

I've also created some functions based on my suggestions above:

Those sound useful, but perhaps too workflow-specific. I am trying to keep metacoder like a tool kit and not hold the user's hand too much. Perhaps those and other such functions could go in a microbiomeWorkflows package that focuses on quickly making workflows using metacoder and phyloseq for microbiome projects? I would be interested in helping with something like that, but I probably won't make it myself any time soon, since I am more interested in making tools than workflows.

4) metacoder::calc_stat?

Turns out I already have a function called metacoder::calc_group_stat, which does per-row calculations with optional grouping, given a function as an argument. I do not have a function to do per-column transformations, something like calc_obs_props but more abstract, taking a function as an argument. How about this:
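A minimal sketch of what a per-column transformation taking a function could look like, assuming a plain data frame as the table. The helper transform_cols_sketch is hypothetical; it is not an existing metacoder function and not necessarily the interface proposed in this thread:

```r
# Hypothetical helper: apply `func` to each selected column and return the
# table with those columns replaced. Extra arguments are passed on to `func`.
transform_cols_sketch <- function(tab, cols, func, ...) {
  tab[cols] <- lapply(tab[cols], func, ...)
  tab
}

otu_abund <- data.frame(otu_id = c("a", "b"), s1 = c(1, 3), s2 = c(2, 2))

# Proportions per sample, similar in spirit to calc_obs_props.
otu_prop <- transform_cols_sketch(otu_abund, c("s1", "s2"), function(x) x / sum(x))

# Any other column-wise function works the same way, e.g. a log transform.
otu_log <- transform_cols_sketch(otu_abund, c("s1", "s2"), log1p)
```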

5)

You should be able to get that info from all_names already, if those are columns in a table. All column names in all tables are in all_names(). Does this work for you or is there a bug somewhere?

Thanks!