qiime2 / q2-diversity

BSD 3-Clause "New" or "Revised" License
4 stars 45 forks source link

ENH: Route all diversity calculations through diversity-lib #281

Closed ChrisKeefe closed 4 years ago

ChrisKeefe commented 4 years ago

Closes #280. See also qiime2/q2-diversity-lib#21

Closes #284.

Note: Renames "weighted_unifrac" to "weighted_unnormalized_unifrac", aligning q2-diversity with q2-diversity-lib and the underlying unifrac package. This is a breaking API change.

ChrisKeefe commented 4 years ago

This tests clean locally, but tests will continue to fail here until https://github.com/qiime2/q2-diversity-lib/pull/21 is merged.

ChrisKeefe commented 4 years ago

This PR is ready for a preliminary review - does anything here need major course correction?

Only beta_phylogenetic has been converted into a pipeline at this time. I'll continue transitioning other Actions on a separate branch while this is in review, so this doesn't become a moving target.

ChrisKeefe commented 4 years ago

Edit: After beginning work on dispatch for nonphylogenetic beta diversity measures, I've decided to move dispatch from diversity-lib to diversity, preventing some gross circular dependency action. Removed this from In Review until that's wrapped up.

Cancel that. After a light informal review and discussion with @thermokarst , I'm going to carry on porting all diversity calculations over to diversity-lib, and handling dispatch there.

thermokarst commented 4 years ago

Executive Summary

Thanks @ChrisKeefe, these q2-diversity/q2-diversity-lib changes are looking great! After spending some time going through these PRs, I have a handful of thoughts, which I will try to outline here.

TLDR: I think there is at least one too many levels of indirection here, and with a little restructuring this code can be simplified conceptually and structurally.

Overall structuring

The role of the q2-diversity-lib plugin is to serve as library code for q2-diversity. The q2-diversity-lib code could just as easily live inside of q2-diversity itself, but we chose to factor it out as a plugin for a few reasons, one of them being just for the sake of giving you an opportunity to make a plugin from scratch. With that in mind, I think it is important to emphasize that the use of this q2-diversity-lib library code does not need to be minimized in q2-diversity, in fact, I think that minimizing/masking q2-diversity-lib invocations in q2-diversity is problematic.

If we emphasize that the actions available in q2-diversity-lib are meant to compose together into bigger pipelines in q2-diversity, we get the best of both worlds - compartmentalization of code (in q2-diversity-lib) and easy to follow pipelines (in q2-diversity).

With that in mind, generally speaking:

q2-diversity-lib

Mappings

The proliferation of data-lookups-as-utility-functions-but-only-sometimes is a bit confusing:

Instead of this, I propose one coherent data structure that encodes all of this information in one pass:

MAPPINGS = {
    'PHYLO': {
        'IMPL': {'faith_pd'},
        'UNIMPL': set(),
    },
    'NONPHYLO': {
        'IMPL': {'observed_features', 'pielou_e', 'shannon'},
        'UNIMPL': {'ace', 'chao1', 'chao1_ci', 'berger_parker_d',
                   'brillouin_d', 'dominance', 'doubles', 'enspie', 'esty_ci',
                   'fisher_alpha', 'goods_coverage', 'heip_e',
                   'kempton_taylor_q', 'margalef', 'mcintosh_d', 'mcintosh_e',
                   'menhinick', 'michaelis_menten_fit', 'osd', 'robbins',
                   'simpson', 'simpson_e', 'singles', 'strong', 'gini_index',
                   'lladser_pe', 'lladser_ci'},
    },
}

This lends itself well to "natural reading", especially when chaining together:

metrics = MAPPINGS['NONPHYLO']['IMPL'] | MAPPINGS['NONPHYLO']['UNIMPL']
implemented_metrics = MAPPINGS['NONPHYLO']['IMPL']

Also, this has the advantage of only exporting one object, rather than a half dozen loosely related ones.

You also might've noticed that there aren't dictionaries in the tips, I don't think you'll need to include those dicts, and hopefully you'll see why, elsewhere in this review. Right now I am envisioning one MAPPINGS in the alpha module, and one in beta.

Also, I'm open to other names (METRICS, for example), and different layouts of the data, this is just to get you started with some ideas.

Translations

I like the idea of including translations, and I mention it here as a subsection of mappings, but only because I see it as a loosely related idea. I think you will be able to move any translation functionality into the q2-diversity Pipelines, discussed later in this review.

Methods

I think the following QIIME 2 Methods should be exposed in this plugin (will reference the path to the view function, for clarity):

You'll notice there is no alpha_phylogenetic_passthrough Method, only because there are no unimplemented alpha phylogenetic metrics in q2-diversity.

alpha_passthrough

This method will handle just the MAPPINGS['NONPHYLO']['UNIMPL'] metrics, and might look something like this:

@_disallow_empty_tables
def alpha_passthrough(table: biom.Table, metric: str) -> pd.Series:
    counts = table.matrix_data.toarray().astype(int).T
    sample_ids = table.ids(axis='sample')

    result = skbio.diversity.alpha_diversity(metric=metric, counts=counts,
                                             ids=sample_ids)
    result.name = metric
    return result

beta_passthrough

This method will handle just the MAPPINGS['NONPHYLO']['UNIMPL'] metrics.

Same idea as alpha_passthrough above - defer to skbio, don't worry about fancy signatures, citations, etc. Basically just moving the beta method over from q2-diversity.

beta_phylogenetic_passthrough

This method will handle just the MAPPINGS['PHYLO']['UNIMPL'] metrics.

Same idea as alpha_passthrough above - defer to skbio, don't worry about fancy signatures, citations, etc. Basically just moving the beta_phylogenetic method over from q2-diversity.

Pipelines

There will be none in this plugin. That means removing:

The bodies of these pipelines will be moved to q2-diversity.

Pipelines as Python Functions

There will be none in this plugin. That means removing:

Visualizers

There will be none in this plugin.

q2-diversity

Methods

Several existing methods will be converted to Pipelines (see below).

The following Methods will be removed:

Pipelines

New pipelines (former Methods):

These pipelines can all basically have the same implementations as the *_dispatch pipelines that you added to q2-diversity-lib.

alpha

This is more-or-less a copy-paste from your current alpha_dispatch pipeline:

def alpha(ctx, table, metric, drop_undefined_samples=None):
    metrics = MAPPINGS['NONPHYLO']['IMPL'] | MAPPINGS['NONPHYLO']['UNIMPL']
    implemented_metrics = MAPPINGS['NONPHYLO']['IMPL']

    if metric not in metrics:
        raise ValueError("Unknown metric: %s" % metric)

    metric = ALPHA_TRANSLATION[metric]

    if metric in implemented_metrics:
        func = ctx.get_action('diversity_lib', metric)
        if 'drop_undefined_samples' in signature(func).parameters:
            func = partial(func, table=table,
                           drop_undefined_samples=drop_undefined_samples)
        else:
            if drop_undefined_samples:
                warnings.warn(f"The {metric} metric does not support dropping "
                              "undefined samples.")
            func = partial(func, table=table)
    else:
        func = ctx.get_action('diversity_lib', 'alpha_passthrough')
        func = partial(func, table=table, metric=metric)

    result = func()
    return tuple(result)

Pipelines as Python Functions

The following Pipelines shall be used as QIIME 2 Pipelines, as well as vanilla Python code:

This can be handled by building a dependency-injection mock, to fill in the ctx argument:

class PyContext:
    def __init__(self):
        self.mappings = {
            'diversity_lib': {
                'faith_pd': q2_diversity_lib.alpha.faith_pd,
                'observed_features': q2_diversity_lib.alpha.observed_features,
                'pielou_e': q2_diversity_lib.alpha.pielou_evenness,
                'shannon': q2_diversity_lib.alpha.shannon_entropy,
                'alpha_passthrough': q2_diversity_lib.alpha.alpha_passthrough,
                'bray_curtis': q2_diversity_lib.beta.bray_curtis,
                'jaccard': q2_diversity_lib.beta.jaccard,
                # ... etc etc etc
            },
        }

    def get_action(self, plugin, action):
        return self.mappings[plugin][action]

Visualizers

The following visualizers will need to be updated to invoke the "Pipelines as Python Functions" listed above:

Using the PyContext class above, we can "inject" the vanilla Python functions in:

...

ctx = PyContext()

for d, i in itertools.product(depth_range, iter_range):
    rt = rarefy(feature_table, d)
    for m in metrics:
        if m in phylogenetic_metrics():
            rt_p = transform(rt, to_type=BIOMV210Format,
                             from_type=biom.Table)
            vector = alpha_phylogenetic(
                    ctx=ctx, table=rt_p, metric=m, phylogeny=phylogeny)
        else:
            vector = alpha(ctx=ctx, table=rt, metric=m,
                           drop_undefined_samples=False)
        data[m][(d, i)] = vector
return data

...

Summary and Conclusion

Like I said above, this all looks great, and I think the main parts of the implementation are there, but I think we can do a lot to help streamline it and reduce the complexity of the Pipeline -> Pipeline -> Method workflow that is currently proposed. Let me know if you have any questions or comments. I'm sure I missed one or two things in here, but I have played with all of these proposed changes locally and can confirm that the plan proposed here works.

🦖

thermokarst commented 4 years ago

Going to merge now - travis is failing because the staging env is missing the latest div-lib. I tested locally and all seems in order, no need to wait 2-3 hrs for the new env.