nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0
268 stars 127 forks source link

Frequencies JSON spec #84

Open trvrb opened 6 years ago

trvrb commented 6 years ago

PR #83 surfaced an issue of JSON format for frequencies data. Auspice frequencies panel wants just tip frequencies to work and rather than try to shoe-horn tips into the previous frequencies data model, it felt better to make something more targeted. However, @huddlej brought up the issue that it's going to be really annoying supporting two different frequencies formats. This is my attempt to identify a merged JSON format that works just for _tipfrequencies.json but also works for the combined _frequencies.json with all nodes in the tree, annotated clades and mutations.

Current _frequencies.json is almost completely flat:

{
  "pivots": [ 2015.0833, 2015.1667, ... ],
  "global_HA1:40N": [ 0.0003, 0.0002, ...],
  "north_america_clade:2024": [ 0.0, 0.0, ...],
  "africa_6b.1": [ 0.1069, 0.1075, ...],
  ...
}

You have to know how the keys are constructed. Each begins with a region code (NA, africa, etc...) followed by _ then one of three things:

  1. A mutation, coded as protein (HA1, SigPep, etc...) followed by : followed by mutation (40N, etc...)
  2. A generic clade in the tree with clade:2024 indicating clade 2024 in the tree.
  3. Annotated clade in the tree like 6b.1 or 3c3.a.

Currently proposed _tipfrequencies.json has a bit more structure:

{
  "pivots": [ 2015.0833, 2015.1667, ... ],
  "A/Dnipro/768/2016": {
    "frequencies": [0.004389,  0.004405, ...],
    "weight": 1.0
  }
}

My proposed flat spec would be:

{
  "pivots": [ 2015.0833, 2015.1667, ... ],
  "HA1:40N": {
    "frequencies": [ 0.0003, 0.0002, ...],
    "region": "global"
  },
  "clade:2024": {
    "frequencies": [ 0.0, 0.0, ...],
    "region": "north_america"
  },
  "6b.1": {
    "frequencies": [ 0.1069, 0.1075, ...],
    "region": "africa"
  },
  "A/Dnipro/768/2016": {
    "frequencies": [0.004389,  0.004405, ...],
    "weight": 1.0
  },
}

This makes it easy to reach into _frequencies.json and grab tip by strain name, grab clade by clade index, grab annotated clade by clade_annotation or grab mutation by mutation name.

Note that the flat format makes it difficult to just from _frequencies.json collect all, say, annotated clades. This would require adding hierarchy or marking these with something like anno:6b.1 and strain:A/Dnipro/768/2016 (but I think hierarchy is significantly better than this labeling).

My proposed hierarchical spec would be:

{
  "pivots": [ 2015.0833, 2015.1667, ... ],
  "mutations": {
    "HA1:40N": {
      "frequencies": [ 0.0003, 0.0002, ...],
      "region": "global"
    },
   },
  "clades": {
    "2024": {
      "frequencies": [ 0.0, 0.0, ...],
      "region": "north_america"
    },
  },
  "annotations": {
    "6b.1": {
      "frequencies": [ 0.1069, 0.1075, ...],
      "region": "africa"
    },
  },
  "tips": {
    "A/Dnipro/768/2016": {
      "frequencies": [0.004389,  0.004405, ...],
      "weight": 1.0
    },
  },
}

In this case _tipfrequencies.json gains a tips dictionary at the top-level as well and corresponds to a simple subset of what's in _frequencies.json.

The latter hierarchical spec seems more self-documenting to me. I might slightly prefer it, but it's a mild preference. Implementing either spec in nextflu/auspice would be very easy.

Can I get votes from people that work with frequencies, ie. @sidneymbell, @huddlej, @jameshadfield and @rneher for which they prefer?

jameshadfield commented 6 years ago

I don't use the frequencies JSON for analysis, only parsing in auspice, so I won't weigh in on the format. The only relevant input I have is that there be a auspice-specific JSON to reduce the amount of data downloaded when visiting nextstrain.org, but this could easily be either of the proposed specs restricted to tips.

Also, tipfrequencies.json is now tip-frequencies.json as per @huddlej's PR review.

sidneymbell commented 6 years ago

I'd have a strong preference for working with the hierarchical format for downstream analyses. Parsing the current flat format drives me bonkers :)

A few thoughts:

rneher commented 6 years ago

I am fine with a hierarchical format, but I am puzzled by this part of the proposal:

  "clades": {
    "2024": {
      "frequencies": [ 0.0, 0.0, ...],
      "region": "north_america"
    },
  },

how are different regions supposed to be distinguished? doesn't this require another level?

rneher commented 6 years ago

haven't gotten to the bottom of the auspice frequency plotting, but I don't fully understand how frequencies of tips should be interpreted.

huddlej commented 6 years ago

I also prefer the hierarchical spec for the self-documenting and interpretability benefits. I had the same question as @rneher about how regions would be handled though. Maybe something like:

  "regions": ["global", "north_america", ...],
  "clades": {
    "2024": {
      "north_america": [ 0.0, 0.0, ...],
      "global": [ 0.0, 0.0, ...]
    },
  },

I can't remember what the tip weight does. Will the tip frequencies always be global clade frequencies, but the weight will allow us to recalculate weighted frequencies by region (or some alternate characteristic)?

huddlej commented 6 years ago

While thinking about how to serialize frequencies from the KdeFrequencies class to JSON for import and export, I spent some time also thinking about the ideal interface to the class itself that would be the most compatible with serialization. In its simplest definition, the frequencies class needs to accept zero or more user parameters for KDEs, weighting, and censoring and use those parameters to estimate node frequencies for a given tree.

The frequencies interface must work for the following tools:

The frequencies class should also address the following usability issues related to the tools listed above.

To address these constraints, a frequencies JSON file should include at least the parameters used to generate frequencies and ideally also the calculated pivots and frequencies corresponding to those parameters and a given tree. If the pivots and frequencies data are not given in the JSON, it should be possible to recalculate them from the original tree (Newick + metadata JSON) and frequencies parameters. I propose the following schema for the frequencies JSON.

{
    "parameters": {
        "sigma_narrow": 0.08,
        "sigma_wide": 0.3,
        "proportion_wide": 0.25,
        "pivot_count": 24,
        "pivot_frequency": 0.08,
        "weights": {
            "africa": 1.02,
            ...
        },
        "weights_attribute": "region",
        "max_date": 2017.0,
        "include_internal_nodes": false
    },
    "data": {
        "pivots": [],
        "frequencies": {
            "global": {
                "strain_1": [],
                "strain_2": [],
                ...
            }
            "china": {
                "strain_1": [],
                "strain_2": [],
                ...
            }
        }
    }
}

Importantly, this schema highlights that frequencies are defined by their parameters and an accompanying tree. This schema augments an existing tree without the redundancy of replicating that tree in JSON format. The data serve as a payload for external applications such as auspice that do not have access to the augur Python code.

Along with this JSON schema, I propose the following API for the KdeFrequencies class that will make it easy to produce the above JSON output and also address the interface concerns listed above. This API is partially inspired by the standard scikit-learn interface through which users define model parameters by instantiating a model class and fit a model to some given input data. In the case of frequencies, the parameters are all of the settings for the KDE, pivots, and weighting and the data are from a tree to estimate frequencies for.

The simplest interface would look like this.

#
# Assumes an annotated Bio.Phylo tree has been loaded.
#

# Instantiate a frequencies object with default parameters.
kde_frequencies = KdeFrequencies()

# Estimate frequencies from the given tree.
# Stores the results in a `frequencies` attribute of the
# instance to enable easy exporting to JSON and returns
# the results to the caller, too, to enable more functional
# programming.
# Each call to `estimate` overwrites previous `frequencies`
# data.
frequencies = kde_frequencies.estimate(tree)

The interface for weighted frequencies would look like this.

# Define weights for tips by region.
# These could be also provided to the augur command line tool
# as a simple dictionary in an additional JSON file or through
# an existing node metadata JSON.
weights = {
    "africa": 1.02,
    "europe": 0.74,
    "north_america": 0.54,
    ...
}

# Estimate frequencies with tips weighted by region attribute.
# Any key in `node.attr` is a valid weight attribute.
# Weighting is not enabled by default because weighting data
# are likely to be pathogen-specific.
kde_frequencies = KdeFrequencies(
    weights=weights,
    weights_attribute="region"
)
# The `estimate` interface abstracts away the complexities
# of calculating unweighted or weighted frequencies that
# the user should not be concerned about.
weighted_frequencies = kde_frequencies.estimate(tree)

The interface to (re)calculate censored frequencies would look like this.

# Specify a maximum date for tips to use in estimation.
kde_frequencies = KdeFrequencies(max_date=2017.00)
censored_frequencies = kde_frequencies.estimate(tree)

# Recalculate frequencies with a different date threshold
# as in the fitness model.
# New instances are cheap to create and prevent unwanted
# side-effects associated with changes to an existing
# instance.
# New frequencies are quick to recalculate.
kde_frequencies = KdeFrequencies(max_date=2018.00)
censored_frequencies = kde_frequencies.estimate(tree)

The interface to change default KDE parameters will look like this.

# Calculate frequencies every 1/4 of a year,
# use a narrow bandwidth of 1 month,
# use a wide bandwidth of 3 months,
# add 25% of the wide bandwidth to the total bandwidth,
# and only calculate frequencies for tips of the tree.
kde_frequencies = KdeFrequencies(
    pivot_frequency=0.25,
    sigma_narrow=1 / 12.0,
    sigma_wide=3 / 12.0,
    proportion_wide=0.25,
    include_internal_nodes=False
)
frequencies = kde_frequencies.estimate(tree)

Import and export frequencies:

# Serialize the frequencies instance in JSON format.
# This is a Python dictionary that can be parsed by
# `json.dump`.
frequencies_json = kde_frequencies.to_json()

# Instantiate a frequencies object from a given frequencies
# JSON exported from a previous instance.
# If pivots and frequencies are defined in the `data`
# section of the JSON, their values will be assigned to the
# new instance.
kde_frequencies = KdeFrequencies.from_json(frequencies_json)

The following provides a full example with tree loading (from an annotated auspice tree), custom parameters, weighted frequencies, and censoring.

# Import base Python modules.
import json

# Import augur classes.
from base.frequencies import KdeFrequencies
from base.io_util import json_to_tree

# Import pathogen-specific data.
from builds.flu.flu_info import regions

# Load a tree.
with open("flu_seasonal_h3n2_ha_2y_tree.json", "r") as fh:
    tree_json = json.load(fh)

tree = json_to_tree(tree_json)

# Define weights for tips based on region name and
# population size in billions.
weights = {region[0]: region[2] for region in regions}

# Estimate tip frequencies weighted by region.
kde_frequencies = KdeFrequencies(
    pivot_frequency=0.25,
    sigma_narrow=1 / 12.0,
    sigma_wide=3 / 12.0,
    proportion_wide=0.25,
    include_internal_nodes=False,
    weights=weights,
    weights_attribute="region"
)
frequencies = kde_frequencies.estimate(tree)

# Export frequencies and associated parameters to JSON.
frequencies_json = kde_frequencies.to_json()
with open("frequencies.json", "w") as fh:
    json.dump(frequencies_json, fh)

# Import frequencies from existing tree and JSON.
new_kde_frequencies = KdeFrequencies.from_json(frequencies_json)
new_frequencies = new_frequencies.estimate(tree)

@trvrb and @rneher, what do you think of this proposed schema and API change?