malariagen / malariagen-data-python

Analyse MalariaGEN data from Python
https://malariagen.github.io/malariagen-data-python/latest/
MIT License

Add taxon and area parameters to plot_frequencies_time_series() #182

Open alimanfoo opened 2 years ago

alimanfoo commented 2 years ago

Allow the user to control which taxa and areas are shown, without having to recompute the frequencies.

leehart commented 1 year ago

Some notes:

Here ds is an input param, an xarray.Dataset of variant frequencies:

cohort_vars = [v for v in ds if v.startswith("cohort_")]
df_cohorts = ds[cohort_vars].to_dataframe()

DataFrames for each cohort are concatenated:

# variant_labels is not defined in this excerpt; it comes from earlier in the function
dfs = []
for cohort_index, cohort in enumerate(df_cohorts.itertuples()):
    ds_cohort = ds.isel(cohorts=cohort_index)
    df = pd.DataFrame(
        {
            "taxon": cohort.taxon,
            "area": cohort.area,
            "date": cohort.period_start,
            "period": str(
                cohort.period
            ),  # use string representation for hover label
            "sample_size": cohort.size,
            "variant": variant_labels,
            "count": ds_cohort["event_count"].values,
            "nobs": ds_cohort["event_nobs"].values,
            "frequency": ds_cohort["event_frequency"].values,
            "frequency_ci_low": ds_cohort["event_frequency_ci_low"].values,
            "frequency_ci_upp": ds_cohort["event_frequency_ci_upp"].values,
        }
    )
    dfs.append(df)
df_events = pd.concat(dfs, axis=0).reset_index(drop=True)

A query is applied to remove events with no observations:

df_events = df_events.query("nobs > 0")

I suppose we could exclude cohorts that don't match the specified taxon or area parameters from that concatenation, but it looks like that would still require us to compute the frequencies.
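A rough sketch of what that exclusion might look like, using only pandas. The helper name select_cohort_indices and the taxa and areas arguments are hypothetical, not part of the current API, and the cohort table is made up. The idea is to work out which cohort positions match before the loop, so that only those are passed to ds.isel(cohorts=...):

```python
import pandas as pd


def select_cohort_indices(df_cohorts, taxa=None, areas=None):
    """Return positional indices of cohorts matching `taxa` and `areas`.

    Hypothetical helper: each argument may be a string, a list of
    strings, or None (meaning no filtering on that column).
    """
    keep = pd.Series(True, index=df_cohorts.index)
    for column, wanted in (("taxon", taxa), ("area", areas)):
        if wanted is None:
            continue
        if isinstance(wanted, str):
            wanted = [wanted]
        keep &= df_cohorts[column].isin(wanted)
    # Positional indices, suitable for positional selection along "cohorts".
    return [i for i, flag in enumerate(keep) if flag]


# Made-up cohort table standing in for df_cohorts.
df_cohorts = pd.DataFrame(
    {
        "taxon": ["gambiae", "coluzzii", "gambiae"],
        "area": ["BF-09", "BF-09", "ML-2"],
    }
)
matching = select_cohort_indices(
    df_cohorts, taxa="gambiae", areas=["BF-09", "ML-2"]
)
```

The loop above could then skip any cohort_index not in matching. Whether that counts as "recomputing" depends on whether ds already carries the frequency variables at that point, which I haven't confirmed.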

We need to make sure we use the appropriate data in calculations, e.g.:

frq = df_events["frequency"]
frq_ci_low = df_events["frequency_ci_low"]
frq_ci_upp = df_events["frequency_ci_upp"]
df_events["frequency_error"] = frq_ci_upp - frq
df_events["frequency_error_minus"] = frq - frq_ci_low
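To illustrate, here is a minimal sketch with a made-up events table, assuming the taxon filter is applied to df_events before the error columns are derived. The values and the df_shown name are illustrative only. Because the error columns are row-wise differences, filtering rows first should give the same values as filtering afterwards:

```python
import pandas as pd

# Made-up events table with the columns used above.
df_events = pd.DataFrame(
    {
        "taxon": ["gambiae", "coluzzii"],
        "area": ["BF-09", "BF-09"],
        "frequency": [0.5, 0.25],
        "frequency_ci_low": [0.25, 0.125],
        "frequency_ci_upp": [0.75, 0.5],
    }
)

# Hypothetical taxon selection; copy so we can add columns safely.
df_shown = df_events[df_events["taxon"].isin(["gambiae"])].copy()

# Distances from the point estimate up to the upper bound and
# down to the lower bound of the confidence interval.
frq = df_shown["frequency"]
df_shown["frequency_error"] = df_shown["frequency_ci_upp"] - frq
df_shown["frequency_error_minus"] = frq - df_shown["frequency_ci_low"]
```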

But maybe the idea is simply to toggle different taxa and areas on the plot itself, somehow, without any refreshing or recomputing? (I don't understand yet.)