[Task] Differential abundance

openproblems-bio / openproblems

Formalizing and benchmarking open problems in single-cell genomics

MIT License

321 stars 81 forks source link

[Task] Differential abundance #140

Closed dburkhardt closed 2 months ago

dburkhardt commented 3 years ago

Describe the problem concisely. Quantifying changes in cell type abundance between datasets

https://www.biorxiv.org/content/10.1101/532846v4 https://www.biorxiv.org/content/10.1101/2020.11.23.393769v1.full

Propose datasets https://science.sciencemag.org/content/360/6392/981

Propose methods https://github.com/KrishnaswamyLab/MELD/ https://github.com/MarioniLab/miloR

Propose metrics MSE

dburkhardt commented 3 years ago

Things coming up as I work on this task:

Needed subtasks: e.g. in the current framework, I'm artificially enriching or depleting populations of cells and assigning to replicates / conditions based on the degree of enrichment. The difference between 2 or 3 conditions / replicates requires changing parameters on the experiment simulator function, and some methods may do better or worse with more/fewer conditions
- It would be nice if we could save fit operators on the dataset if that makes running code faster. E.g. for MELD, we build a graph over all cells regardless of the sample label, so you could reuse this graph if you wanted to run hundreds of simulated sample assignments on the same dataset.
  - One idea for this would be to store some data in the adata object to see if there's a fit graph stored in e.g. adata.uns["meld_graph"]
As this is an artificial data experiment, I want some way to show the data that we're testing against so people can see what's going on. Should this be in Jupyter notebooks?

LuckyMD commented 3 years ago

Another method for the mix: scCODA. There a simulation framework in there as well and Maren extensively looked for published datasets which could be used. So it wouldn't only need to be artificial. However, ground truth is an issue without simulation.

dburkhardt commented 3 years ago

Another method for the mix: scCODA. There a simulation framework in there as well and Maren extensively looked for published datasets which could be used. So it wouldn't only need to be artificial. However, ground truth is an issue without simulation.

Does scCODA provide an estimate for each cell or just for each cluster label? One issue here is that methods like MELD and MILO provide a different estimate for each cell (if I understand MILO correctly). If we use a simulation framework like is being used in the task/differentialabundance branch where the effect of the perturbation varies continuously, then I think scCODA won't do well simply because it's relying on cluster resolution.

emdann commented 3 years ago

Hi, just jumped on this one

One issue here is that methods like MELD and MILO provide a different estimate for each cell

Milo's Fold-change estimate is on a neighbourhood on the KNN graph, not on each single-cell. A cell specific score could be obtained by averaging the scores across neighbourhoods to which the cell belongs.

LuckyMD commented 3 years ago

I think we might need to find separate task definitions for cell-specific condition probabilities and cluster-wise differential composition testing. As a first brain-storm

Differential composition analysis (for cluster-wise analysis)
Manifold density analysis

Methods for the latter can also be applied to the former of course.

dburkhardt commented 3 years ago

Great meeting today! Uploading some notes from the call:

Meeting 1/8/2020

This task will require subtasks
Want a more non-linear way to generate differential abundance vectors
We want to have a good set of 1) inputs 2) outputs and 3) way to generate data from multiple models for each possible subtask
- @LuckyMD suggested that having a good way to generate ground truth may dictate whether or not a method gets included
@emdann suggested using dyngen with slightly different parameters to generate two single cell datasets with partial overlap
@wes-lewis suggested that we might want a task there the goal is to identify regions of differential abundance (maxima, minima)
Discussed how do we want to bias biological controls vs simulated ground truth controls and what bioloigcal experiments (growth factors, @MikeDMorgan suggested spike ins of cell lines (possible related), @LuckyMD suggested a competitive assay - can we estimate effect size at different points of time)

dburkhardt commented 3 years ago

In the interest of defining subtasks, I've come up with a description of the "single cell condition likelihood estimation" subtask that I've been working on (would love a better name).

Input

expression_matrix array-like, shape=(n_obs, n_var) - a single-cell expression matrix
condition_probability array-like, shape=(n_obs, n_conditions) - a collection of vectors over the cells that gives the probability that a given cell is assigned to each of n_conditions. Only used for metric evaluation
sample_labels array-like, shape=(n_obs,) - a sample label for each cell giving the condition / replicate for each cell

Output

condition_probability array-like, shape=(n_obs, n_conditions) - a collection of vectors over the cells that estimates the probability that a given cell is assigned to each of n_conditions.

Ground truth The goal of this task is to estimate the likelihood of a sample condition label given a cell's UMI count.

This task is calculated using experimental scRNA-seq data and then artificially creating sample labels over the dataset. The labels are creating using a ground-truth probability function that is smooth over the dataset. Labels indicate condition and replicate. In the current task definition, condition probabilities are different for each condition across the data, but replicate assignments are uniform random across the dataset. We then concatenate the condition and replicate labels to create a sample label, e.g. condition2.replicate3.

To calculate ground-truth probabilities that a given cell has a given condition label (i.e. was observed in a given condition), we take a convex combination of principal components of the data for each condition. These PC-combinations are then L1-normalized across samples to create a probability of each label given a cell (the probabilities of all the labels sums to 1 across conditions). Cells are then randomly assigned to each replicate.

emdann commented 3 years ago

Hey @dburkhardt and all I have a comment about the implementation of the treatment simulation. I see that at the moment we simulate an experimental design where for the same (biological?) replicate both conditions are present (see here). And indeed in the implementation of MELD the density estimates are normalized across samples from the same replicate to calculate the relative likelihood (here).

I wonder if this represents a realistic scenario. For example, in the case of comparing abundance between samples from healthy and diseased individuals we might have biological replicates for the same condition, but not across conditions. DA methods would be able to account for variation between replicates of the same condition, but not across conditions. Hence would it be more accurate for adata.uns["replicates"] to just store the sample info (e.g. condition2.replicate3)? I can think of very limited applications with a "paired-design" in single-cell analysis.

Does this make sense? I might just be misunderstanding the density to likelihood normalization step in MELD

MarcElosua commented 3 years ago

hi @dburkhardt and all,

This is a very cool and interesting project!

To quantify differential abundance between datasets you might also be interested in trying logistic regression to compute the OR between conditions and adjust for covariates.

I link a post by Valentine Svensson in which he shows how to use this implementation and provides a dataset and a twitter thread where there is some open discussion. Code and sample data for the post are here.

emdann commented 3 years ago

Thoughts around a second subtask, as proposed by @wes-lewis. I’ve highlighted some issues that IMO require some discussion.

Detection of perturbed subpopulations

Aim: we want to evaluate the ability of DA methods to detect subpopulations of cells that are similar in gene expression space and similarly enriched or depleted by an experimental perturbation. This subtask should evaluate the sensitivity to detect differential abundance in small subpopulations enriched or depleted upon perturbation (conversely, the task evaluating single-cell condition probability gives high scores to a method detecting all cells in a large cluster showing DA, but no cells in smaller clusters).

Input Considering a simple case with 2 experimental conditions 1.expression_matrix array-like, shape=(n_obs, n_var) - a single-cell expression matrix

sample_labels array-like, shape=(n_obs,) - a sample label for each cell giving the condition / replicate for each cell
perturbed_populations array-like, shape=(n_obs,) - a vector that assigns cells to a perturbed subpopulation. How should these partitions be defined? A simple option is to use the KNN graph over the whole dataset, remove edges between not DA cells (e.g. condition probability in [0.4, 0.6]) and between cells with opposite “direction of DA”, use new graph for clustering DA cells. IMO this works to evaluate sensitivity to small subpopulations, but not to define a “ground-truth partition”.
perturbation_effect: array-like, shape=(n_obs,) - this needs to store the DA effect size for each perturbed population detected (at least indicate enrichment/depletion/noDA based on a certain threshold).

Output: (dependent on choice of metrics)

Possible metrics:

Population-wise recall: fraction of true positives identified from each subpopulation defined in perturbed_populations. In this case the output could be a simple vector of DA labels for each single cell (i.e. enriched/depleted/notDA)
_Do we want to evaluate the ability to recover the same partitions as stored in perturbed_subpopulations?_ Incorporating some metric for clustering similarity? If we want to do this, then we should discuss how to generate the ground-truth partitions.

dburkhardt commented 3 years ago

Hey @dburkhardt and all I have a comment about the implementation of the treatment simulation. I see that at the moment we simulate an experimental design where for the same (biological?) replicate both conditions are present (see here). And indeed in the implementation of MELD the density estimates are normalized across samples from the same replicate to calculate the relative likelihood (here).

I wonder if this represents a realistic scenario. For example, in the case of comparing abundance between samples from healthy and diseased individuals we might have biological replicates for the same condition, but not across conditions. DA methods would be able to account for variation between replicates of the same condition, but not across conditions. Hence would it be more accurate for adata.uns["replicates"] to just store the sample info (e.g. condition2.replicate3)? I can think of very limited applications with a "paired-design" in single-cell analysis.

Does this make sense? I might just be misunderstanding the density to likelihood normalization step in MELD

Just to clarify here, @emdann, MELD does not require paired replicates, you can normalize the likelihoods however you want. The goal of the likelihood is to say "here is the PDF of ≥2 variables, where is each higher or lower than the other(s)?" I haven't yet written a pipeline for cases where you have an uneven number of treatment / control samples, but that's definitely handle-able and we should include.

wes-lewis commented 3 years ago

My apologies for not adding my thoughts until now! Another method that I'm eager to add is DA-seq, which seems worth directly comparing against the others thus-far suggested, and is available here: https://github.com/KlugerLab/DAseq https://www.biorxiv.org/content/10.1101/711929v1

I also very much agree with @emdann's description of a population recall parameter, which would depend heavily on the parameterization and types of perturbations that we enforce of perturbed_populations. I wonder if it could be ideal to perform a sliding grid search with differently weighted perturbations, to differentiate some models that might perform better on near perturbations. This would be along a similar line to using different kinds of perturbations (weighted diffusion components, PCs, probabilistic models.)

dburkhardt commented 3 years ago

hi @dburkhardt and all,

This is a very cool and interesting project!

To quantify differential abundance between datasets you might also be interested in trying logistic regression to compute the OR between conditions and adjust for covariates.

I link a post by Valentine Svensson in which he shows how to use this implementation and provides a dataset and a twitter thread where there is some open discussion. Code and sample data for the post are here.

Hi @MarcElosua, I'm excited that you're interested in OpenProblems! I had seen this article on GLMs, and I think the idea of using a Logistic Regression Classifier is somewhat similar to the DA-seq package that @wes-lewis mentioned above. We readily welcome contributions to this discussion in this thread and in PRs!

dburkhardt commented 3 years ago

Thoughts around a second subtask, as proposed by @wes-lewis. I’ve highlighted some issues that IMO require some discussion.

Detection of perturbed subpopulations

Aim: we want to evaluate the ability of DA methods to detect subpopulations of cells that are similar in gene expression space and similarly enriched or depleted by an experimental perturbation. This subtask should evaluate the sensitivity to detect differential abundance in small subpopulations enriched or depleted upon perturbation (conversely, the task evaluating single-cell condition probability gives high scores to a method detecting all cells in a large cluster showing DA, but no cells in smaller clusters).

Input Considering a simple case with 2 experimental conditions 1.expression_matrix array-like, shape=(n_obs, n_var) - a single-cell expression matrix

sample_labels array-like, shape=(n_obs,) - a sample label for each cell giving the condition / replicate for each cell

perturbed_populations array-like, shape=(n_obs,) - a vector that assigns cells to a perturbed subpopulation. How should these partitions be defined? A simple option is to use the KNN graph over the whole dataset, remove edges between not DA cells (e.g. condition probability in [0.4, 0.6]) and between cells with opposite “direction of DA”, use new graph for clustering DA cells. IMO this works to evaluate sensitivity to small subpopulations, but not to define a “ground-truth partition”.

perturbation_effect: array-like, shape=(n_obs,) - this needs to store the DA effect size for each perturbed population detected (at least indicate enrichment/depletion/noDA based on a certain threshold).

Output: (dependent on choice of metrics)

Possible metrics:
* Population-wise recall: fraction of true positives identified from each subpopulation defined in `perturbed_populations`. In this case the output could be a simple vector of DA labels for each single cell (i.e. enriched/depleted/notDA)

* _Do we want to evaluate the ability to recover the same partitions as stored in `perturbed_subpopulations`?_ Incorporating some metric for clustering similarity? If we want to do this, then we should discuss how to generate the ground-truth partitions.

I like this idea. I think the questions I have are 1) How do you decide which subpopulations are the deferentially abundant between samples 2) Is the idea that the affected subpopulations have a uniform level of enrichment that changes discretely between subpopulations?

I understand the algorithmic appeal of this. For the MELD manuscript, we introduce a clustering algorithm to identify DA populations, and for the benchmarks I had to do an experiment like this where there's a sharp jump in enrichment between adjacent clusters. My concern is that this isn't a plausible model for biology. I don't think two cells adjacent on a kNN graph would be expected to have a step-change in enrichment. Rather, I think that these changes are likely to be smooth. Maybe we could have some smooth transition between the adjacent populations and define the boundry as the half-way point? E.g. if you had a subpopulation with an enrichment of 0.3 next to a population of 0.7, you could create a smooth transition between them and then define the "ground truth boundary" is at cells greater or less than 0.5. IF this doesn't make sense I can draw what I'm thinking

emdann commented 3 years ago

How do you decide which subpopulations are the deferentially abundant between samples

Good question. We could generate conditional probabilities with simulate_treatment and select a threshold (e.g. prob > 0.6 == enrichment, prob < 0.4 == depletion) and we could chart the methods' power at different threshold levels. Alternatively, we could define populations a priori - e.g. with clustering or using true cell type labels from the dataset - and assign enrichment probabilities to each population. This brings me to point 2...

I understand the algorithmic appeal of this. For the MELD manuscript, we introduce a clustering algorithm to identify DA populations, and for the benchmarks I had to do an experiment like this where there's a sharp jump in enrichment between adjacent clusters. My concern is that this isn't a plausible model for biology. I don't think two cells adjacent on a kNN graph would be expected to have a step-change in enrichment. Rather, I think that these changes are likely to be smooth. Maybe we could have some smooth transition between the adjacent populations and define the boundry as the half-way point? E.g. if you had a subpopulation with an enrichment of 0.3 next to a population of 0.7, you could create a smooth transition between them and then define the "ground truth boundary" is at cells greater or less than 0.5.

I completely agree with this, and indeed we also found that simulating enrichment with "jumps" is an artifact that increases the FDR significantly for all methods. I like the idea of incorporating smooth transitions, how could that look in practice? The first thing I can think of is to use something like fuzzy clustering and use the "membership percentage" for each cell to generate gradual transitions in probability.

emdann commented 3 years ago

See an example below on a dataset of mouse gastrulation I was playing with. Here I assign condition probabilities based on distances from cluster centers, where each center has a baseline probability (here sampled randomly from a uniform distribution). Is this close to what you were thinking? Screenshot 2021-01-27 at 11 52 36

dburkhardt commented 3 years ago

I was inspired by this idea @emdann, and I think I have an idea that might work well for us. Thinking about these cluster centers as the local maxima or minima of this probability function give us an interesting way to smoothly interpolate between the heights of those peaks and the rest of the valleys. There's a process called inverse distance weighting that gives a relatively simple approach to smooth interpolation based on distance to a set of known references. The code below picks an arbitrary set of references "peaks" and then uses graph diffusion distance weights to smoothly interpolate between those peaks. I think these results look pretty good!

I like this because we can pick the centers of the peaks and valleys without just using the density of the data (i.e. cluster centroids). Maybe we don't want to just use random points, but this is a little more flexible. What do you think?

import matplotlib.pyplot as plt
import numpy as np
import scipy.spatial.distance
import sklearn
import scprep
import graphtools at gt
import pysgp

# Build graph
G = pygsp.graphs.Grid2d(32,32)
coords = G.coords
G = gt.Graph(G.A, precomputed="adjacency", use_pygsp=True)
G.set_coordinates(coords)

fig, axes = plt.subplots(2,2, figsize=(10,8))

for ax in axes.flatten():

    n_peaks = 4

    # Get peak indicies
    peaks = np.random.choice(G.N, n_peaks, replace=False)

    # Mask to get values that aren't peaks for interpolation
    not_peaks = ~np.isin(np.arange(G.N), peaks)

    # Assign values to peaks
    values = np.array([0,1,0,1])

    # Power diff_op
    diff_op = np.linalg.matrix_power(G.diff_op.toarray(), 20)

    #Calculate pairwise distances
    dist = scipy.spatial.distance.pdist(diff_op, metric='cityblock')
    dist = scipy.spatial.distance.squareform(dist)

    # Calculate weights using inverse distances, L1 norm for convex combo
    weights = sklearn.preprocessing.normalize(np.exp(-dist[not_peaks][:, peaks] * 2), 'l1')

    # Do interpolation
    interp = weights  @ values

    # Create final output array
    output = np.zeros(G.N)
    output[peaks] = values
    output[not_peaks] = interp

    scprep.plot.scatter2d(G.coords, c=output, cmap='RdBu_r', s=20, ax=ax)
    scprep.plot.scatter2d(G.coords, c='red', mask=~not_peaks, s=50, ax=ax)

fig.tight_layout()
fig.savefig('../img/probability_density.png', dpi=150)

probability_density

wes-lewis commented 3 years ago

Thank you for making this idea more concrete @emdann and @dburkhardt! From my perspective, one advantage of @emdann's proposal is to utilize known cluster identities within a set of carefully labeled datasets, although it might be the case that these cluster centroids on the combined matrix do not often correspond with enrichment/depletion between samples in a real setting. Nonetheless, having DA benchmarks and evaluations synthetically create a "ground truth" from real data seems much more feasible and objective than searching for examples of un-challenged ground truth datasets for DA testing, and has advantages of smoothness and alterable parameters.

One major question seems to be how we want to set our centroids for the insertion of synthetic enrichment/depletion within the data. For simplicity, Dan's example uses randomized points, whereas Emma's uses cluster centroids. I think that a mix of these two approaches might fit closest to the spirit of DA testing. Rather than setting smoothly enriched or depleted regions by proximity to cluster centers, or ignoring ground-truth clustering information from the data, we could take a two step approach such as this:

Create a smooth probability distribution P where the probability assigned to a given cell varies by some defined function/conditional probability with proximity to the known cluster centroids. This is in preparation for step 2, where cells can be sampled from this distribution as the centroids for enrichment/depletion effects. This would controllably attenuate the effect of enrichment or depletion being localized directly at cluster centers. One justification for doing this is that we might expect the centroid of a cluster to be relatively archetypal of its cell type or biological programming, making the centroids of most cell types likely to be maintained (rather than enriched/depleted) between samples. To control the strength of this attenuation, one could vary the influence of proximity to cluster centroids on values in P to balance clustering-informed vs uniform selection of centroids for enriched/depleted regions.

If we wanted to go even further and prioritize the enrichment or depletion of regions at the fringes of clusters, we could randomly split the known cluster centroids into two sets (A and B), with proximity to centroids in set A being positively/directly related with probabilities in P and proximity to centroids in B being inversely related with probabilities in P.

Sample from P a set of cells to use as centers of enrichment or depletion, with surrounding points varying smoothly between these centers, as shown in @dburkhardt's example. These would be the final probabilities used to simulate treatments of corresponding enrichment/depletion.

One thing that I'm still uncertain about is how we're actually simulating treatments from these distributions. It seems that both @dburkhardt and @emdann have a better grasp of how to use simulate_treatment in this case. My apologies if the heuristic I've just suggested is overkill/unnecessary as a result of my lack of familiarity with that method.

dburkhardt commented 3 years ago

I think you're on the right track here @wes-lewis. My argument thought is that I think this is a strong assumption I haven't seen supported by my experience (although perhaps we should investigate this further).

One justification for doing this is that we might expect the centroid of a cluster to be relatively archetypal of its cell type or biological programming, making the centroids of most cell types likely to be maintained (rather than enriched/depleted) between samples.

Why should density centers in the data or nodes with the most-centrality (what actually is the centroid of a louvain cluster?) be the peaks of differential abundance? Take this dataset from the crop-seq paper:

It's pretty clear that the areas with the most enrichment or depletion are actually at the extrema of the dataset. These are not cluster centroids.

Are there some datasets where the cluster centroids are the extrema? Sure. Is that true for a majority of datasets? I'm less convinced.

An advantage of randomly selecting peaks and valleys is that it ensures that a method is not bound to predicting differential abundance only at areas of density in the dataset or on a graph.

The idea of making the enrichment inversely related to the distance to the cluster centroid is an interesting one, but I think you'll find this isn't actually straightforward to do in practice while keeping the signal smooth on the graph and maintaining the ability to determine what defines an "enriched reason" although I'd love to be proved wrong.

Also to address your question about simulate_treatment, we're essentially just using P to parameterize np.random.choice(['treatment','control']) for each cell. Cells are then randomly split into a number of replicates.

wes-lewis commented 3 years ago

Thanks for the quick reply @dburkhardt! I appreciate that perspective, and generally agree with you. I've added some comments and thoughts below:

It's pretty clear that the areas with the most enrichment or depletion are actually at the extrema of the dataset. These are not cluster centroids. Are there some datasets where the cluster centroids are the extrema? Sure. Is that true for a majority of datasets? I'm less convinced.

I agree with these points completely. It's my experience as well that DA regions tend to be the extrema, and I keep returning to the thought that being able to preferentially sample peaks of DA regions from the extrema could therefore be helpful. Regardless, I agree that it could be beneficial to hold off on including strong assumptions into our regime for sampling DA peaks. It's also the case that randomly sampling DA peaks will already prioritize sampling from the densest regions, which is a behavior we may want to correct for. I can think of a few ways to somewhat correct for this (sampling DA peaks using a coefficient on KNN density, or sampling with binning over the embedding space,) but I'm happy to hear your thoughts on this as well.

The idea of making the enrichment inversely related to the distance to the cluster centroid is an interesting one, but I think you'll find this isn't actually straightforward to do in practice while keeping the signal smooth on the graph and maintaining the ability to determine what defines an "enriched reason" although I'd love to be proved wrong.

I'm also starting to second guess this idea of sampling DA regions from a distribution inversely related to cluster centroid proximity, but from a different perspective. I had been imagining datasets with separate/discontinuous clusters (maybe as a result of me looking at some multi-system/atlas data as of late,) but I hadn't thought much about developmental examples where clusters are transitioning on a clear and continuous manifold. This inverse weight would upweight the probability of sampling DA peaks at the extrema of the dataset, but would also upweight the continuous regions between neighboring clusters. This would prioritize the case of some replicates having enriched/depleted regions within transitional areas that are very important in the manifold structure and underlying biology. I'm certain that DA in these regions happens sometimes and is not biologically rare/impossible, but to artificially increase these instances in testing would be a bad idea (similarly to the opposite idea, of basing DA peaks at cluster centroids).

dburkhardt commented 3 years ago

So if I can summarize so far, it sounds like we're all on board with the idea of using some set of references to pick the local minima and maxima of enrichment and then smoothly interpolating between those peaks? Please correct me if I'm wrong @emdann and @wes-lewis.

At this point, it seems the questions are:

1) How do we pick the minima/maxma 2) How do we interpolate between them 3) How do we binarize this probability to define regions that are "enriched"

Picking the peaks To pick the minima/maxma, I realize we can't just use uniform random sampling of indices. I think we could use some form of density-aware sampling like binning an embedding and sampling from bins (like that geometric sketching paper). There's also a way to do this sampling from a graph using diffusion wavelets that someone in our lab is working on.

Interpolation I think this inverse distance weighting will likely be our best bet. I know @wes-lewis brought up:

This inverse weight would upweight the probability of sampling DA peaks at the extrema of the dataset, but would also upweight the continuous regions between neighboring clusters.

But I'm not sure what is meant by this. If you get far away from a reference, IWD will converse to the mean of the dataset. See the extrema here and in the figure I posted above:

Shepard_interpolation_1_dimension

The boundaries are no more or less likely to be picked as the peaks, and the farther you get from a peak, the more average the value becomes.

Discretizing I'm not convinced this is an easy problem. Why are 0.4 and 0.6 better than [0.45, 0.65] or [0.3, 0.7]? I like the idea of having some sense of confidence used to pick thresholds, like the pvalues in Milo, but I'm not sure how to do that. Do methods get to know what the thresholds are that we're using?

Would this be a valid solution to the task:

Estimate the underlying probability
Cut the estimate at values of 0.4 and 0.6 (or whatever our threshold is)
Ensure that each partition is connected on a graph

If so, how does this different from just directly estimate the underlying probability for each cell?

emdann commented 3 years ago

Thanks @dburkhardt and @wes-lewis for the very insightful discussion! I fully agree with points (1) and (2) in the summary from @dburkhardt.

On the point of discretization, converting a population-level (be it neighbourhoods, or clusters, or cell types) fold-change to a condition probability for each cell is not straightforward, since population-level probabilities are not independent. An option could be to compare fold-changes and probabilities directly with a scale-independent metric e.g. correlation, but this would again be suboptimal since fold-changes alone are dependent on the size of the population, variance between replicates etc. I expect similar problems would arise in converting the DA score from DAseq to a probability.

I'm open to suggestions on how to get around this, but I think setting a cutoff enables the use of more methods and is fair if the aim of the task is detect differential abundance, regardless of effect size. I'm taking this approach to expand our benchmark for revision of the Milo ms, and I'm thinking of using Precision-Recall curves over different cutoffs for "true DA" for evaluation. Another approach, taken by DA-seq, would be to set a cutoff for significance with a permutation of labels.

Do methods get to know what the thresholds are that we're using?

IMO yes. If you were an analyst using MELD to detect differential abundance, you would set a probability threshold that you consider significant.

I'm very curious to hear your ideas and thoughts on this!

dburkhardt commented 3 years ago

@emdann glad we mostly agree! I guess I'm not entirely sure what you're suggesting is the issue here. Maybe it would be best to set up a chat?

What do you mean by:

if the aim of the task is detect differential abundance, regardless of effect size

My understanding is that if we say differential abundance is smooth over a graph, then by definition DA is a measure that must be estimable at the level of individual cells/nodes because that's how one would calculate smoothness (e.g. using the graph Laplacian). I understand that is it useful to detect contiguous regions of the data space that have "similar DA" for the purpose of characterization and downstream analysis. But if those regions are determined as being within a threshold of DA measures, then isn't the region dependent on effect size?

emdann commented 3 years ago

I might be coming to this with too much of a statistical testing mindset, where the scientist decides what an interesting effect size and significance threshold are. Probably easier to organize a chat to pin down some details here!

dburkhardt commented 3 years ago

To-dos from our conversation today:

Add IWD + density subsampling for per-cell DA testing instead of current data generation strategy (Dan)
Add thresholding test, where we want all cells above or below a certain threshold, TPR/ FPR for different thresholds, not paired case control (Emma will start with 1 paragraph description)

emdann commented 3 years ago

Data: we simulate ground-truth differential abundance between biological conditions by selecting maxima and minima of DA and interpolate with IWD. This will be implemented in differential_abundance.datasets.utils._simulate_treatment

Picking points with geosketching: picking with PCA leads to biases, using PHATE or DiffMaps instead. Indices get stored in the anndata.uns.
Ensure that we don't have DA in most of the graph? tuning interpolation params or by selecting points with p = 0.5

TO DO: update _simulate_treatment function, incorporating parameters to control (A) size of DA region (how many points in the KNN graph represent a maximum/minimum of differences?), (B) DA effect size (probability at the maxima/minima, e.g. p=0.6 is small effect size and p=0.9 is high effect size). We start from @dburkhardt working implementation.

Metrics First objectives:

Accuracy of effect size estimate: average mean squared error between estimated and true probability across N simulations. How do we feed many different label simulations to the metric function? We could save N labels and underlying probabilities in the same anndata object
DA region recovery: TPR/FPR for identification of cells belonging to regions of DA, setting a threshold to the probability (possibly spitting out values for different probability thresholds to evaluate sensitivity)

Possible developments: split evaluation by size of DA region and effect sizes, controlling the label generation process

Methods: adding methods

Milo
DAseq
Simple differential testing on louvain clusters
(...?)

github-actions[bot] commented 2 months ago

This issue has been automatically closed because it has not had recent activity.