scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.92k stars 602 forks source link

Best practice for handling replicates and associated statistics #106

Open dolleyj opened 6 years ago

dolleyj commented 6 years ago

I'm curious as to what would be the best practice for this situation...

I have an np.array containing 18 expression values for gene y. There are 3 replicates, so 6 different conditions. adata.X[:,0] = [ 72. 92. 51. 93. 1. 46. 0. 33. 46. 75. 56. 28. 90. 100. 7. 25. 40. 81.]

I need to calculate several values: replicate average, pvalue, FDR, standard error and standard deviation.

Currently, I can calculate average for the replicates. The result for the above example is: adata.Xmean[:,0] = [71.6667 71.6667 71.6667 46.6667 46.6667 46.6667 26.3333 26.3333 26.3333 53. 53. 53. 65.6667 65.6667 65.6667 48.6667 48.6667 48.6667]

This seems redundant as the average is listed for each replicate. It led me to think about separating the replicates into their own .X, like .X1 for replicate 1, .X2, etc. This would mean the .Xmean[0] would link to .X1[0], .X2[0], .X3[0].

The averages may be the only one to benefit from this setup, as each replicate will have its' own p-value, FDR, standard error, and standard deviation. This makes me think the redundant .Xmean is the better approach.

What are your thoughts? Thank you in advance!

jorvis commented 6 years ago

@falexwolf - feedback here would be appreciated. We are weary of rolling our own solution when a standard may be in place or planned.

falexwolf commented 6 years ago

Thank you for the reminder, Joshua! :smile:

How about doing this?

import scanpy.api as sc
import pandas as pd
adata = sc.datasets.toggleswitch()
adata.obs['replicate'] = 0
adata.obs['replicate'].loc[100:] = 1
df = pd.DataFrame(adata.X)  # does not allocate new memory if X is an array, so this efficient
df['replicate'] = adata.obs['replicate'].values  # if not using assign, no copy is made
df_grouped = df.groupby('replicate')
print(df_grouped.mean())
print(df_grouped.std())

outputs

                  0         1
replicate                    
0          0.510177  0.135317
1          0.152043  0.439836
                  0         1
replicate                    
0          0.293965  0.162549
1          0.153663  0.271669

Of course, you can add this stuff as unstructured annotation to an AnnData...

Does it answer your question?

PS: Visualize this is using ideas e.g. from https://stackoverflow.com/questions/46186784/handling-replicate-data-in-pandas

jorvis commented 6 years ago

I think it shows us some strategy, yes. I can't claim to understand what each line you did there does but I have a notebook open now and am trying to step through it. I think part of the issue here is we're wanting to store aggregate values like means within adata so that they don't have to be recomputed every time they are needed. Web interfaces are being driven off these files, so rapid access is preferred over storage concerns.

If, in the example of datasets with technical replicates, we almost always are only interested in the mean values across replicates. So if our input is like this:

screenshot from 2018-04-11 15-11-07

Simple demo with 4 genes, 2 conditions, and 3 replicates of each condition. Does one make this exact matrix adata.X?

screenshot from 2018-04-11 15-16-11

Or should the means be calculated and stored as adata.X with individual replicates stored as separate matrices of the same size, like adata.uns['rep1'], adata.uns['rep2'], ... etc. This is specifically the part I was asking about regarding conventions, as it seems there must already be a convention for storing technical replicates and their aggregate values (mean, stddev, etc.)

I think the examples given with values like 0 and 1 are a bit confusing because I can't tell if you're using that as a proxy for column names or are they index positions?