theislab / diffxpy

Differential expression analysis for single-cell RNA-seq data.
https://diffxpy.rtfd.io
BSD 3-Clause "New" or "Revised" License
179 stars 23 forks source link

Possible to set reference condition for two_sample tests? #205

Open adkinsrs opened 2 years ago

adkinsrs commented 2 years ago

I was originally trying to use the rank_genes_groups() function in scanpy to get p-values and logfoldchanges between a single condition and a reference condition. My end-goal was to take the DE results and make a volcano plot. After reading some of the scanpy Github issue tickets (https://github.com/theislab/scanpy/issues/397), I learned of diffxpy and how it seems to be recommended for DEG over scanpy in general situations.

I like how there is less need to transform the diffxpy.api.test.t_test output when compared to the output of scanpy.tl.rank_genes_groups to get the data into a format to create a volcano plot for my code. However, it seems that no matter which order I supply the two conditions for the t-test, the resulting output is the same. This results in some volcano plots that, when compared to a plot made using scanpy.tl.rank_genes_groups output, has about the same p-val but a log-fold change with the opposite sign.

Is there some easy way of forcing one of the conditions to be the reference condition for the diffxpy DEG tests? I apologize if I'm missing some inherent understanding or limitation of the tool.

Also, I think this question relates to #188 and to #184.

davidsebfischer commented 2 years ago

Hi @adkinsrs thanks for the issue. This is currently not supported as grouping formats are lost during parsing, the current state may also be desirable because this makes the test output invariable wrt data permutation. But I see your point here, enabling this would however require and extension of the code, specifically:

I will leave this open until addressed and would invite code contributions from interested parties via PR on dev branch!

adkinsrs commented 2 years ago

@davidsebfischer, thank you for the response. From my testing and tinkering it seems the issue is using numpy's unique function in https://github.com/theislab/diffxpy/blob/b8c6ae0d7d957db72e41bc2e705c240348c66509/diffxpy/testing/utils.py#L114.

The numpy.unique function sorts the unique values before returning the output. The pandas.Series.unique() method actually returns the unique values based on the order encountered, which works for my use-case provided I pre-sort the adata.obs object beforehand. I went into my local copy of diffxpy and changed that line to pd.Series(grouping).unique(), and it does give the output I was expecting as well as a mirrored volcano plot when I switch my two conditions.

I was thinking I could create a PR for this, but being it was a one-line change, I would want to know if this change is feasible for the bigger picture. Also should this be expanded as an option for the t-test/rank-test functions, or does that not matter so much since some people may not care which condition is the baseline/reference?

davidsebfischer commented 2 years ago

Thanks a lot for looking into this @adkinsrs! The issue with using pandas functions here is that the argument grouping of split_x does not necessarily receive pandas.Series in all cases. In particular in this case, if grouping was given to t_test as string pointing to a column in an adata.obs, this line her decomposes the pandas.Series https://github.com/theislab/diffxpy/blob/b8c6ae0d7d957db72e41bc2e705c240348c66509/diffxpy/testing/utils.py#L109. I guess you gave the pandas.Series directly to t_test::grouping instead of going via the string?

We could however extend your fix, encoding the error in the category order of the pandas.Series, and return this order as additional return from https://github.com/theislab/diffxpy/blob/b8c6ae0d7d957db72e41bc2e705c240348c66509/diffxpy/testing/utils.py#L105, which needs to be accepted in a few places centred around 2-group tests in https://github.com/theislab/diffxpy/blob/dev/diffxpy/testing/tests.py, and then add this additional group order input to

We could leave this option out for pairwise and versus_rest for now. I think having this as an additional argument in the code that is hidden from the user (outlined above) is desirable of encoding it as a pandas.Series as they tend to make code harder to understand if one is skimming input types in this context, at least this was my experience and this is the reason why we force transform to numpy internally.

adkinsrs commented 2 years ago

Thanks a lot for looking into this @adkinsrs! The issue with using pandas functions here is that the argument grouping of split_x does not necessarily receive pandas.Series in all cases. In particular in this case, if grouping was given to t_test as string pointing to a column in an adata.obs, this line her decomposes the pandas.Series

https://github.com/theislab/diffxpy/blob/b8c6ae0d7d957db72e41bc2e705c240348c66509/diffxpy/testing/utils.py#L109

. I guess you gave the pandas.Series directly to t_test::grouping instead of going via the string?

Actually I passed a string to an adata.obs column to the "grouping" parameter. I will paste my transformation code from the REPL

Python 3.7.3 (default, Apr 19 2021, 15:07:07)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import anndata, diffxpy.api as de
/opt/Python-3.7.3/lib/python3.7/site-packages/pandas/compat/__init__.py:117: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
  warnings.warn(msg)
>>> adata = anndata.read("./df726e89-b7ac-d798-83bf-2bd69d7f3b52.h5ad")
>>> adata.obs
                     cluster  replicate      louvain
index
Ute_P1_Neg_0             TEC          5          TEC
Ute_P1_Neg_5             TEC          3          TEC
Ute_P1_GFP_3          SC (i)          8       SC (i)
Ute_P1_GFP_0         SC (ii)         23      SC (ii)
Ute_P1_tdTom_3       HC (ii)         15      HC (ii)
...                      ...        ...          ...
Ute_P1_Neg_31            TEC         25          TEC
Ute_P1_GFP_43        SC (ii)         32      SC (ii)
Ute_P1_GFP_45        SC (ii)         39      SC (ii)
Ute_P1_GFP_49        SC (ii)         30      SC (ii)
Ute_P1_tdTom_68  HC (iii-iv)         40  HC (iii-iv)

[158 rows x 3 columns]
>>> de_filter1 = adata.obs.cluster.isin(["HC (i)"])
>>> de_filter2 = adata.obs.cluster.isin(["SC (i)"])
>>> selected1 = adata[de_filter1, :]
>>> selected2 = adata[de_filter2, :]
>>> de_selected = selected1.concatenate(selected2)  # Concatenate to sort in the order I want
>>> de_selected.obs
                  cluster  replicate louvain batch
index
Ute_P1_tdTom_9-0   HC (i)          6  HC (i)     0
Ute_P1_tdTom_0-0   HC (i)         13  HC (i)     0
Ute_P1_Neg_26-0    HC (i)          1  HC (i)     0
Ute_P1_tdTom_19-0  HC (i)         12  HC (i)     0
Ute_P1_tdTom_37-0  HC (i)          9  HC (i)     0
Ute_P1_tdTom_20-0  HC (i)         14  HC (i)     0
Ute_P1_tdTom_25-0  HC (i)         11  HC (i)     0
Ute_P1_tdTom_47-0  HC (i)          4  HC (i)     0
Ute_P1_tdTom_57-0  HC (i)         10  HC (i)     0
Ute_P1_tdTom_52-0  HC (i)          8  HC (i)     0
Ute_P1_tdTom_66-0  HC (i)          2  HC (i)     0
Ute_P1_tdTom_63-0  HC (i)          3  HC (i)     0
Ute_P1_tdTom_60-0  HC (i)          5  HC (i)     0
Ute_P1_tdTom_64-0  HC (i)          7  HC (i)     0
Ute_P1_GFP_3-1     SC (i)          8  SC (i)     1
Ute_P1_GFP_1-1     SC (i)          9  SC (i)     1
Ute_P1_Neg_3-1     SC (i)          4  SC (i)     1
Ute_P1_GFP_2-1     SC (i)         11  SC (i)     1
Ute_P1_GFP_10-1    SC (i)         10  SC (i)     1
Ute_P1_Neg_15-1    SC (i)          2  SC (i)     1
Ute_P1_Neg_20-1    SC (i)          1  SC (i)     1
Ute_P1_Neg_19-1    SC (i)          3  SC (i)     1
Ute_P1_GFP_9-1     SC (i)          6  SC (i)     1
Ute_P1_GFP_26-1    SC (i)          7  SC (i)     1
Ute_P1_GFP_34-1    SC (i)         14  SC (i)     1
Ute_P1_GFP_37-1    SC (i)         13  SC (i)     1
Ute_P1_GFP_35-1    SC (i)         15  SC (i)     1
Ute_P1_Neg_34-1    SC (i)          5  SC (i)     1
Ute_P1_GFP_48-1    SC (i)         18  SC (i)     1
Ute_P1_GFP_39-1    SC (i)         17  SC (i)     1
Ute_P1_GFP_40-1    SC (i)         16  SC (i)     1
Ute_P1_GFP_53-1    SC (i)         12  SC (i)     1
>>> de_selected.obs.cluster.unique()    # HC is encountered before SC
array(['HC (i)', 'SC (i)'], dtype=object)
>>> de_selected_opposite = selected2.concatenate(selected1)
>>> de_selected_opposite.obs.cluster.unique()    # SC is encountered before HC. The "np.unique(de_selected_opposite.obs.cluster)" call gives HC first, due to the sorting of uniq vals
array(['SC (i)', 'HC (i)'], dtype=object)
>>> de.test.t_test(de_selected, grouping="cluster", gene_names=de_selected.var["gene_symbol"], is_logged=True).summary()
                gene      pval      qval    log2fc      mean  zero_mean  zero_variance
0      0610005C13Rik       NaN       NaN  0.000000  0.000000       True           True
1      0610009B22Rik  0.242574  0.584453 -1.766539  5.018637      False          False
2      0610009E02Rik  0.528240  0.753421  0.893326  2.728064      False          False
3      0610009L18Rik  0.724149  0.872133  0.249389  0.570136      False          False
4      0610010F05Rik  0.473403  0.716048 -1.073669  1.446289      False          False
...              ...       ...       ...       ...       ...        ...            ...
22807         Zyg11a       NaN       NaN  0.000000  0.000000       True           True
22808         Zyg11b  0.272259  0.584453 -1.095174  0.903192      False          False
22809            Zyx  0.488327  0.727949  1.132955  6.172250      False          False
22810          Zzef1  0.952420  0.980600 -0.075449  0.599622      False          False
22811           Zzz3  0.972575  0.989865  0.046684  2.544517      False          False

[22812 rows x 7 columns]
>>> de.test.t_test(de_selected_opposite, grouping="cluster", gene_names=de_selected_opposite.var["gene_symbol"], is_logged=True).summary()
                gene      pval      qval    log2fc      mean  zero_mean  zero_variance
0      0610005C13Rik       NaN       NaN  0.000000  0.000000       True           True
1      0610009B22Rik  0.242574  0.584453  1.766539  5.018637      False          False
2      0610009E02Rik  0.528240  0.753421 -0.893326  2.728064      False          False
3      0610009L18Rik  0.724149  0.872133 -0.249389  0.570136      False          False
4      0610010F05Rik  0.473403  0.716048  1.073669  1.446289      False          False
...              ...       ...       ...       ...       ...        ...            ...
22807         Zyg11a       NaN       NaN  0.000000  0.000000       True           True
22808         Zyg11b  0.272259  0.584453  1.095174  0.903192      False          False
22809            Zyx  0.488327  0.727949 -1.132955  6.172250      False          False
22810          Zzef1  0.952420  0.980600  0.075449  0.599622      False          False
22811           Zzz3  0.972575  0.989865 -0.046684  2.544517      False          False

[22812 rows x 7 columns]

# For each gene between the two summary dataframes, the pval/qval is the same, but the log2fc is of the opposite sign due to switched order of HC (i) and SC (i)
jorvis commented 2 years ago

We are currently using @adkinsrs version in production to address issues we had.

Benfeitas commented 1 year ago

I'd like to mention that on version 'v0.7.4+21.g12f1286 the problem remains, and only https://github.com/theislab/diffxpy/issues/205#issuecomment-882643747 solves the issue.