martinjzhang / scDRS

Single-cell disease relevance score (scDRS)
https://martinjzhang.github.io/scDRS/
MIT License
114 stars 16 forks source link

The graphs from a single result and multiple results are inconsistent #74

Closed QianChwnLyn closed 7 months ago

QianChwnLyn commented 1 year ago

Hello, when I looked at the enrichment of AD symptoms in my data alone, I found that AD was enriched in MG. But when I looked at multiple diseases at the same time, it was not significant. Please tell me how to solve this problem.

b001bdf308eb97a86ae2e1b31903aff

image

952f687118cbfbc956ce6a9dc993b5f
martinjzhang commented 1 year ago

It is probably a multiple hypothesis testing issue. For this figure, we controlled FDR all cell type-trait pairs at a stringent threshold <0.05 (benjamini hochberg). You can try a higher threshold like 0.1 or 0.2.

QianChwnLyn commented 1 year ago

Hello, is it modified during the perform-downstream process? I didn't find the relevant parameters in the parameters of the function. Also set "df_fdr_prop"=0.2 in the drawing function scdrs.util.plot_group_stats and the result remains unchanged.

发件人: Martin Jinye @.> 发送时间: 2023年11月20日 23:15 收件人: @.> 抄送: @.>; @.> 主题: Re: [martinjzhang/scDRS] The graphs from a single result and multiple results are inconsistent (Issue #74)

It is probably a multiple hypothesis testing issue. For this figure, we controlled FDR all cell type-trait pairs at a stringent threshold <0.05 (benjamini hochberg). You can try a higher threshold like 0.1 or 0.2.

― Reply to this email directly, view it on GitHubhttps://github.com/martinjzhang/scDRS/issues/74#issuecomment-1819258281, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AYQEHKHWGNES3ITAKVN5RHDYFNXXRAVCNFSM6AAAAAA7SN6CFWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMJZGI2TQMRYGE. You are receiving this because you authored the thread.Message ID: @.***>

martinjzhang commented 1 year ago

Hi @KangchengHou , can you help with this?

KangchengHou commented 1 year ago

Hi @QianChwnLyn

The small square for AD-MG denote cell-type-level association significance at FDR < 0.05, and therefore is expected to change if you plot one trait vs. multiple traits.

And I can confirm that plot_group_stats does not have the option to vary FDR significance threshold. Perhaps the most straightforward option is to plot your results using AD.scdrs_group.SubClass file directly. The column you want to use is assoc_mcz.

QianChwnLyn commented 1 year ago

Hi @KangchengHou

The drawing I understand is to put together the *.scdrs_group.SubClass results of multiple gene sets, and the drawing code does not see any further processing? So why do the results of drawing one and drawing multiple ones differ?

And I found that there is no such problem in some data when multiple gene sets are plotted on one graph. If this can be solved?

martinjzhang commented 1 year ago

Hi @QianChwnLyn

I want to point out that this is not a bug but rather a statistical nuance.

In multiple hypothesis testing (e.g., FDR control), the significance of a given hypothesis depends on p-values of other hypotheses. In our case, each hypothesis is a cell type-disease pair. If we test the hypotheses across diseases instead of just one disease, the results will change.

@KangchengHou, do you think it makes sense to update the plot_group_stats function with a tunable FDR threshold?

KangchengHou commented 1 year ago

yes see https://github.com/martinjzhang/scDRS/pull/75/commits/83d626445aeaa4bc556e7dcb553844921b2e2dc7 (i can't directly merge into master which is protected)

on the other hand, user could just take the raw dataframe and plot in the style as they want

QianChwnLyn commented 1 year ago

@KangchengHou

Thanks! How can I get your modified scdrs/util.py file replace my installed scdrs/util.py file? Maybe I need to modify the util.py according to your method?

KangchengHou commented 11 months ago

you could git clone the latest version to use this updated scdrs

wenjiezeng08 commented 9 months ago

Hi, thanks for the discussion! I have a follow-up question.

The output dataframe of the "_scdrs.method.downstream_groupanalysis" function has the columns of "assoc_mcp" and "assoc_mcz". Are the values in these two columns already corrected for multiple comparison, as indicated by the abbreviation "mc(multiple comparison)"? Or did I interpret the abbreviation incorrectly?

martinjzhang commented 9 months ago

no. "mc" refers to "Monte Carlo". you will still need to apply the FDR control procedure.

wenjiezeng08 commented 9 months ago

no. "mc" refers to "Monte Carlo". you will still need to apply the FDR control procedure.

Oh Thank you so much!

wenjiezeng08 commented 9 months ago

I also want my p value to be adjusted at the cell type level only, instead of for each cell-type-disease combination. I understand this is not a bug and the rationale behind the original paper, for using cell-type disease combination for correcting multiple testing. I think both ways can be justified.

Therefore, I found a way to customize the multi-disease/phenotype graph, which demonstrates the fdr corrected at cell type level only. This requires you to serve the the _df_fdrprop, _df_assocfdr, and _df_heterofdr separately, instead of using the _dict_dfstats parameter.

First, I calculated the BH-adjusted p value for each phenotype:

## Association FDR
for trait in dict_df_stats.keys():
    dict_df_stats[trait]['assoc_fdr']=multipletests(dict_df_stats[trait]['assoc_mcp'], method='fdr_bh')[1]
## Heterogeneous FDR
for trait in dict_df_stats.keys():
    dict_df_stats[trait]['hetero_fdr']=multipletests(dict_df_stats[trait]['hetero_mcp'], method='fdr_bh')[1]

Then, I defined the df_fdr_prop, df_assoc_fdr, and df_hetero_fdr separately, for which I also modified the fdr df to frd<0.2.

fig, ax = scdrs.util.plot_group_stats(##  dataframe of proportion of cells with FDR < 0.1: modified to 0.2
                                      df_fdr_prop= pd.DataFrame({trait: dict_df_stats[trait]['n_fdr_0.2']/dict_df_stats[trait]['n_cell']
                                                                 for trait in dict_df_stats.keys()}).transpose(),
                                      ## dataframe of group-level association statistics
                                      df_assoc_fdr= pd.DataFrame({trait: dict_df_stats[trait]['padj']
                                                                  for trait in dict_df_stats.keys()}).transpose(),
                                      ## dataframe of group-level heterogeneity statistics
                                      df_hetero_fdr= pd.DataFrame({trait: dict_df_stats[trait]['hetero_mcp']
                                                     for trait in dict_df_stats.keys()}).transpose(),
                                      ## threshold for FDR correction of cell type-level mean association statistics 
                                      assoc_fdr_threshold =0.2,
                                      ## Plotting parameters
                                      plot_kws={"cb_vmax": 0.1,### max value of the color bar
                                                "cb_fraction":0.1### fraction of colorbar
                                               }
                                     )

Thank you again for this great package and meaningful discussion.