theislab / scCODA

A Bayesian model for compositional single-cell data analysis
BSD 3-Clause "New" or "Revised" License

alr t test #100

Open feanaros opened 4 months ago

feanaros commented 4 months ago

Hi, what is meant by alr_t_model.fit_model(reference_cell_type=4) in the frequentist test? I don't know which cell type to use as a reference.

Moreover, when I loop to evaluate the best reference cell type, after:

# Calculate percentages
results_cycle["pct_credible"] = results_cycle["times_credible"]/len(cell_types)
results_cycle["is_credible"] = results_cycle["pct_credible"] > 0.5
print(results_cycle)

all the values of is_credible are False. Can you give me some advice?

johannesostner commented 4 months ago

Hi @feanaros,

The ALR-t model performs a t-test on additive log-ratio (ALR) transformed data. This transformation needs a reference cell type, just as the scCODA model does. If you want to compare both approaches, you should use the same reference in each. In the example you quote, reference_cell_type=4 is a zero-based index, so the 5th cell type in the dataset is used as the reference.
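To make the role of the reference concrete, here is a minimal sketch of an ALR transform followed by a per-cell-type Welch t statistic. It is not scCODA's implementation: the counts are hypothetical, the helper names `alr` and `welch_t` are made up for illustration, and the pseudocount of 0.5 is an assumption used only to avoid taking the log of zero.

```python
import math
from statistics import mean, variance


def alr(counts, reference_index, pseudocount=0.5):
    """Additive log-ratio: log of each part over the reference part.

    The reference itself is dropped, so the result has one entry fewer.
    The pseudocount (an assumption of this sketch) avoids log(0).
    """
    shifted = [c + pseudocount for c in counts]
    ref = shifted[reference_index]
    return [math.log(x / ref) for i, x in enumerate(shifted) if i != reference_index]


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


# Hypothetical counts: rows are samples, columns are three cell types.
group_a = [[100, 50, 200], [120, 40, 210], [90, 60, 190]]
group_b = [[100, 150, 200], [110, 140, 220], [95, 160, 180]]

reference_index = 2  # zero-based, so this is the 3rd cell type
alr_a = [alr(row, reference_index) for row in group_a]
alr_b = [alr(row, reference_index) for row in group_b]

# One t statistic per non-reference cell type.
t_stats = [
    welch_t([row[j] for row in alr_a], [row[j] for row in alr_b])
    for j in range(len(alr_a[0]))
]
print(t_stats)
```

Every value is measured relative to the reference, so changing `reference_index` changes every statistic. Turning the statistics into p-values needs a t-distribution, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`.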

As for your second question, this happens when no cell type in your dataset was credible in more than half of the runs. To get credible effects, you could either raise the FDR level used in each run, or lower the 0.5 threshold in the second line of the code snippet you showed.
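The aggregation behind that snippet can be sketched as follows. The run results are hypothetical and the `summarize` helper is made up for illustration. It only mirrors the arithmetic of the posted lines (times credible divided by the number of cell types, compared against a threshold).

```python
# Hypothetical outcome of one run per candidate reference:
# reference cell type -> set of cell types flagged credible in that run.
runs = {
    "A": {"C"},
    "B": {"C"},
    "C": set(),
    "D": {"B"},
}
cell_types = list(runs)


def summarize(runs, cell_types, threshold):
    """Count how often each cell type was credible and apply the threshold."""
    summary = {}
    for ct in cell_types:
        times = sum(ct in credible for credible in runs.values())
        pct = times / len(cell_types)
        summary[ct] = {
            "times_credible": times,
            "pct_credible": pct,
            "is_credible": pct > threshold,
        }
    return summary


strict = summarize(runs, cell_types, threshold=0.5)
relaxed = summarize(runs, cell_types, threshold=0.25)

for ct in cell_types:
    print(ct, strict[ct]["pct_credible"], strict[ct]["is_credible"], relaxed[ct]["is_credible"])
```

In this toy case, cell type C is credible in exactly half of the runs, so the strict `> 0.5` comparison rejects it while the relaxed threshold accepts it. A cell type cannot be credible in the run where it is itself the reference, so with this denominator the largest reachable fraction is (n-1)/n.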

feanaros commented 4 months ago

@johannesostner thank you. So, for the first question, if I don't have a reference, how should I choose one? With the second test, or in a different way?

However, in the standard scCODA analysis, print(sim_results.credible_effects()) gives me all False. Does that mean my data show no significant changes in terms of proportions? (I tried FDR levels of both 0.1 and 0.4, with the same result.) Are there any parameters I can change? I'm not able to run ALDEx2 or the other tests; I get a couple of errors. Maybe I'm doing something wrong.

this is my input:

df = pd.read_csv("/Users/olga/table_sample_cluster.csv")
df

   Unnamed: 0    C0    C1  C10  C11    C2   C3   C4   C5   C6  C7  C8  C9    Sample  Genotype
0    notch3_1   856   897   10    0   542  223  228    6   62  39  24  19  notch3_1    notch3
1    notch3_2   974   749   14    0   512  180  156  186   52  30  17  14  notch3_2    notch3
2    notch3_3  1401  1320   22    1   942  286  304   63  104  46  42  10  notch3_3    notch3
3        wt_1   725   595   11    7   562  145  147   16   50  61  28  14      wt_1        wt
4        wt_3  1508  1164   14   49  1029  304  263  187  112  99  39  36      wt_3        wt

data_all = dat.from_pandas(df, covariate_columns=["Sample", "Genotype","Unnamed: 0"])
data_all.obs
     Sample Genotype Unnamed: 0
0  notch3_1   notch3   notch3_1
1  notch3_2   notch3   notch3_2
2  notch3_3   notch3   notch3_3
3      wt_1       wt       wt_1
4      wt_3       wt       wt_3

data_all

AnnData object with n_obs × n_vars = 5 × 12
    obs: 'Sample', 'Genotype', 'Unnamed: 0'

data_all.obs["Genotype"] 

0    notch3
1    notch3
2    notch3
3        wt
4        wt
Name: Genotype, dtype: object

print(data_all)

AnnData object with n_obs × n_vars = 5 × 12
    obs: 'Sample', 'Genotype', 'Unnamed: 0'

# Select the notch3 and wt samples
data_day = data_all[data_all.obs["Genotype"].isin(["notch3", "wt"])]
print(data_day.obs)

     Sample Genotype Unnamed: 0
0  notch3_1   notch3   notch3_1
1  notch3_2   notch3   notch3_2
2  notch3_3   notch3   notch3_3
3      wt_1       wt       wt_1
4      wt_3       wt       wt_3

johannesostner commented 4 months ago

The tests you mention (ALR-t, ALDEx2, ...) are alternatives to scCODA, which we mainly used for the comparison study in our paper (https://www.nature.com/articles/s41467-021-27150-6). For ALDEx2 and ANCOM-BC, you'll need an R environment with the packages installed, and you'll have to set the correct paths to this environment (r_home and r_path in the tutorial; these are specific to your operating system and installation location).

scCODA has an automatic reference selection (by setting reference_cell_type="automatic"), which you can use. It will tell you which reference it selected.
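For intuition about what an automatic choice might look for, here is a minimal sketch of one plausible heuristic: among cell types present in almost every sample, pick the one whose relative abundance varies least. This is an assumption for illustration, not scCODA's actual code. The `pick_reference` helper and its `max_zero_fraction` parameter are hypothetical, and the counts are copied from the table posted above.

```python
from statistics import pvariance

# Counts copied from the table in this thread, in the table's column order.
cell_types = ["C0", "C1", "C10", "C11", "C2", "C3", "C4", "C5", "C6", "C7", "C8", "C9"]
counts = {
    "notch3_1": [856, 897, 10, 0, 542, 223, 228, 6, 62, 39, 24, 19],
    "notch3_2": [974, 749, 14, 0, 512, 180, 156, 186, 52, 30, 17, 14],
    "notch3_3": [1401, 1320, 22, 1, 942, 286, 304, 63, 104, 46, 42, 10],
    "wt_1": [725, 595, 11, 7, 562, 145, 147, 16, 50, 61, 28, 14],
    "wt_3": [1508, 1164, 14, 49, 1029, 304, 263, 187, 112, 99, 39, 36],
}


def pick_reference(counts, cell_types, max_zero_fraction=0.05):
    """Hypothetical heuristic: least-variable cell type that is rarely absent."""
    rows = list(counts.values())
    # Convert each sample's counts to relative abundances.
    props = [[c / sum(row) for c in row] for row in rows]
    candidates = {}
    for j, name in enumerate(cell_types):
        zero_fraction = sum(row[j] == 0 for row in rows) / len(rows)
        if zero_fraction <= max_zero_fraction:
            candidates[name] = pvariance([p[j] for p in props])
    return min(candidates, key=candidates.get)


print(pick_reference(counts, cell_types))
```

With these counts, C11 is excluded because it is absent from two of the five samples. A purely variance-based rule like this tends to favor rare cell types, so it is worth checking that the selected reference also makes biological sense.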

Currently, you don't get any credible effects from your data, probably because the proportions do not differ enough between the conditions. With a sample size as low as yours (three notch3 and two wt samples), an effect has to be considerable to be detected credibly. You might want to check your data visually with a grouped boxplot, as in our advanced tutorial. There, you can also find more information on reference selection, etc.
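As a text-only stand-in for the grouped boxplot, the per-sample proportions can be summarized per genotype. This is a minimal sketch using the counts posted above. The `proportions` helper is made up for illustration and the summary is descriptive only, not a statistical test.

```python
from statistics import mean

# Counts copied from the table in this thread, in the table's column order.
cell_types = ["C0", "C1", "C10", "C11", "C2", "C3", "C4", "C5", "C6", "C7", "C8", "C9"]
samples = {
    "notch3_1": ("notch3", [856, 897, 10, 0, 542, 223, 228, 6, 62, 39, 24, 19]),
    "notch3_2": ("notch3", [974, 749, 14, 0, 512, 180, 156, 186, 52, 30, 17, 14]),
    "notch3_3": ("notch3", [1401, 1320, 22, 1, 942, 286, 304, 63, 104, 46, 42, 10]),
    "wt_1": ("wt", [725, 595, 11, 7, 562, 145, 147, 16, 50, 61, 28, 14]),
    "wt_3": ("wt", [1508, 1164, 14, 49, 1029, 304, 263, 187, 112, 99, 39, 36]),
}


def proportions(row):
    """Convert one sample's counts to relative abundances."""
    total = sum(row)
    return [c / total for c in row]


# Group the per-sample proportion vectors by genotype.
by_genotype = {}
for genotype, row in samples.values():
    by_genotype.setdefault(genotype, []).append(proportions(row))

# Per cell type: mean and range of the proportion within each genotype.
for j, name in enumerate(cell_types):
    parts = []
    for genotype, rows in by_genotype.items():
        values = [r[j] for r in rows]
        parts.append(
            f"{genotype}: mean={mean(values):.3f} "
            f"range=[{min(values):.3f}, {max(values):.3f}]"
        )
    print(f"{name:>4}  " + "  ".join(parts))
```

If the ranges of the two genotypes overlap for most cell types, it is plausible that no effect is credible with this few samples.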