theislab / scCODA

A Bayesian model for compositional single-cell data analysis
BSD 3-Clause "New" or "Revised" License
141 stars 23 forks source link

All cell types as reference loop error #68

Closed AClab-sgarcia closed 1 year ago

AClab-sgarcia commented 1 year ago

Good morning, Thank you for developing this very useful package. I am running scCODA with my data. I am trying to use the code available in the section "Using all cell types as reference alternatively to reference selection".

`for ct in cell_types: print(f"Reference: {ct}")

# Run inference
model_temp = mod.CompositionalAnalysis(scCODA_data, formula="Cond", reference_cell_type=ct)
temp_results = model_temp.sample_hmc(num_results=20000)

# Select credible effects
cred_eff = temp_results.credible_effects()
cred_eff.index = cred_eff.index.droplevel(level=0)

# add up credible effects
results_cycle["times_credible"] += cred_eff.astype("int")`

As soon as you use the first cell_type, the following error occurs: Reference: 0 100%|██████████| 20000/20000 [01:02<00:00, 321.19it/s] MCMC sampling finished. (79.329 sec) Acceptance rate: 59.6% ValueError: cannot reindex on an axis with duplicate labels

I have tried to solve it, but all my solutions result in errors.

Thanks

johannesostner commented 1 year ago

Hi @AClab-sgarcia,

from your description of the error message, I cannot spot the problem. Could you please post the entire message? Also, which versions of scCODA, tensorflow, tensorflow-probability and pandas are you using? Just making sure that all versions are up to date and compatible...

Thanks!

AClab-sgarcia commented 1 year ago

Good morning @johannesostner

I can't send you any more information than that, as that's all the error I get. As I said, I followed the "Using all cell types as reference alternatively to reference selection" section of: https://sccoda.readthedocs.io/en/latest/Modeling_options_and_result_analysis.html

The version of the packages:

Thank you!

johannesostner commented 1 year ago

This seems to be a pandas error, but I cannot see which function caused this error from what you posted. I was looking for an error traceback like Error in ....: ValueError: cannot reindex on an axis with duplicate labels to point down which line caused the error, but I guess that it is this one: ´cred_eff.index = cred_eff.index.droplevel(level=0)´.

How many conditions do you have in your Cond column? If it's more than two, I could see a way how this error is caused

mbuttner commented 1 year ago

When your object scCODA_data is an anndata object, please check if the obs_names are unique. You may run scCODA_data.obs_names_make_unique(), which should fix issues with the indexing.

AClab-sgarcia commented 1 year ago

@johannesostner I have 4 samples, with 3 differente conditions in the "Cond" column. Yes, I think it's a pandas' error too, but I haven't been able to figure it out.

AClab-sgarcia commented 1 year ago

@mbuttner scCODA_data.obs_names output: Index(['AC1939', 'AC1941', 'AC1944_B', 'AC1944_T'], dtype='object', name='orig.ident')

mbuttner commented 1 year ago

Thanks. That looks good, though. The anndata object has more index objects, though, which may cause the error. Is scCODA_data.var_names unique? Can you share the output of print(scCODA_data), if possible?

AClab-sgarcia commented 1 year ago

Thanks for your quick reply @mbuttner

scCODA_data.var_names output: Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20'], dtype='object', name='seurat_clusters')

print(scCODA_data) output: AnnData object with n_obs × n_vars = 4 × 21 obs: 'Cond' var: 'n_cells'

mbuttner commented 1 year ago

OK, the indexes look all unique, the anndata object is fine as far as I can tell. Have you tried to run scCODA without a reference cell type? Does that result in the same error?

johannesostner commented 1 year ago

The problem does likely not come from the initial scCODA data but from the line cred_eff.index = cred_eff.index.droplevel(level=0), which uses the scCODA output DataFrame. The line drops the covariate indicator level (which is no problem if there is only one covariate, but leaves non-unique names if there are multiple covariates). I'll see if I can find a workaround

mbuttner commented 1 year ago

Thanks @johannesostner

johannesostner commented 1 year ago

@AClab-sgarcia just to make sure that I am not mistaken here: Are you able to run scCODA on your data outside of the loop? i.e.


model_temp = mod.CompositionalAnalysis(scCODA_data, formula="Cond", reference_cell_type="automatic")
temp_results = model_temp.sample_hmc(num_results=20000)
AClab-sgarcia commented 1 year ago

@johannesostner I've been able to run scCODA both with a automatic reference selection and manual reference selection. I was trying to run it inside the loop as I can not really see any clsuter with low dispersion of relative abundance. I just wanted to compare the different results with different clusters as the reference.

johannesostner commented 1 year ago

Ok, I managed to reproduce your error and find a workaround - the error was indeed caused by the multiple covariates in your data.

You can use this code as an alternative:


cell_types = scCODA_data.var.index
results_cycle = None

for ct in cell_types:
    print(f"Reference: {ct}")

    # Run inference
    model_temp = mod.CompositionalAnalysis(scCODA_data, formula="Cond", reference_cell_type=ct)
    temp_results = model_temp.sample_hmc(num_results=20000, num_burnin=5000)

    # Select credible effects
    cred_eff = temp_results.credible_effects()
    cred_eff.index = ['_'.join(i) for i in cred_eff.index.values]

    # add up credible effects
    if results_cycle is None:
        results_cycle = pd.DataFrame(index=cred_eff.index, columns=["times_credible"]).fillna(0)
    results_cycle["times_credible"] += cred_eff.astype("int")
AClab-sgarcia commented 1 year ago

What do you mean by multiple covariates in my data? Do you mean that i was using 3 conditions for 4 different samples?

johannesostner commented 1 year ago

Exactly! The code in the tutorial was only intended for a dataset with one covariate, i.e. a comparison of two conditions. It is totally valid to do the same analysis while comparing multiple conditions (resulting in multiple covariates), as you intend to do. The code you used was simply not intended to deal with this scenario, therefore I created a small workaround for your case that does the same, but can deal with multi-group comparisons

AClab-sgarcia commented 1 year ago

Thank you so much for the help and also for the package!