What's the difference between lincs_full.h5ad and lincs.h5ad?

huawen-poppy commented 1 year ago

Hello. Thank you for the nice work.

Following the lincs.ipynb notebook, I generated the lincs_pp.h5ad file. However, the lincs_SMILES.ipynb notebook uses lincs_full_pp.h5ad as input. What is the difference between lincs.h5ad and lincs_full.h5ad, and when should the full file be used?

siboehm commented 1 year ago

@MxMstrmn can you help out here?

MxMstrmn commented 1 year ago

Hi @huawen-poppy,

The preprocessing pipeline should be able to deal with both lincs_pp.h5ad and lincs_full_pp.h5ad. I must have renamed them for clarity at some point. When I check:

from chemCPA.paths import DATA_DIR

# Paths to the small and the full LINCS datasets
adata_path = DATA_DIR / "lincs_small_.h5ad"
adata_path_full = DATA_DIR / "lincs_complete.h5ad"

# Fail early if either file is missing
assert adata_path.exists()
assert adata_path_full.exists()

# %%
import scanpy as sc

adata_small = sc.read(adata_path)
adata_full = sc.read(adata_path_full)

# %%
# Print a summary (shape and obs/var/uns keys) of each dataset
print(adata_small)
print(adata_full)

I get the following result:

AnnData object with n_obs × n_vars = 199620 × 978
    obs: 'cell_id', 'det_plate', 'det_well', 'lincs_phase', 'pert_dose', 'pert_dose_unit', 'pert_id', 'pert_iname', 'pert_mfc_id', 'pert_time', 'pert_time_unit', 'pert_type', 'rna_plate', 'rna_well', 'batch', 'condition', 'cell_type', 'dose_val', 'cov_drug_dose_name', 'control', 'split'
    var: 'pr_gene_title', 'pr_is_lm', 'pr_is_bing'
    uns: 'rank_genes_groups_cov'
AnnData object with n_obs × n_vars = 840677 × 977
    obs: 'cell_id', 'det_plate', 'det_well', 'lincs_phase', 'pert_dose', 'pert_dose_unit', 'pert_id', 'pert_iname', 'pert_mfc_id', 'pert_time', 'pert_time_unit', 'pert_type', 'rna_plate', 'rna_well', 'condition', 'cell_type', 'dose_val', 'cov_drug_dose_name', 'control', 'split', 'canonical_smiles', 'split1', 'random_split', 'split_ood_drugs'
    var: 'pr_gene_title', 'pr_is_lm', 'pr_is_bing', 'gene_id', 'in_sciplex'
    uns: 'cydata_pull', 'rank_genes_groups_cov'

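So it is mainly a matter of dataset size: the small file has 199,620 observations and the full file has 840,677. The full file also carries a few extra annotations, such as 'canonical_smiles'. The difference in gene numbers (978 vs. 977) comes from the fact that I was not able to match one of the 978 genes with the sci-Plex-3 data.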
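For anyone who wants to check the differences themselves, here is a minimal sketch that compares the obs column names copied from the two printouts above. The helper name compare_names is hypothetical (it is not part of chemCPA), and the column lists are typed in by hand so the snippet runs without the data files.

```python
def compare_names(small_names, full_names):
    """Return (only_in_small, only_in_full, shared) as sorted lists."""
    small, full = set(small_names), set(full_names)
    return sorted(small - full), sorted(full - small), sorted(small & full)


# obs columns copied by hand from the two AnnData printouts above
obs_small = [
    "cell_id", "det_plate", "det_well", "lincs_phase", "pert_dose",
    "pert_dose_unit", "pert_id", "pert_iname", "pert_mfc_id", "pert_time",
    "pert_time_unit", "pert_type", "rna_plate", "rna_well", "batch",
    "condition", "cell_type", "dose_val", "cov_drug_dose_name", "control",
    "split",
]
obs_full = [
    "cell_id", "det_plate", "det_well", "lincs_phase", "pert_dose",
    "pert_dose_unit", "pert_id", "pert_iname", "pert_mfc_id", "pert_time",
    "pert_time_unit", "pert_type", "rna_plate", "rna_well", "condition",
    "cell_type", "dose_val", "cov_drug_dose_name", "control", "split",
    "canonical_smiles", "split1", "random_split", "split_ood_drugs",
]

only_small, only_full, shared = compare_names(obs_small, obs_full)
print("only in small:", only_small)  # ['batch']
print("only in full:", only_full)    # ['canonical_smiles', 'random_split', 'split1', 'split_ood_drugs']
print("shared columns:", len(shared))  # 20
```

With the two files loaded as in the snippet above, the same helper could be called as compare_names(adata_small.var_names, adata_full.var_names), which should list the gene that is missing from the full file.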

I hope that clarifies this. Let me know if you encounter further issues!

huawen-poppy commented 1 year ago

Hello @MxMstrmn, thank you for your kind explanation! It's clear to me now!

theislab / chemCPA

What's the difference between lincs_full.h5ad and lincs.h5ad? #110