mims-harvard / TDC

Therapeutics Commons (TDC-2): Multimodal Foundation for Therapeutic Science
https://tdcommons.ai
MIT License

Load CRISPR perturbation datasets from scPerturb [Feature Request] #239

Closed abearab closed 5 months ago

abearab commented 5 months ago

Describe the bug

I'm interested in using single-cell CRISPR perturbation datasets such as the NormanWeissman2019 and ReplogleWeissman2022 datasets.

Full list of scPerturb datasets

Questions

  1. I tried to review the code in #236, but I couldn't tell whether the datasets were collected directly from scPerturb. Could you provide more information, please?
  2. How can I use TDC modules to load the scPerturb datasets in Python?

Suggestion

h5ad files for RNA and protein datasets, created using scanpy 1.9.1

For many reasons, it would be nice if the data loader function could let users load h5ad files as AnnData objects (at least as an option).


_Originally posted in https://github.com/mims-harvard/TDC/pull/236#discussion_r1554845208_

cc @amva13 @kexinhuang12345

abearab commented 5 months ago

Regarding the first question: I can now see that some of the scPerturb files have been uploaded to the TDC Dataverse.

amva13 commented 5 months ago

Closed with https://github.com/mims-harvard/TDC/pull/252. Thanks, @kexinhuang12345!

abearab commented 5 months ago

Awesome! Thanks @kexinhuang12345

abearab commented 5 months ago

Hi @kexinhuang12345, as you know, the ReplogleWeissman2022 study has three datasets.


As far as I understand, the ReplogleWeissman2022_K562_gwps data is not uploaded yet. However, I noticed odd behavior when I tried to load it. I already had ReplogleWeissman2022_k562_essential downloaded in a local folder, and when I tried loading scperturb_gene_ReplogleWeissman2022_K562_gwps, it reported Found local copy...!

>>> test_load = PerturbOutcome('scperturb_gene_ReplogleWeissman2022_K562_gwps','Datasets')
Found local copy...
Loading...

The number of perturbations is wrong for the _gwps dataset: it should be 9867, but it is 2058 (the same number as in the _essential dataset).

>>> test_load.adata.obs.perturbation.unique()

Length: 2058

Looking more carefully, I tried an empty folder and noticed that, for some reason, it downloads the wrong file for _gwps.

>>> test_load = PerturbOutcome('scperturb_gene_ReplogleWeissman2022_K562_gwps','Datasets/new/')
Downloading...
█████████████████████████████████████████████| 1.55G/1.55G [01:09<00:00, 22.2MiB/s]
Loading...
~: ls Datasets/new/

scperturb_gene_ReplogleWeissman2022_k562_essential.h5ad
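
For anyone who wants to catch this kind of mismatch before trusting a loaded dataset, here is a minimal sketch of a file name check. It is not part of TDC: `find_matching_h5ad` and `check_download` are hypothetical helpers, and the sketch assumes (based only on the listing above) that a downloaded file is named after the dataset with an `.h5ad` extension. The comparison ignores case because the names in this thread differ in `K562` vs `k562`.

```python
from pathlib import Path
from typing import List


def find_matching_h5ad(dataset_name: str, folder: str) -> List[str]:
    """Return the .h5ad file names in `folder` whose stem equals `dataset_name`.

    Hypothetical helper, not a TDC function. Assumes files are named
    "<dataset_name>.h5ad"; the comparison is case-insensitive.
    """
    wanted = dataset_name.lower()
    return sorted(
        p.name for p in Path(folder).glob("*.h5ad") if p.stem.lower() == wanted
    )


def check_download(dataset_name: str, folder: str) -> bool:
    """Return True if `folder` holds a file for the requested dataset."""
    return bool(find_matching_h5ad(dataset_name, folder))
```

With the folder listing above, `check_download('scperturb_gene_ReplogleWeissman2022_K562_gwps', 'Datasets/new/')` would return `False`, since the only file present is the `_essential` one.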

cc @amva13