mims-harvard / TDC

Therapeutics Commons (TDC-2): Multimodal Foundation for Therapeutic Science
https://tdcommons.ai
MIT License

better expose anndata dataframe in the single-cell dataloaders #267

Open amva13 opened 4 months ago

amva13 commented 4 months ago

Describe the problem: Though self.adata exists, there is no obvious getter method. Also, the splits don't provide an anndata option.

Describe the solution you'd like: Add getter method(s), and implement the splits for anndata as well.
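A minimal sketch of what such a getter could look like. The class name `AnnDataLoaderSketch` and the method name `get_adata` are hypothetical, chosen only for illustration; the issue does not say what the final API will be called.

```python
from typing import Any


class AnnDataLoaderSketch:
    """Hypothetical stand-in for a loader that keeps its data on self.adata."""

    def __init__(self, adata: Any):
        # The issue states that self.adata already exists on the loader;
        # here it is simply whatever object was passed in.
        self.adata = adata

    def get_adata(self) -> Any:
        """Return the stored AnnData object (hypothetical getter name)."""
        return self.adata


# Any object works for the demo; a real loader would hold an AnnData here.
loader = AnnDataLoaderSketch({"placeholder": "adata"})
adata = loader.get_adata()
```

An explicit getter gives callers a documented entry point instead of reaching into the attribute directly.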

Additional context from Slack:

Oh, is there a function to load that already? I checked, and when we download the raw file it is in adata format

11:12 AM yes
11:12 https://github.com/mims-harvard/TDC/blob/main/tdc/multi_pred/anndata_dataset.py#L10

anndata_dataset.py: self.adata = self.df # this is in AnnData format
11:12 self.adata will contain the anndata dataframe
11:12 apologies, i should expose that better via a getter function or something
11:14 The existing loader for perturboutcome inherits from the anndata loader
11:14 https://github.com/mims-harvard/TDC/blob/main/tdc/multi_pred/single_cell.py#L11

single_cell.py: class CellXGeneTemplate(DataLoader):
11:14 https://github.com/mims-harvard/TDC/blob/main/tdc/multi_pred/perturboutcome.py#L16

perturboutcome.py: class PerturbOutcome(CellXGeneTemplate):
11:15 so self.adata will be anndata :slightly_smiling_face:
11:17 though i suppose for the benchmark, the splits are not implemented for anndata

11:18 AM Ah I see, I can take the data split from the split function and then return a dictionary of train/val/test adata
11:18 Do you think it makes sense to set this as the default for the benchmark? I believe most method developers are using adata for model training
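A rough sketch of that idea, assuming the split can be expressed as integer row positions per fold and that `adata[rows]` selects observations by position (which is how AnnData indexing behaves for a list of ints). The helper name `split_adata`, the fold names, and the `ToyCells` stand-in are all hypothetical; the stand-in exists only so the snippet runs without the anndata package.

```python
from typing import Any, Dict, List, Mapping, Sequence


def split_adata(adata: Any, folds: Mapping[str, Sequence[int]]) -> Dict[str, Any]:
    """Return one subset of `adata` per fold (hypothetical helper)."""
    subsets = {}
    for name, rows in folds.items():
        view = adata[list(rows)]
        # AnnData hands back a view here; copy it when possible so each
        # fold owns its data.
        subsets[name] = view.copy() if hasattr(view, "copy") else view
    return subsets


class ToyCells:
    """Tiny stand-in for AnnData so the sketch runs without the library."""

    def __init__(self, obs_names: List[str]):
        self.obs_names = list(obs_names)

    def __getitem__(self, rows: List[int]) -> "ToyCells":
        return ToyCells([self.obs_names[i] for i in rows])

    def copy(self) -> "ToyCells":
        return ToyCells(self.obs_names)


cells = ToyCells(["c0", "c1", "c2", "c3", "c4"])
# Hypothetical fold names and positions; the real split function defines these.
splits = split_adata(cells, {"train": [0, 1, 2], "val": [3], "test": [4]})
```

With a real AnnData object the same helper would return AnnData subsets, so the cell and gene metadata travel with each fold.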

11:18 AM I see
11:19 Ok. Well, let’s make a flag for use_anndata and set it to True by default?

11:19 AM Sounds good
11:19 I will do that

11:19 AM I’d rather not get rid of the pandas code
11:19 cool!
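One way the flag could sit next to the existing tabular path, sketched under assumptions: only the `use_anndata` name and its `True` default come from the chat. The class `SplitLoaderSketch`, the 80/10/10 positional split, the fold names, and the `ToyAnnData` stand-in are hypothetical and are not the benchmark's actual logic.

```python
from typing import Any, Dict, List


class ToyAnnData:
    """Stand-in so the sketch runs without the anndata package."""

    def __init__(self, names):
        self.obs_names = list(names)

    def __getitem__(self, rows: List[int]) -> "ToyAnnData":
        return ToyAnnData(self.obs_names[i] for i in rows)


class SplitLoaderSketch:
    """Hypothetical loader holding both representations of the same cells."""

    def __init__(self, rows: List[dict], adata: Any):
        self.rows = rows    # stand-in for the existing pandas code path
        self.adata = adata  # stand-in for the AnnData object

    def _folds(self) -> Dict[str, List[int]]:
        # Illustrative 80/10/10 positional split, not the benchmark's logic.
        n = len(self.rows)
        a, b = int(n * 0.8), int(n * 0.9)
        idx = list(range(n))
        return {"train": idx[:a], "val": idx[a:b], "test": idx[b:]}

    def get_split(self, use_anndata: bool = True) -> Dict[str, Any]:
        folds = self._folds()
        if use_anndata:
            # AnnData accepts a list of integer row positions.
            return {k: self.adata[v] for k, v in folds.items()}
        # The tabular path is kept; it is just no longer the default.
        return {k: [self.rows[i] for i in v] for k, v in folds.items()}


names = [f"cell{i}" for i in range(10)]
loader = SplitLoaderSketch([{"cell": n} for n in names], ToyAnnData(names))
as_adata = loader.get_split()                  # default: AnnData-style subsets
as_rows = loader.get_split(use_anndata=False)  # tabular path still available
```

Both branches share the same fold computation, so the two formats always describe the same train/val/test partition.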

11:20 AM Sounds good

11:20 AM Sorry for these discrepancies. The lab has been moving away from anndata, so I forget we still currently have some dependencies on it

11:24 AM I see, no worries, but I do want to point out that for most single-cell analysis/ML models people still use adata, because there is a lot of cell-observation metadata (e.g. perturbation) and gene metadata that needs to be stored. For ease of use, I feel like we can still provide an adata flag if people need it!

11:25 AM absolutely. i’ll add an action item to better expose the getters for anndata