Describe the problem
Although self.adata exists, there is no obvious getter method for it. Also, the benchmark splits don't provide an AnnData option.

Describe the solution you'd like
Getter method(s) for the AnnData object; also implement the splits for AnnData.

Additional context from Slack
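
As a rough illustration of the getter being requested, here is a minimal sketch. The class name `AnnDataGetterSketch` and the method name `get_adata` are hypothetical and are not part of TDC; the only thing taken from the thread is that the loader already keeps the AnnData object on `self.adata`.

```python
class AnnDataGetterSketch:
    """Illustrative stand-in for the loader; not the actual TDC class."""

    def __init__(self, adata):
        # Per the Slack thread, the loader already stores the AnnData
        # object on `self.adata`.
        self.adata = adata

    def get_adata(self):
        """Return the stored AnnData object (hypothetical method name)."""
        if self.adata is None:
            raise ValueError("No AnnData object has been loaded yet.")
        return self.adata
```

An explicit method like this would mostly be about discoverability, since users would no longer need to know that the attribute exists.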
Oh, is there a function to load that already? I checked, and when we download the raw file it is in the adata format
11:12 AM
yes
11:12
https://github.com/mims-harvard/TDC/blob/main/tdc/multi_pred/anndata_dataset.py#L10
anndata_dataset.py: self.adata = self.df # this is in AnnData format
11:12
self.adata will contain the anndata dataframe
11:12
apologies, i should expose that better via a getter function or something
11:14
The existing loader for perturboutcome inherits from the anndata loader
11:14
https://github.com/mims-harvard/TDC/blob/main/tdc/multi_pred/single_cell.py#L11
single_cell.py: class CellXGeneTemplate(DataLoader):
11:14
https://github.com/mims-harvard/TDC/blob/main/tdc/multi_pred/perturboutcome.py#L16
perturboutcome.py: class PerturbOutcome(CellXGeneTemplate):
11:15
so self.adata will be anndata :slightly_smiling_face:
11:17
though i suppose for the benchmark, the splits are not implemented for anndata
11:18 AM
Ah I see, I can take the data split from the split function and then return a dictionary of train/val/test adata
11:18
Do you think it makes sense to set this as the default for the benchmark? I believe most method developers use adata for model training
11:18 AM
I see
11:19
Ok. Well, let’s make a flag for use_anndata and set it to True by default?
11:19 AM
Sounds good
11:19
I will do that
11:19 AM
I’d rather not get rid of the pandas code
11:19
cool!
11:20 AM
Sounds good
11:20 AM
Sorry for these discrepancies, the lab has been moving away from anndata, so I forget we still currently have some dependencies on it
11:24 AM
I see, no worries, but I do want to point out that for most single-cell analysis/ML models people still use adata, because there is a lot of cell observation metadata (e.g. perturbation) and gene metadata that needs to be stored. For ease of use, I feel like we can still provide an adata flag if people need it!
11:25 AM
absolutely. i’ll add an action item to better expose the getters for anndata
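
To make the plan from the thread concrete, here is a minimal sketch of how the `use_anndata` flag could wrap the existing pandas split. The function name `split_as_anndata` is hypothetical. The sketch assumes the existing split is a dict of pandas DataFrames whose index labels match `adata.obs_names`, which may not hold for the actual benchmark. It relies only on AnnData's label-based row indexing (`adata[labels]`) and `.copy()`.

```python
def split_as_anndata(adata, pandas_split, use_anndata=True):
    """Return the split as AnnData objects, or unchanged as pandas.

    adata        -- AnnData object holding every observation
    pandas_split -- dict mapping split name to a DataFrame, as produced by
                    the existing pandas split code (ASSUMPTION: each
                    frame's index holds labels from `adata.obs_names`)
    use_anndata  -- flag from the thread; True returns AnnData per split
    """
    if not use_anndata:
        # Keep the existing pandas behaviour untouched.
        return pandas_split
    # Slice the full AnnData by each split's labels; .copy() turns the
    # resulting view into an independent object.
    return {
        name: adata[list(frame.index)].copy()
        for name, frame in pandas_split.items()
    }
```

With this shape the pandas code path stays as it is, and the AnnData output is derived from it rather than replacing it.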