Question regarding phenotype_selection

JihyunKiminGithub commented 2 years ago

Dear BayesTME developers,

Thank you very much again for sharing this software. It is working seamlessly. I have one quick question regarding the "phenotype_selection" command. I initially tried M-fold cross-validation with the Python implementation to determine the suggested number of cell types and lambda, but it took too long, so I gave up. The command-line version, on the other hand, ran through and generated a folder with a bunch of fold#.h5ad files, but it did not output the optimal number of cell types and lambda. Is it possible to get the suggested number of cell types and lambda out of the folder/files generated by "phenotype_selection"? I would appreciate it if you could kindly offer help.

Best, Jihyun

jeffquinn-msk commented 2 years ago

Hey Jihyun,

Yes, the phenotype selection step is the most time-intensive part. We are looking at redesigning this method in the near future to be much faster, so stay tuned.

If you ran it on a single computer, the parameter sweep will take a very long time. We intend for this part to be distributed over many machines, with each of the folds run in parallel. I designed it so that it would be easy to do this via a high-performance computing cluster, AWS Batch, or Google Cloud Run. The phenotype_selection command has the --job-index flag. The idea is that you create an array of N jobs, each with --job-index set to 0, 1, 2, 3, ..., so that together they cover the whole parameter space and all k-fold samples.
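To illustrate the job-array idea, here is a minimal sketch of how a single integer index can address one point in a parameter sweep. This is not BayesTME's actual implementation: the grid values, the ordering, and the names `job_grid` and `params_for_job` are assumptions made up for illustration.

```python
from itertools import product

# Hypothetical sweep values; substitute the grid you actually want to search.
N_COMPONENTS = [2, 3, 4, 5]
LAMBDAS = [1, 10, 100, 1000]
N_FOLDS = 5


def job_grid(n_components=N_COMPONENTS, lambdas=LAMBDAS, n_folds=N_FOLDS):
    """Enumerate every (n_components, lambda, fold) combination in a fixed order."""
    return list(product(n_components, lambdas, range(n_folds)))


def params_for_job(job_index, grid=None):
    """Return the combination that array task `job_index` is responsible for."""
    grid = job_grid() if grid is None else grid
    if not 0 <= job_index < len(grid):
        raise IndexError(f"job_index must be in [0, {len(grid) - 1}]")
    return grid[job_index]


if __name__ == "__main__":
    grid = job_grid()
    # The array needs one task per grid entry, indexed 0 .. len(grid) - 1.
    print(f"{len(grid)} jobs needed")
    print("job 0 handles", params_for_job(0, grid))
```

Because every task derives its parameters from the same deterministic ordering, the tasks need no coordination with each other: a scheduler only has to launch one task per index.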

If I understand you correctly, you were trying to "fake" the output of this step by creating a bunch of fold files so that the pipeline could continue? This isn't really necessary. If you want to skip phenotype selection, you can just take a guess at lambda and the number of cell types and pass them to the deconvolution step via the --n-components and --lam2 flags, or do the equivalent via the Python API.

Best,

Jeff

tansey commented 2 years ago

Hi Jihyun,

Adding to what Jeff said: if you would like to pick a lambda value, we usually find lambda=1000 to be fairly robust. You're then free to choose the number of cell types that makes the most sense to you, or to try each one and eyeball what makes sense. There's no way of knowing for sure what the right number is, so even our cross-validation routine is just an educated guess.
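As a rough sketch of the "fix lambda, try each number of cell types" approach, the snippet below only builds the --n-components and --lam2 flag pairs mentioned earlier in this thread, one set per candidate. The candidate range and the helper names are hypothetical, and any other arguments the deconvolution step requires (input and output paths, for example) are deliberately left out.

```python
LAM2 = 1000  # the lambda value suggested above as fairly robust
CANDIDATE_N_COMPONENTS = range(2, 9)  # hypothetical range of cell-type counts to try


def deconvolution_flags(n_components, lam2=LAM2):
    """Build the two flags that fix the number of cell types and lambda."""
    return ["--n-components", str(n_components), "--lam2", str(lam2)]


def candidate_runs(candidates=CANDIDATE_N_COMPONENTS, lam2=LAM2):
    """Map each candidate number of cell types to its flag list."""
    return {k: deconvolution_flags(k, lam2) for k in candidates}


if __name__ == "__main__":
    for k, flags in candidate_runs().items():
        # Append these flags to your own deconvolution invocation for each run.
        print(k, " ".join(flags))
```

Each run's output can then be inspected side by side to judge which number of cell types looks most plausible for the tissue.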

If you have paired scRNA from the same tissue, we have a workflow for incorporating that so you don't need to run the cross-validation.

Alternatively, if you have a scRNA atlas from the same type of tissue, but don't have paired scRNA for this specific sample, we're working on adding that workflow and would be happy to test it on your use case if you'd like.

JihyunKiminGithub commented 2 years ago

Dear jeffquinn and tansey,

Thank you for your detailed explanation. It was really helpful.

Best, Jihyun

tansey-lab / bayestme

Question regarding phenotype_selection #52