thodkatz / thesis


The need for architectures that can leverage multiple perturbations #4

Open thodkatz opened 6 days ago

thodkatz commented 6 days ago

Creating control/stimulated pairs for each perturbation and training a model for that specific perturbation requires a substantial number of cells. Most perturbation datasets that contain multiple perturbations include only a few stimulated cells, so it is challenging to generalize from so few samples.
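To make the sample-size problem concrete, here is a minimal sketch of how such pairs could be built. It assumes each cell is annotated with a cell type and a condition, and that a stimulated cell is paired with a randomly chosen control cell of the same cell type. The function name `make_pairs`, the tuple layout and the toy labels are hypothetical, not taken from any of the frameworks discussed here.

```python
import random
from collections import defaultdict

def make_pairs(cells, control_label="control", seed=0):
    """cells: list of (cell_id, cell_type, condition) tuples (hypothetical layout).
    Returns {perturbation: [(control_id, stimulated_id), ...]}."""
    rng = random.Random(seed)
    controls = defaultdict(list)    # cell_type -> control cell ids
    stimulated = defaultdict(list)  # (perturbation, cell_type) -> stimulated cell ids
    for cell_id, cell_type, condition in cells:
        if condition == control_label:
            controls[cell_type].append(cell_id)
        else:
            stimulated[(condition, cell_type)].append(cell_id)

    pairs = defaultdict(list)
    for (perturbation, cell_type), stim_ids in stimulated.items():
        ctrl_ids = controls.get(cell_type, [])
        if not ctrl_ids:
            continue  # no control cells of this type, so nothing to pair with
        for stim_id in stim_ids:
            # Assumption: random matching within the same cell type.
            pairs[perturbation].append((rng.choice(ctrl_ids), stim_id))
    return dict(pairs)

# Toy annotations (made up for illustration).
cells = [
    ("c1", "hepatocyte", "control"),
    ("c2", "hepatocyte", "control"),
    ("c3", "hepatocyte", "drug_A"),
    ("c4", "hepatocyte", "drug_B"),
    ("c5", "hepatocyte", "drug_B"),
]
pairs = make_pairs(cells)
for perturbation, items in pairs.items():
    print(perturbation, len(items))
```

The number of pairs per perturbation is bounded by the number of stimulated cells for that perturbation, which is exactly where a per-perturbation model runs out of training data.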

On the other hand, to leverage multi-task learning we need to solve multiple tasks. With control and perturbation as input, it could be useful to have more than the perturbation response prediction task. For example, we could use the pairs to classify the perturbation type, or to predict the pathway or the target. Is it a valid task to take a pair of control and stimulated gene expression profiles and infer the target protein?

Most of the tasks solved by other types of frameworks, such as CPA and scVIDR, are solved with post-processing. The model attempts to reconstruct its input, and along the way the latent space becomes structured in a way that can be leveraged afterwards.

For multi-task learning, we need to bring the tasks into the architecture (right?). Otherwise, how will the model actually do multi-task learning? To do that, the input needs to carry information that can be leveraged for multiple tasks. Frameworks that rely on reconstruction and take all control and perturbed cells as a single, unstructured input stream make it infeasible to define structured tasks as outputs.
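One way to picture "bringing the tasks into the architecture" is a shared encoder with one output head per task and a summed loss. The sketch below is only an illustration under stated assumptions: a single linear encoder, a response-prediction head and a perturbation-classification head, random untrained weights, and toy data. None of the names (`encode`, `multitask_loss`, `cls_weight`) or the layer layout come from CPA, scVIDR, scButterfly or scPreGAN.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes, n_hidden, n_perts = 8, 50, 16, 4

# Toy data standing in for paired expression profiles (made-up values).
control = rng.normal(size=(n_cells, n_genes))
stimulated = control + rng.normal(scale=0.5, size=(n_cells, n_genes))
pert_labels = rng.integers(0, n_perts, size=n_cells)
pert_onehot = np.eye(n_perts)[pert_labels]

# Randomly initialised weights; training would update them with gradients.
W_enc = rng.normal(scale=0.1, size=(n_genes, n_hidden))             # shared encoder
W_resp = rng.normal(scale=0.1, size=(n_hidden + n_perts, n_genes))  # task 1 head
W_cls = rng.normal(scale=0.1, size=(2 * n_hidden, n_perts))         # task 2 head

def encode(x):
    """Shared encoder used by both tasks."""
    return np.tanh(x @ W_enc)

def multitask_loss(control, stimulated, pert_onehot, cls_weight=1.0):
    z_ctrl, z_stim = encode(control), encode(stimulated)
    # Task 1: predict the stimulated profile from control latent + perturbation label.
    pred = np.concatenate([z_ctrl, pert_onehot], axis=1) @ W_resp
    mse = np.mean((pred - stimulated) ** 2)
    # Task 2: classify the perturbation from the (control, stimulated) latent pair.
    logits = np.concatenate([z_ctrl, z_stim], axis=1) @ W_cls
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    cross_entropy = -np.mean(np.sum(pert_onehot * log_probs, axis=1))
    return mse + cls_weight * cross_entropy, mse, cross_entropy

total, mse, ce = multitask_loss(control, stimulated, pert_onehot)
print(f"total={total:.3f} mse={mse:.3f} ce={ce:.3f}")
```

Because both heads backpropagate through the same encoder in a trained version of this, the classification task would shape the latent space during training instead of being read out of it afterwards. Pathway or target prediction would be additional heads of the same kind, provided labels exist.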

Architectures like scButterfly could give us insight. But it struggles on datasets with multiple perturbations and only a few samples; it can work for big datasets with one perturbation. For our case of exploring multiple perturbations, it doesn't seem to be so helpful. The results show poor performance on the number of common DEGs, but on the other hand, when I evaluate on the train and validation sets it performs pretty well. So the issue could indeed be the number of cells, or the inability to generalize to an unseen cell type.
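For reference, a simplified sketch of a "number of common DEGs" count. It assumes DEGs are approximated by ranking genes on the absolute difference of mean expression between stimulated and control cells, which is a stand-in for a proper statistical test and not necessarily what the scButterfly evaluation does. The helper names `top_degs` and `common_degs` and the toy matrices are hypothetical.

```python
def top_degs(control, stimulated, gene_names, k):
    """Rank genes by |mean(stimulated) - mean(control)| and return the top k as a set."""
    def gene_means(matrix):
        n = len(matrix)
        return [sum(row[j] for row in matrix) / n for j in range(len(gene_names))]
    diffs = [abs(s - c) for s, c in zip(gene_means(stimulated), gene_means(control))]
    ranked = sorted(range(len(gene_names)), key=lambda j: diffs[j], reverse=True)
    return {gene_names[j] for j in ranked[:k]}

def common_degs(control, true_stim, pred_stim, gene_names, k=100):
    """Count genes that are top-k DEGs for both the true and the predicted response."""
    true_set = top_degs(control, true_stim, gene_names, k)
    pred_set = top_degs(control, pred_stim, gene_names, k)
    return len(true_set & pred_set)

# Toy matrices: rows are cells, columns are genes (made-up values).
genes = ["g1", "g2", "g3", "g4"]
control = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
true_stim = [[5.0, 1.0, 3.0, 1.0], [5.0, 1.0, 3.0, 1.0]]  # g1 and g3 respond
pred_stim = [[4.0, 1.0, 1.0, 2.0], [4.0, 1.0, 1.0, 2.0]]  # prediction recovers g1, misses g3
print(common_degs(control, true_stim, pred_stim, genes, k=2))  # prints 1
```

Computing the same count separately on the train set, the validation set and the held-out cell type might help tell apart the two explanations above (too few cells versus failure to generalize to an unseen cell type).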

I could also try augmentations for scButterfly. But I would like to explore scPreGAN because it satisfies the input requirements: it takes the perturbation as input and doesn't need pairs. The architecture also seems straightforward.

Architectures that don't solve tasks with post-processing of the latent space:

Tasks:

It should be noted that:

Existing methods for this task, such as scGen [17] and scPreGAN [18], do not account for batch effects between data matrices and assume differences in cell distribution between data matrices solely result from biological conditions, an assumption that does not hold for most real datasets. Furthermore, in practice, there is often more than one type of condition in the data, but existing methods are designed for only one type of condition. source: https://www.nature.com/articles/s41467-024-45227-w

thodkatz commented 6 days ago

Check this dataset, which may help resolve the issue of having only a few samples:

Nault et al. [23] performed all TCDD liver dose-response experiments, which were deposited in the Gene Expression Omnibus (GEO) [59] under the accession number GSE184506 used by scVIDR evalution #6