The need for architectures that can leverage multiple perturbations

Building control/stimulated pairs for each perturbation and training a separate model for that perturbation requires a substantial number of cells. Most datasets that cover multiple perturbations contain only a few stimulated cells, so it is challenging to generalize from so few samples.
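To get a feel for how severe the low-sample problem is in a given dataset, a quick count of stimulated cells per perturbation could help. This is a minimal sketch: the per-cell label list, the `"control"` label, the drug names, and the `min_cells` threshold are all assumptions made for illustration, not values from any dataset used here.

```python
from collections import defaultdict


def count_cells_per_perturbation(labels, control_label="control"):
    """Count stimulated cells for each perturbation, ignoring control cells."""
    counts = defaultdict(int)
    for label in labels:
        if label != control_label:
            counts[label] += 1
    return dict(counts)


def low_sample_perturbations(labels, min_cells, control_label="control"):
    """Return the perturbations with fewer than `min_cells` stimulated cells."""
    counts = count_cells_per_perturbation(labels, control_label)
    return sorted(p for p, n in counts.items() if n < min_cells)


# Toy labels, one per cell (hypothetical perturbation names).
labels = ["control"] * 5 + ["drug_a"] * 4 + ["drug_b"] * 1 + ["drug_c"] * 2
print(count_cells_per_perturbation(labels))
print(low_sample_perturbations(labels, min_cells=3))
```

With a real dataset the labels would come from the cell metadata, and the threshold would depend on how many cells the chosen model needs per perturbation.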

On the other hand, to leverage multi-task learning we need multiple tasks to solve. With control and perturbed profiles as input, it could be useful to have tasks beyond perturbation response prediction. For example, we could use the pairs to classify the perturbation type, or to predict the pathway or the target. Is it a valid task to take a pair of control and stimulated gene expression profiles and infer the target protein?

Most of the tasks that other frameworks such as CPA and scVIDR address are solved with post-processing. The model is trained to reconstruct its input, and along the way the latent space becomes structured in a way that can be leveraged afterwards.

For multi-task learning, we need to bring the tasks into the architecture (right?). Otherwise, how would the model actually do multi-task learning? To do that, the input needs to carry information that can be leveraged for multiple tasks. Frameworks that rely on reconstruction and take all control and perturbed cells as a single unstructured input stream make it infeasible to define structured tasks as outputs.
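To make "bringing the tasks into the architecture" concrete, here is a rough sketch of a shared encoder over a (control, stimulated) pair with one classification head per task. It is a forward pass only, written in NumPy, and it is not the design of any of the frameworks mentioned here. The class name, the two tasks (perturbation type and target), the layer sizes, and the equal loss weights are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_layer(n_in, n_out):
    """Small random weights and a zero bias for one dense layer."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)


class MultiTaskPairModel:
    """Shared encoder over a (control, stimulated) pair, one head per task."""

    def __init__(self, n_genes, n_hidden, n_perturbations, n_targets):
        self.encoder = init_layer(2 * n_genes, n_hidden)
        self.heads = {
            "perturbation": init_layer(n_hidden, n_perturbations),
            "target": init_layer(n_hidden, n_targets),
        }

    def forward(self, control, stimulated):
        # Concatenate the pair so every head sees both profiles.
        x = np.concatenate([control, stimulated], axis=1)
        enc_w, enc_b = self.encoder
        hidden = np.maximum(x @ enc_w + enc_b, 0.0)  # ReLU
        return {
            name: hidden @ head_w + head_b
            for name, (head_w, head_b) in self.heads.items()
        }


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy for integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def multitask_loss(outputs, labels, weights):
    """Weighted sum of the per-task losses."""
    return sum(weights[t] * cross_entropy(outputs[t], labels[t]) for t in outputs)


# Toy data: 8 pairs, 20 genes, 3 perturbation types, 5 candidate targets.
n_cells, n_genes = 8, 20
control = rng.random((n_cells, n_genes))
stimulated = rng.random((n_cells, n_genes))
model = MultiTaskPairModel(n_genes, n_hidden=16, n_perturbations=3, n_targets=5)
outputs = model.forward(control, stimulated)
labels = {
    "perturbation": rng.integers(0, 3, n_cells),
    "target": rng.integers(0, 5, n_cells),
}
loss = multitask_loss(outputs, labels, {"perturbation": 1.0, "target": 1.0})
print(loss)
```

The point of the sketch is only the shape of the problem: because the input is a structured pair, each task gets an explicit output and an explicit loss term, instead of being recovered from the latent space after training.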

Architectures like scButterfly could give us insight. However, scButterfly struggles on datasets with multiple perturbations but only a few samples. It can work for big datasets with a single perturbation, but in our case of exploring multiple perturbations it doesn't seem to be that helpful. The results show poor performance on the number of common DEGs, whereas evaluation on the train and validation sets looks pretty good. So the issue could indeed be the number of cells, or the inability to generalize to an unseen cell type.
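For reference, the "number of common DEGs" metric can be sketched as follows. This is a simplified stand-in: it ranks genes by absolute difference in mean expression between stimulated and control cells, whereas an actual evaluation would typically use a statistical test for differential expression. The function names, the toy matrices, and `k` are assumptions for illustration.

```python
import numpy as np


def top_degs(control, stimulated, gene_names, k):
    """Rank genes by absolute difference in mean expression; return the top k."""
    diff = np.abs(stimulated.mean(axis=0) - control.mean(axis=0))
    top_idx = np.argsort(diff)[::-1][:k]
    return {gene_names[i] for i in top_idx}


def common_degs(control, true_stim, pred_stim, gene_names, k=100):
    """Count genes shared by the top-k DEGs of the true and predicted response."""
    true_set = top_degs(control, true_stim, gene_names, k)
    pred_set = top_degs(control, pred_stim, gene_names, k)
    return len(true_set & pred_set)


# Toy example: 4 cells, 5 genes (rows are cells, columns are genes).
gene_names = ["g0", "g1", "g2", "g3", "g4"]
control = np.zeros((4, 5))
true_stim = control + np.array([3.0, 0.0, 2.0, 0.0, 0.0])  # true shift in g0, g2
pred_stim = control + np.array([3.0, 0.0, 0.0, 2.0, 0.0])  # predicted shift in g0, g3
print(common_degs(control, true_stim, pred_stim, gene_names, k=2))  # 1
```

Comparing this count on the train, validation, and held-out cell types side by side would show whether the drop comes from the held-out split specifically.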

I could also try augmentations for scButterfly. But I would like to explore scPreGAN because it satisfies the input requirements: it takes the perturbation as input and doesn't need pairs. The architecture also seems straightforward.

Architectures that don't solve tasks with post-processing of the latent space:

scButterfly
scPreGAN
CODEX

Explore more architectures here: https://github.com/xianglin226/Benchmarking-Single-Cell-Perturbation

Tasks:

[ ] Keep working on scButterfly and experiment with the augmentation
[ ] It would be useful to state that scButterfly may work with a few samples of unseen cells from seen cell types
[ ] Explore scPreGAN

It should be noted that:

"Existing methods for this task, such as scGen [17] and scPreGAN [18], do not account for batch effects between data matrices and assume differences in cell distribution between data matrices solely result from biological conditions, an assumption that does not hold for most real datasets. Furthermore, in practice, there is often more than one type of condition in the data, but existing methods are designed for only one type of condition."

Source: https://www.nature.com/articles/s41467-024-45227-w
