theislab / chemCPA

Code for "Predicting Cellular Responses to Novel Drug Perturbations at a Single-Cell Resolution", NeurIPS 2022.
https://arxiv.org/abs/2204.13545
MIT License

Biroscak/esm2 embedding computation #160

Closed B1RO closed 4 months ago

B1RO commented 4 months ago

This pull request contains Python files used for computing ESM2 protein embeddings and gene embeddings, and also some light exploration of the obtained data, so that no information is lost.

The file ./embeddings/esm2/compute_protein_embeddings/compute_protein_embeddings.py computes the ESM2 embeddings. An experimental version, ./embeddings/esm2/compute_protein_embeddings/compute_protein_embeddings_experimental.py, uses FSDP, which allows the model to be split across multiple GPUs. The FSDP version should work, but I didn't have time to tune how the model is split, so it is likely to be slow.

The ESM embedding computation is batched, because moving data between CPU and GPU after each protein would be expensive. During computation a log.txt file is kept, so that one can look up what went wrong, and where and when a computation was interrupted.
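The batching and logging described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names `embed_in_batches`, `embed_batch`, and `chunked`, as well as the log line format, are hypothetical, and the model call is replaced by a plain Python callable so the example runs without a GPU.

```python
import time
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

Protein = Tuple[str, str]  # (protein id, amino-acid sequence)


def chunked(items: Sequence[Protein], batch_size: int) -> Iterator[List[Protein]]:
    """Yield consecutive slices of at most batch_size proteins."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_in_batches(
    proteins: Sequence[Protein],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 8,
    log_path: str = "log.txt",
) -> Dict[str, List[float]]:
    """Embed proteins batch by batch and record progress in a log file.

    embed_batch is a stand-in (assumption) for the real model call. In a
    GPU run it would transfer one whole batch, run the model, and copy
    the results back once, instead of once per protein.
    """
    embeddings: Dict[str, List[float]] = {}
    with open(log_path, "a", encoding="utf-8") as log:
        for index, batch in enumerate(chunked(proteins, batch_size)):
            ids = [pid for pid, _ in batch]
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            try:
                vectors = embed_batch([seq for _, seq in batch])
            except Exception as exc:
                # Record which batch failed and why, then re-raise.
                log.write(f"{stamp} batch {index} FAILED ids={ids} error={exc!r}\n")
                log.flush()
                raise
            embeddings.update(zip(ids, vectors))
            log.write(f"{stamp} batch {index} ok ids={ids}\n")
            log.flush()
    return embeddings
```

Flushing after every batch means the last line of the log always names the most recent batch that finished or failed, which is what makes an interrupted run easy to locate.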

The gene embedding computation can be found in the embeddings/esm2/generate_gene_embeddings folder. Note that the code in generate_gene_embeddings works with per-protein embedding files, not with batch files, so run unwrap.py before running the script; it does the relevant transformation for you.
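A conversion from batch files to per-protein files could look roughly like the sketch below. It is not the unwrap.py from this PR: the function name `unwrap_batches`, the `*.pkl` naming, and the assumption that each batch file is a pickled dict mapping protein IDs to embedding vectors are all hypothetical, chosen only to keep the example self-contained.

```python
import pickle
from pathlib import Path
from typing import List


def unwrap_batches(batch_dir: str, out_dir: str) -> List[Path]:
    """Split batch files into one file per protein.

    Assumes (hypothetically) that every *.pkl file in batch_dir holds a
    dict of the form {protein_id: embedding}.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for batch_file in sorted(Path(batch_dir).glob("*.pkl")):
        with batch_file.open("rb") as fh:
            batch = pickle.load(fh)
        for protein_id, embedding in batch.items():
            # One output file per protein, named after its id.
            target = out / f"{protein_id}.pkl"
            with target.open("wb") as fh:
                pickle.dump(embedding, fh)
            written.append(target)
    return written
```

The actual on-disk format of the batches may differ, in which case only the load and dump calls would need to change.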

Finally, there is a Jupyter notebook, exploration.ipynb, that does some sanity checks and some light data exploration. Maybe this file shouldn't be included, but I am including it for completeness.
