Biroscak/esm2 embedding computation

This pull request contains Python files for computing embeddings and gene maps, along with some light exploration of the obtained data, so that no information is lost.

The ESM2 embeddings can be computed with the following file:

./embeddings/esm2/compute_protein_embeddings/compute_protein_embeddings.py

There is also an experimental version that uses FSDP (Fully Sharded Data Parallel, which allows the model to be split across multiple GPUs):

./embeddings/esm2/compute_protein_embeddings/compute_protein_embeddings_experimental.py

The FSDP version should work, but I didn't have time to tune how the model is split, so it is likely to be slow.

The ESM embedding computation runs in batches, because moving data between the CPU and GPU after every protein would be expensive. During the computation a log.txt file is kept, so that one can look up what went wrong, and where and when a computation was interrupted.
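To illustrate the batching and logging idea, here is a minimal sketch. It is not the code from compute_protein_embeddings.py: the names iter_batches, embed_in_batches and toy_embed_batch are hypothetical, and toy_embed_batch is a stand-in for the real ESM2 forward pass so that the example runs without a GPU or model weights. The log line format is also an assumption.

```python
import time
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

Protein = Tuple[str, str]  # (protein_id, amino-acid sequence)


def iter_batches(items: Sequence[Protein], batch_size: int) -> Iterator[List[Protein]]:
    """Yield consecutive chunks of at most `batch_size` proteins."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_in_batches(
    proteins: Sequence[Protein],
    batch_size: int,
    embed_batch: Callable[[List[str]], List[List[float]]],
    log_path: str = "log.txt",
) -> Dict[str, List[float]]:
    """Embed proteins batch by batch and append one log line per batch."""
    embeddings: Dict[str, List[float]] = {}
    with open(log_path, "a", encoding="utf-8") as log:
        for index, batch in enumerate(iter_batches(proteins, batch_size)):
            ids = [protein_id for protein_id, _ in batch]
            sequences = [sequence for _, sequence in batch]
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            try:
                # One call per batch, so a real model would need only one
                # CPU-to-GPU round trip per batch instead of one per protein.
                vectors = embed_batch(sequences)
            except Exception as exc:
                log.write(f"{stamp} batch {index} FAILED ids={ids} error={exc!r}\n")
                log.flush()
                raise
            embeddings.update(zip(ids, vectors))
            log.write(f"{stamp} batch {index} done ids={ids}\n")
            # Flush so the log is still useful if the run is interrupted.
            log.flush()
    return embeddings


def toy_embed_batch(sequences: List[str]) -> List[List[float]]:
    """Hypothetical placeholder for the model: [length, number of 'A'] per sequence."""
    return [[float(len(s)), float(s.count("A"))] for s in sequences]
```

With this layout, the last "done" line in the log tells you which batch to resume from after an interruption, and a "FAILED" line names the proteins that were in the offending batch.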

The gene embedding computation can be found in the folder:

embeddings/esm2/generate_gene_embeddings

Note that the code in generate_gene_embeddings works with per-protein embedding files, not with batch files, so run unwrap.py, which does the relevant transformation for you, before running the script.
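As a rough picture of what "unwrapping" means, the sketch below splits batch files into one file per protein. It is not the actual unwrap.py: the function name unwrap_batches is hypothetical, and the file layout (each batch stored as a JSON object mapping a protein ID to its embedding vector) is an assumption made only so the example is self-contained. The real files may well use a different format.

```python
import json
from pathlib import Path
from typing import List


def unwrap_batches(batch_dir: str, out_dir: str) -> List[str]:
    """Write one JSON file per protein from the batch files in `batch_dir`.

    Assumed (hypothetical) layout: every *.json file in `batch_dir` holds
    a mapping {protein_id: embedding_vector}.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written: List[str] = []
    # Sort so the output order does not depend on the file system.
    for batch_file in sorted(Path(batch_dir).glob("*.json")):
        with batch_file.open(encoding="utf-8") as fh:
            batch = json.load(fh)
        for protein_id, vector in batch.items():
            target = out / f"{protein_id}.json"
            with target.open("w", encoding="utf-8") as fh:
                json.dump(vector, fh)
            written.append(protein_id)
    return written
```

After a step like this, a downstream script can look up any protein by its ID without knowing which batch it was computed in.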

Finally, there's a Jupyter notebook, exploration.ipynb, that does some sanity checks and some light data exploration. Maybe this file shouldn't be included, but I am including it for completeness.
