This pull request contains Python files used for the computation of:
embeddings
gene maps
It also includes some light exploration of the resulting data, so that no information is lost.
The ESM2 embedding computation can be done via the following file:
./embeddings/esm2/compute_protein_embeddings/compute_protein_embeddings.py
There is also an experimental version that uses FSDP (which allows the model to be split across multiple GPUs) at
./embeddings/esm2/compute_protein_embeddings/compute_protein_embeddings_experimental.py
The FSDP version should work, but I did not have time to tune how the model is split, so it is likely to be slow.
The ESM embedding computation runs in batches, because moving data between the CPU and GPU after each protein would be expensive. During the computation a log.txt file is kept, so that one can look up what went wrong, and where and when a computation was interrupted.
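To illustrate the batching and logging idea, here is a minimal sketch, not the code from this pull request. The names `batched`, `embed_all`, and `embed_batch` are hypothetical; `embed_batch` stands in for whatever runs the ESM2 forward pass on a list of sequences, and the log message format is an assumption.

```python
import logging
from typing import Callable, Dict, Iterator, List, Sequence, Tuple


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_all(
    proteins: Sequence[Tuple[str, str]],        # (protein_id, sequence) pairs
    embed_batch: Callable[[List[str]], List],   # hypothetical stand-in for the model call
    batch_size: int,
    log_path: str = "log.txt",
) -> Dict[str, object]:
    """Embed proteins batch by batch, recording progress and failures in a log file."""
    logger = logging.getLogger("embedding_sketch")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)

    results: Dict[str, object] = {}
    try:
        for index, batch in enumerate(batched(proteins, batch_size)):
            try:
                # One call per batch, so data crosses the CPU/GPU boundary once per batch.
                embeddings = embed_batch([sequence for _, sequence in batch])
            except Exception:
                # The log records which batch failed and when, then the error propagates.
                logger.exception("batch %d failed (first id: %s)", index, batch[0][0])
                raise
            for (protein_id, _), embedding in zip(batch, embeddings):
                results[protein_id] = embedding
            logger.info("batch %d done (%d proteins)", index, len(batch))
    finally:
        logger.removeHandler(handler)
        handler.close()
    return results
```

With a log like this, an interrupted run can be resumed from the last batch index reported as done, provided the results are also saved per batch (which this sketch does not do).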
The gene embedding computation can be found in the folder:
embeddings/esm2/generate_gene_embeddings
Note that the code in generate_gene_embeddings works with per-protein embedding files, not with batch files,
so first run unwrap.py, which does the relevant transformation for you, before running the script.
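For readers unfamiliar with this kind of transformation, the sketch below shows one way batch files could be split into per-protein files. It is not the actual unwrap.py: the function name `unwrap_batches`, the pickle format, the `*.pkl` naming, and the assumption that each batch file holds a dict mapping protein id to embedding are all hypothetical.

```python
import pickle
from pathlib import Path


def unwrap_batches(batch_dir: str, out_dir: str) -> int:
    """Split every batch file in `batch_dir` into one file per protein in `out_dir`.

    Assumes (hypothetically) that each batch file is a pickled dict of the form
    {protein_id: embedding}. Returns the number of per-protein files written.
    """
    source = Path(batch_dir)
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)

    written = 0
    for batch_path in sorted(source.glob("*.pkl")):
        with batch_path.open("rb") as fh:
            batch = pickle.load(fh)  # assumed layout: {protein_id: embedding}
        for protein_id, embedding in batch.items():
            # One output file per protein, named after its id.
            with (target / f"{protein_id}.pkl").open("wb") as out:
                pickle.dump(embedding, out)
            written += 1
    return written
```

If the real batch files use a different container (for example tensors saved with a deep learning framework), only the load and dump calls would change; the loop structure stays the same.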
Finally, there is a Jupyter notebook, exploration.ipynb, that does some sanity checks and some light data exploration.
Maybe this file shouldn't be included, but I am including it for completeness.