jyaacoub / MutDTA

Improving the precision oncology pipeline by providing binding affinity perturbation predictions on a priori identified cancer driver genes.
https://drive.google.com/drive/folders/1mdiA1gf1IjPZNhk79I2cYUu6pwcH0OTD

Feature extraction fixed #7 #12

Closed jyaacoub closed 1 year ago

jyaacoub commented 1 year ago

Fixed issue #7 for the KIBA dataset. The ESM embeddings (see #8) are still far too large to hold in memory, so an alternative solution is needed for them.
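One possible direction for such an alternative, offered as a sketch under stated assumptions and not as the approach taken in this repo: embed the sequences in small batches with gradients disabled, so that only one batch of activations exists at a time. The helper names `iter_batches`, `embed_in_batches` and `make_esm_embed_fn` are hypothetical. Sequences are grouped by length to limit padding, and the padding rows are dropped from each result.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(items: Sequence, batch_size: int) -> Iterator[list]:
    """Yield consecutive chunks of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_in_batches(seqs: Sequence[str],
                     embed_fn: Callable[[List[str]], list],
                     batch_size: int = 1) -> list:
    """Apply `embed_fn` batch by batch and return results in input order.

    Sequences are visited shortest-first so that each batch holds
    sequences of similar length, which keeps padding small.
    """
    order = sorted(range(len(seqs)), key=lambda i: len(seqs[i]))
    results = [None] * len(seqs)
    for idx_batch in iter_batches(order, batch_size):
        feats = embed_fn([seqs[i] for i in idx_batch])
        for i, feat in zip(idx_batch, feats):
            results[i] = feat
    return results


def make_esm_embed_fn(model_name: str = 'facebook/esm2_t6_8M_UR50D'):
    """Hypothetical helper: build an `embed_fn` backed by an ESM-2 checkpoint.

    Imports are local so the batching helpers above work without torch.
    """
    import torch
    from transformers import AutoTokenizer, EsmModel

    tok = AutoTokenizer.from_pretrained(model_name)
    mdl = EsmModel.from_pretrained(model_name).eval()

    def embed(batch: List[str]) -> list:
        enc = tok(batch, return_tensors='pt', padding=True)
        with torch.no_grad():  # no autograd graph, so activations are freed
            out = mdl(**enc)
        # Number of real (non-padding) tokens per sequence, special tokens included.
        lens = enc['attention_mask'].sum(dim=1)
        return [out.last_hidden_state[i, :int(n)].numpy()
                for i, n in enumerate(lens)]

    return embed
```

With the `prot_seqs` list from the snippet in this comment, a call such as `embed_in_batches(prot_seqs, make_esm_embed_fn(), batch_size=1)` would return one `(tokens, emb_dim)` array per sequence. Each array could also be written to disk as it is produced instead of being kept in a list.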


Computing ESM embeddings for just the 442 proteins in Davis in a single batch tries to allocate around 197 GB:

import pandas as pd
from transformers import AutoTokenizer, EsmConfig, EsmModel

# https://huggingface.co/facebook/esm2_t36_3B_UR50D is 11GB
df = pd.read_csv('../data/DavisKibaDataset/davis_msa/processed/XY.csv', index_col=0)
config = EsmConfig.from_pretrained('facebook/esm2_t6_8M_UR50D')
esm_tok = AutoTokenizer.from_pretrained('facebook/esm2_t6_8M_UR50D')
# this will raise a warning since the LM head is missing, but that is okay since we are not using it:
esm_mdl = EsmModel.from_pretrained('facebook/esm2_t6_8M_UR50D')
prot_seqs = list(df['prot_seq'].unique())
# tokenize every unique sequence at once, padded to the longest one
tok = esm_tok(prot_seqs, return_tensors='pt', padding=True)
out = esm_mdl(**tok)  # fails here: all sequences go through the model as one batch
pro_feat = out.last_hidden_state.squeeze() # N x L x emb_dim (squeeze only drops the batch dim when N == 1)
RuntimeError                              Traceback (most recent call last)
/home/jyaacoub/projects/MutDTA/run.py in line 2
      54 # %%
----> 55 out = esm_mdl(**tok)
      56
RuntimeError: [enforce fail at alloc_cpu.cpp:75] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate **196789854240 bytes**. Error code 12 (Cannot allocate memory)

196789854240 bytes ≈ 197 GB
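As a sanity check on where that number may come from, the arithmetic below shows one decomposition that reproduces it exactly: a float32 attention-score tensor of shape (sequences, heads, length, length). All four factors are assumptions and are not stated in this thread. 20 heads and a hidden size of 320 are taken to be the `esm2_t6_8M_UR50D` configuration, 2551 is taken to be the longest sequence plus two special tokens, and 378 is the count of unique sequences that makes the product match. If this reading is right, the failed allocation is a single layer's attention scores for the whole batch, while the output embeddings themselves would be far smaller.

```python
# Size of the failed allocation, copied from the RuntimeError above.
FAILED_ALLOC_BYTES = 196_789_854_240

# Assumed values (inferred, not stated in the thread).
N_SEQS = 378        # unique sequences in the batch, inferred from the factorization
N_HEADS = 20        # attention heads, assumed for esm2_t6_8M_UR50D
HIDDEN_DIM = 320    # embedding size, assumed for esm2_t6_8M_UR50D
PADDED_LEN = 2551   # assumed longest sequence plus 2 special tokens
FLOAT32_BYTES = 4


def attention_scores_bytes(n_seqs, n_heads, padded_len, itemsize=FLOAT32_BYTES):
    """Bytes for one layer's (n_seqs, n_heads, L, L) attention-score tensor."""
    return n_seqs * n_heads * padded_len * padded_len * itemsize


def hidden_state_bytes(n_seqs, padded_len, hidden_dim, itemsize=FLOAT32_BYTES):
    """Bytes for the (n_seqs, L, hidden_dim) output embeddings."""
    return n_seqs * padded_len * hidden_dim * itemsize


full_batch_attn = attention_scores_bytes(N_SEQS, N_HEADS, PADDED_LEN)
single_seq_attn = attention_scores_bytes(1, N_HEADS, PADDED_LEN)
full_batch_emb = hidden_state_bytes(N_SEQS, PADDED_LEN, HIDDEN_DIM)

print("matches failed allocation:", full_batch_attn == FAILED_ALLOC_BYTES)
print(f"attention scores, full batch:   {full_batch_attn / 1e9:.1f} GB")
print(f"attention scores, one sequence: {single_seq_attn / 1e9:.2f} GB")
print(f"padded output embeddings:       {full_batch_emb / 1e9:.2f} GB")
```

Because the attention term grows with the square of the padded length and linearly with the batch size, running one sequence (or a few) at a time should shrink the peak allocation much more than switching to a smaller embedding dimension would.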