Fixed issue #7 for the KIBA dataset. ESM embeddings (see #8) are still far too large to hold in memory, so an alternative approach is needed. The best idea so far is to store only the tokens in the dataset and then, during training, run them through the ESM model to get the embeddings on the fly (see the sketch below).
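A minimal sketch of that idea, assuming a plain PyTorch `Dataset` wrapper; the class and field names (`TokenizedProteinDataset`, `enc`) are hypothetical and only illustrate storing token ids instead of embeddings:

```python
import torch
from torch.utils.data import Dataset
from transformers import AutoTokenizer, EsmModel

class TokenizedProteinDataset(Dataset):
    def __init__(self, prot_seqs, model_name='facebook/esm2_t6_8M_UR50D'):
        tok = AutoTokenizer.from_pretrained(model_name)
        # store only token ids + attention masks (a few MB), not embeddings (~197 GB)
        self.enc = tok(prot_seqs, return_tensors='pt', padding=True)

    def __len__(self):
        return self.enc['input_ids'].shape[0]

    def __getitem__(self, idx):
        return {k: v[idx] for k, v in self.enc.items()}

# Inside the training loop the embeddings would then be computed per batch:
# esm_mdl = EsmModel.from_pretrained('facebook/esm2_t6_8M_UR50D')
# for batch in loader:
#     with torch.no_grad():  # or keep grads if fine-tuning ESM
#         emb = esm_mdl(**batch).last_hidden_state  # (B, L, emb_dim)
```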
ESM embeddings are around 197 GB for just the 442 proteins in Davis:
```python
# https://huggingface.co/facebook/esm2_t36_3B_UR50D is 11GB
import pandas as pd
from transformers import AutoTokenizer, EsmConfig, EsmModel

df = pd.read_csv('../data/DavisKibaDataset/davis_msa/processed/XY.csv', index_col=0)

config = EsmConfig.from_pretrained('facebook/esm2_t6_8M_UR50D')
esm_tok = AutoTokenizer.from_pretrained('facebook/esm2_t6_8M_UR50D')
# this will raise a warning since the LM head is missing, but that is okay since we are not using it:
esm_mdl = EsmModel.from_pretrained('facebook/esm2_t6_8M_UR50D')

prot_seqs = list(df['prot_seq'].unique())
# tokenize all 442 unique sequences at once, padded to the longest sequence
tok = esm_tok(prot_seqs, return_tensors='pt', padding=True)
out = esm_mdl(**tok)  # single forward pass over every protein -> OOM below
pro_feat = out.last_hidden_state.squeeze()  # (N, L, emb_dim)
```
```
RuntimeError                              Traceback (most recent call last)
/home/jyaacoub/projects/MutDTA/run.py in line 2
     54 # %%
----> 55 out = esm_mdl(**tok)
     56

RuntimeError: [enforce fail at alloc_cpu.cpp:75] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 196789854240 bytes. Error code 12 (Cannot allocate memory)
```
196789854240 bytes ≈ 197 GB
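Even just for reproducing or precomputing, the forward pass has to be chunked; running all 442 padded sequences through the model in one call is what triggers the allocation above. A rough sketch under that assumption (the helper `embed_in_batches` and its parameters are made up for illustration, not part of the codebase):

```python
import torch

def embed_in_batches(esm_mdl, esm_tok, prot_seqs, batch_size=8, device='cpu'):
    esm_mdl = esm_mdl.to(device).eval()
    feats = []
    with torch.no_grad():
        for i in range(0, len(prot_seqs), batch_size):
            chunk = prot_seqs[i:i + batch_size]
            # padding happens per chunk, so peak memory scales with the chunk,
            # not with all 442 proteins at once
            tok = esm_tok(chunk, return_tensors='pt', padding=True).to(device)
            out = esm_mdl(**tok)
            feats.append(out.last_hidden_state.cpu())  # (B, L_chunk, emb_dim)
    return feats  # list of per-chunk tensors; padded lengths differ between chunks
```

This only bounds the memory of the forward pass itself; storing every embedding would still run into the size problem from #8, which is why the store-tokens-and-embed-during-training idea above looks preferable.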