**Closed** — jyaacoub closed this issue 11 months ago.
It turns out the memory issues I was having with `10554787_EDI_PDBbind_anm_0.0001_20` were due to insufficient CPU/main memory (RAM), not GPU memory:
```
slurmstepd: error: Detected 1 oom_kill event in StepId=10554787.0. Some of the step tasks have been OOM Killed.
srun: error: node99: task 3: Out Of Memory

submitit ERROR (2023-10-31 12:17:33,974) - Submitted job triggered an exception
Traceback (most recent call last):
  File "/cluster/tools/software/centos7/python/3.10.9/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/cluster/tools/software/centos7/python/3.10.9/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/cluster/home/t122995uhn/projects/MutDTA/.venv/lib/python3.10/site-packages/submitit/core/_submit.py", line 11, in <module>
    submitit_main()
  File "/cluster/home/t122995uhn/projects/MutDTA/.venv/lib/python3.10/site-packages/submitit/core/submission.py", line 72, in submitit_main
    process_job(args.folder)
  File "/cluster/home/t122995uhn/projects/MutDTA/.venv/lib/python3.10/site-packages/submitit/core/submission.py", line 65, in process_job
    raise error
  File "/cluster/home/t122995uhn/projects/MutDTA/.venv/lib/python3.10/site-packages/submitit/core/submission.py", line 54, in process_job
    result = delayed.result()
  File "/cluster/home/t122995uhn/projects/MutDTA/.venv/lib/python3.10/site-packages/submitit/core/utils.py", line 133, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "/cluster/home/t122995uhn/projects/MutDTA/src/train_test/distributed.py", line 115, in dtrain
    logs = train(model=model, train_loader=loaders['train'], val_loader=loaders['val'],
  File "/cluster/home/t122995uhn/projects/MutDTA/src/train_test/training.py", line 105, in train
    loss.backward()
  File "/cluster/home/t122995uhn/projects/MutDTA/.venv/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/cluster/home/t122995uhn/projects/MutDTA/.venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [10.30.3.99]:43662
slurmstepd: error: Detected 1 oom_kill event in StepId=10554787.batch. Some of the step tasks have been OOM Killed.
```
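Since the `oom_kill` comes from slurmstepd (host RAM), not CUDA, the fix is to request more main memory from SLURM. A sketch, assuming the job goes through a standard sbatch script (the flag values are illustrative, not tuned for this cluster; the launch line mirrors submitit's own entrypoint from the traceback):

```bash
#!/bin/bash
#SBATCH --mem=64G           # main-memory (RAM) per node; this is what ran out, not GPU memory
#SBATCH --cpus-per-task=4   # note: each extra DataLoader worker also costs RAM
#SBATCH --gres=gpu:1

# submitit jobs re-enter through this module (see traceback); folder path is a placeholder
srun python -u -m submitit.core._submit "$SUBMITIT_FOLDER"
```

If jobs are launched programmatically through submitit's `AutoExecutor`, the equivalent is `executor.update_parameters(mem_gb=64, cpus_per_task=4)` (these are real submitit parameter names; the values are again illustrative).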
Same figures as https://github.com/jyaacoub/MutDTA/issues/50#issue-1970773966 but for unique proteins in the dataset:
```python
from src.data_prep.datasets import PDBbindDataset
from src.utils import config as cfg
import pandas as pd
import matplotlib.pyplot as plt

DATASETS = {
    'davis': f'{cfg.DATA_ROOT}/DavisKibaDataset/davis/nomsa_binary_original_binary/full/XY.csv',
    'kiba': f'{cfg.DATA_ROOT}/DavisKibaDataset/kiba/nomsa_binary_original_binary/full/XY.csv',
    # 'pdbbind': f'{cfg.DATA_ROOT}/PDBbindDataset/nomsa_binary_original_binary/full/XY.csv',
}

fig, axs = plt.subplots(1, len(DATASETS), figsize=(5*len(DATASETS) + len(DATASETS), 5))
for i, dataset in enumerate(DATASETS.keys()):
    d0 = pd.read_csv(DATASETS[dataset], index_col=0)
    d0['len'] = d0.prot_seq.str.len()

    # only get unique proteins
    d0 = d0.drop_duplicates(subset='prot_id')

    ax = axs[i]
    n, bins, patches = ax.hist(d0['len'], bins=20)
    print(ax.get_ylim()[0])
    for bin_index, bin_edge in enumerate(bins[:-1]):
        ax.text(bin_edge + (bins[1] - bins[0])/2, 0, str(bin_index),
                ha='center', va='top', color='red', rotation=45)

    # Set labels and title
    ax.set(xlabel='Protein Sequence length', ylabel='Frequency',
           title=f'Histogram of Protein Sequence length ({dataset})')

    # Add counts to each bin
    for count, x, patch in zip(n, bins, patches):
        ax.text(x + 0.5, count, str(int(count)), ha='center', va='bottom')

    cutoff = 370
    abv = d0[d0['len'] > cutoff]
    print(f"Eliminating codes above {cutoff} length would reduce the dataset by: {len(abv)}")
    print(f"\t - Eliminates {len(abv.prot_id.unique())} unique proteins")
```
The solution is to remove sequences longer than 1500 residues:
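A minimal sketch of that filter, assuming the same `XY.csv` layout with `prot_seq`/`prot_id` columns as in the code above (the mini-DataFrame here is hypothetical stand-in data, not the real dataset):

```python
import pandas as pd

# Hypothetical mini-frame standing in for XY.csv (column names match the dataset code).
df = pd.DataFrame({
    'prot_id':  ['P1', 'P2', 'P3'],
    'prot_seq': ['A' * 100, 'G' * 1600, 'K' * 1500],
})
df['len'] = df['prot_seq'].str.len()

MAX_SEQ_LEN = 1500
kept = df[df['len'] <= MAX_SEQ_LEN]   # drop sequences longer than 1500 residues
print(kept['prot_id'].tolist())       # → ['P1', 'P3']
```

The same mask applied to the real `XY.csv` before dataset construction keeps the longest sequences out of memory entirely, rather than relying on larger RAM requests.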
Histograms (figure images for PDBbind, Davis, and Kiba; not preserved here):

- PDBbind
- Davis
- Kiba

Code to replicate