jyaacoub / MutDTA

Improving the precision oncology pipeline by providing binding affinity perturbation predictions on a priori identified cancer driver genes.

Add cutoff for sequence length (helps with mem issues with ESM) #50

Closed: jyaacoub closed this issue 11 months ago

jyaacoub commented 11 months ago

The solution is to remove sequences longer than 1500 residues (a sketch of the actual filtering step follows the replication code below):

Histograms:

PDBbind

[Figure: histogram of protein sequence lengths (PDBbind)]

Eliminating codes above 1500 length would reduce the dataset by: 5
     - Eliminates 5 unique proteins

Davis

[Figure: histogram of protein sequence lengths (Davis)]

Eliminating codes above 1500 length would reduce the dataset by: 544
     - Eliminates 8 unique proteins

Kiba

[Figure: histogram of protein sequence lengths (KIBA)]

Eliminating codes above 1500 length would reduce the dataset by: 1452
     - Eliminates 2 unique proteins

Code to replicate

#%%
from src.data_processing.datasets import PDBbindDataset
from src.utils import config as cfg
import pandas as pd
import matplotlib.pyplot as plt

# d0 = pd.read_csv(f'{cfg.DATA_ROOT}/DavisKibaDataset/davis/nomsa_anm/full/XY.csv', index_col=0)
d0 = pd.read_csv(f'{cfg.DATA_ROOT}/DavisKibaDataset/kiba/nomsa_anm/full/XY.csv', index_col=0)
# d0 = pd.read_csv(f'{cfg.DATA_ROOT}/PDBbindDataset/nomsa_anm_original_binary/full/XY.csv', index_col=0)

d0['len'] = d0.prot_seq.str.len()  # sequence length of each entry's protein

# %%
n, bins, patches = plt.hist(d0['len'], bins=20)
# Set labels and title
plt.xlabel('Protein Sequence length')
plt.ylabel('Frequency')
plt.title('Histogram of Protein Sequence length (kiba)')

# Add counts to each bin
for count, x, patch in zip(n, bins, patches):
    plt.text(x + 0.5, count, str(int(count)), ha='center', va='bottom')

cutoff = 1500
print(f"Eliminating codes above {cutoff} length would reduce the dataset by: {len(d0[d0['len'] > cutoff])}")
print(f"\t - Eliminates {len(d0[d0['len'] > cutoff].index.unique())} unique proteins")
jyaacoub commented 11 months ago

Increasing the memory allocation from 10GB to 15GB resolved this issue.

So it turns out the memory issues I was having with 10554787_EDI_PDBbind_anm_0.0001_20 were due to insufficient CPU/main memory (RAM), not GPU memory.
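
For reference, a minimal sketch of how the larger allocation might be requested through submitit (the actual launcher script isn't shown in this issue, so the executor setup and the values other than mem_gb are assumptions):

import submitit

executor = submitit.AutoExecutor(folder="slurm_logs")  # hypothetical log folder
executor.update_parameters(
    mem_gb=15,          # bumped from 10GB to 15GB of CPU/main memory (not GPU memory)
    cpus_per_task=4,    # hypothetical value
    timeout_min=24*60,  # hypothetical value
)
# job = executor.submit(dtrain, ...)  # dtrain from src/train_test/distributed.py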

Traceback

slurmstepd: error: Detected 1 oom_kill event in StepId=10554787.0. Some of the step tasks have been OOM Killed.
srun: error: node99: task 3: Out Of Memory

submitit ERROR (2023-10-31 12:17:33,974) - Submitted job triggered an exception
Traceback (most recent call last):
  File "/cluster/tools/software/centos7/python/3.10.9/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/cluster/tools/software/centos7/python/3.10.9/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/cluster/home/t122995uhn/projects/MutDTA/.venv/lib/python3.10/site-packages/submitit/core/_submit.py", line 11, in <module>
    submitit_main()
  File "/cluster/home/t122995uhn/projects/MutDTA/.venv/lib/python3.10/site-packages/submitit/core/submission.py", line 72, in submitit_main
    process_job(args.folder)
  File "/cluster/home/t122995uhn/projects/MutDTA/.venv/lib/python3.10/site-packages/submitit/core/submission.py", line 65, in process_job
    raise error
  File "/cluster/home/t122995uhn/projects/MutDTA/.venv/lib/python3.10/site-packages/submitit/core/submission.py", line 54, in process_job
    result = delayed.result()
  File "/cluster/home/t122995uhn/projects/MutDTA/.venv/lib/python3.10/site-packages/submitit/core/utils.py", line 133, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "/cluster/home/t122995uhn/projects/MutDTA/src/train_test/distributed.py", line 115, in dtrain
    logs = train(model=model, train_loader=loaders['train'], val_loader=loaders['val'], 
  File "/cluster/home/t122995uhn/projects/MutDTA/src/train_test/training.py", line 105, in train
    loss.backward()
  File "/cluster/home/t122995uhn/projects/MutDTA/.venv/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/cluster/home/t122995uhn/projects/MutDTA/.venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [10.30.3.99]:43662
slurmstepd: error: Detected 1 oom_kill event in StepId=10554787.batch. Some of the step tasks have been OOM Killed.
jyaacoub commented 8 months ago

Same figures as https://github.com/jyaacoub/MutDTA/issues/50#issue-1970773966 but for unique proteins in the dataset:

[Figure: side-by-side histograms of protein sequence lengths for unique proteins (davis, kiba)]

Code

from src.data_prep.datasets import PDBbindDataset
from src.utils import config as cfg
import pandas as pd
import matplotlib.pyplot as plt

DATASETS = {
    'davis': f'{cfg.DATA_ROOT}/DavisKibaDataset/davis/nomsa_binary_original_binary/full/XY.csv',
    'kiba': f'{cfg.DATA_ROOT}/DavisKibaDataset/kiba/nomsa_binary_original_binary/full/XY.csv',
    # 'pdbbind': f'{cfg.DATA_ROOT}/PDBbindDataset/nomsa_binary_original_binary/full/XY.csv',
}

fig, axs = plt.subplots(1, len(DATASETS), figsize=(6*len(DATASETS), 5))  # ~6in per subplot

for i, dataset in enumerate(DATASETS.keys()):
    d0 = pd.read_csv(DATASETS[dataset], index_col=0)

    d0['len'] = d0.prot_seq.str.len()

    # only get unique proteins
    d0 = d0.drop_duplicates(subset='prot_id')

    ax = axs[i]
    n, bins, patches = ax.hist(d0['len'], bins=20)
    print(ax.get_ylim()[0])
    for bin_index, bin_edge in enumerate(bins[:-1]):
        ax.text(bin_edge + (bins[1] - bins[0])/2, 0, str(bin_index), 
                ha='center', va='top', color='red', rotation=45)

    # Set labels and title
    ax.set(xlabel='Protein Sequence length', ylabel='Frequency',
        title=f'Histogram of Protein Sequence length ({dataset})')

    # Add counts to each bin
    for count, x, patch in zip(n, bins, patches):
        ax.text(x + 0.5, count, str(int(count)), ha='center', va='bottom')

    cutoff = 370
    abv = d0[d0['len'] > cutoff]
    # note: d0 was deduplicated by prot_id above, so len(abv) already counts unique proteins
    print(f"Eliminating codes above {cutoff} length would reduce the dataset by: {len(abv)}")
    print(f"\t - Eliminates {len(abv.prot_id.unique())} unique proteins")