Update to install instructions & suggestion for generating dummy water sites

MKCarter commented 1 week ago

Hi,

Thanks for this code, it is very interesting.

To install and run I had to make a few modifications to the install instructions:

First, install torch and the torch extension packages, then the remaining requirements:

pip install torch
pip install torch_cluster==1.6.3 -f https://data.pyg.org/whl/torch-2.5.1+cu124.html
pip install torch_scatter==2.1.2 -f https://data.pyg.org/whl/torch-2.5.1+cu124.html
pip install -r requirements.txt
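
Since `pip install torch` is unpinned, the installed torch version may not match the `torch-2.5.1+cu124` wheel index used above. A minimal sketch of deriving the index URL from the installed version follows. It assumes the index keeps the `torch-<version>+<cuda tag>` pattern seen in the commands above and that a page exists for your version. `pyg_wheel_index` is a hypothetical helper, not part of SuperWater or PyG.

```python
def pyg_wheel_index(torch_version: str) -> str:
    """Build a PyG wheel index URL from a torch version string.

    torch_version is what torch.__version__ reports, e.g. '2.5.1+cu124'.
    The URL pattern is copied from the install commands in this thread;
    whether a page exists for a given version is an assumption.
    """
    base, _, local_tag = torch_version.partition("+")
    # Builds without a local tag are assumed to be CPU-only here.
    tag = local_tag or "cpu"
    return f"https://data.pyg.org/whl/torch-{base}+{tag}.html"


# Example with the version used in this thread.
print(pyg_wheel_index("2.5.1+cu124"))
```

With torch installed, calling `pyg_wheel_index(torch.__version__)` gives the value to pass to `pip install ... -f`.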

My requirements.txt looks like this:

biopandas
biopython
pandas
rdkit
torch_geometric
e3nn
spyrmsd
openbabel-wheel
tqdm
wandb
matplotlib
scikit-learn

I had to change rdkit-pypi to rdkit, because rdkit-pypi throws the following error on import:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/michael/DD_tools/SuperWater/validation_recall_precision.py", line 16, in <module>
    from datasets.pdbbind import PDBBind, NoiseTransform
  File "/home/michael/DD_tools/SuperWater/datasets/pdbbind.py", line 19, in <module>
    from datasets.process_mols import read_molecule, get_rec_graph, generate_conformer, \
  File "/home/michael/DD_tools/SuperWater/datasets/process_mols.py", line 12, in <module>
    from rdkit.Chem import AllChem, GetPeriodicTable, RemoveHs
  File "/home/michael/miniconda3/envs/superwater/lib/python3.11/site-packages/rdkit/Chem/AllChem.py", line 29, in <module>
    from rdkit.Chem.rdMolAlign import *
AttributeError: _ARRAY_API not found

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.3 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

Switching to the rdkit package resolves this.
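
A quick way to spot this combination before hitting the traceback is a check along these lines. This is only a sketch: it assumes, based on the error message above, that the rdkit-pypi wheel was compiled against NumPy 1.x. The helper names `installed_version` and `rdkit_numpy_warning` are made up for illustration.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def rdkit_numpy_warning(rdkit_pypi_version, numpy_version):
    """Return a warning string if rdkit-pypi is installed alongside NumPy 2.x."""
    if rdkit_pypi_version is None or numpy_version is None:
        return None
    numpy_major = int(numpy_version.split(".")[0])
    if numpy_major >= 2:
        return (
            f"rdkit-pypi {rdkit_pypi_version} with NumPy {numpy_version}: "
            "likely to fail on import; try 'pip install rdkit' instead"
        )
    return None


# Prints None when the combination looks fine or a package is missing.
print(rdkit_numpy_warning(installed_version("rdkit-pypi"), installed_version("numpy")))
```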

Once the install was correct, I managed to run an example:

python -m validation_recall_precision \
    --original_model_dir workdir/all_atoms_score_model_res15_17092 \
    --confidence_dir workdir/confidence_model_17092_sigmoid_rr15 \
    --data_dir data/test_dataset \
    --ckpt best_model.pt \
    --all_atoms \
    --run_name evaluation_all_atoms \
    --cache_path data/cache_confidence \
    --split_test data/splits/test.txt \
    --inference_steps 20 \
    --samples_per_complex 1 \
    --batch_size 1 \
    --batch_size_preprocessing 1 \
    --esm_embeddings_path data/test_dataset_embeddings_output \
    --cache_creation_id 1 \
    --cache_ids_to_combine 1 \
    --prob_thresh 0.05 \
    --running_mode test \
    --rmsd_prediction \
    --save_pos
Random seed set as 42
esm_embeddings_path: data/test_dataset_embeddings_output
Processing complexes from [data/splits/test.txt] and saving it to [data/cache_allatoms/limit0_INDEXtest_maxLigSizeNone_H0_recRad15.0_recMax24_atomRad5_atomMax8_esmEmbeddings]
Loading 1 complexes.
loading complexes: 100%|█████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.04it/s]
cache path is  data/cache_confidence/model_all_atoms_score_model_res15_17092_split_test_limit_0
Running mode:  test
Add perturbation:  False
common t schedule [1.   0.95 0.9  0.85 0.8  0.75 0.7  0.65 0.6  0.55 0.5  0.45 0.4  0.35
 0.3  0.25 0.2  0.15 0.1  0.05]
water_number/residue_number ratio:  15
resampling steps:  1
total resampling ratio:  15
1it [00:10, 10.73s/it]
HAPPENING | Loading positions and rmsds from cache_id from the path: data/cache_confidence/model_all_atoms_score_model_res15_17092_split_test_limit_0/ligand_positions_1.pkl
Number of complex graphs:  1
Number of RMSDs and positions for the complex graphs:  1
Loading trained confidence model with 2504467 parameters
Starting testing...
  0%|                                                                            | 0/1 [00:00<?, ?it/s]centroids:  259
Saved centroids for 5CGC to inference_out/inferenced_pos_cap0.05/5CGC/5CGC_centroid.txt
Successfully saved PDB file to: inference_out/inferenced_pos_cap0.05/5CGC/5CGC_centroid.pdb

In terms of the dummy water positions, I used the output from a GalaxyWater-CNN calculation.

I imagine this could be integrated into SuperWater if producing initial water sites is an ongoing issue. At the moment I have it in a separate conda env. To install it, follow the instructions below:

git clone https://github.com/seoklab/GalaxyWater-CNN.git
cd GalaxyWater-CNN
git lfs install

conda create -n gwcnn python=3.9
conda activate gwcnn
pip install torch torchvision scipy

Then, to run it, use something like this:

python GWCNN_gpu.py input.pdb output

I also checked the memory requirements on my machine: it uses around 10 GB of GPU memory (9532 MiB for the python process in the output below) for a protein with around 300 residues.

nvidia-smi
Tue Nov 26 22:49:11 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   45C    P0             36W /  150W |    9555MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      2584      G   /usr/lib/xorg/Xorg                              4MiB |
|    0   N/A  N/A      7714      C   python                                       9532MiB |
+-----------------------------------------------------------------------------------------+
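
To log the same numbers without reading the table by eye, something like the following could work. It is a rough sketch that shells out to `nvidia-smi` with its CSV query flags. `parse_gpu_memory` and `query_gpu_memory` are hypothetical helper names, and the parser assumes every GPU reports plain integers in MiB.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs, one GPU per line, into a list of tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for used and total memory per GPU (needs an NVIDIA driver)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_memory(result.stdout)


# Example using the figures from the table above, not a live query.
print(parse_gpu_memory("9555, 16376\n"))
```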

The predictions were very fast, with less than 1 minute of runtime.

I hope your exams go well, and thanks for sharing this code! Thanks, Mike

kuangxh9 commented 6 days ago

Hello Mike, thank you for your interest in and feedback on our project. The dependencies may vary depending on the running environment: the code was trained and tested in a Docker container using the image nvcr.io/nvidia/pytorch:23.12-py3. The installation error you encountered may be caused by a version mismatch. I will check this further.

As for the dummy water position, it is only used to make sure the protein structure pre-loads successfully, and it doesn't affect inference. For reference, I used the following minimal files:

<pdb_id>_water.pdb:

HETATM    1  O   HOH A   1      0.000   0.000   0.000  1.00 0.00           O  
TER       2      HOH A   1                                                     
END

<pdb_id>_water.mol2:

@<TRIPOS>MOLECULE  
../Superwater/case_study/5F1K/5F1K_water.pdb  
 1 0 0 0 0  
SMALL  
GASTEIGER  

@<TRIPOS>ATOM  
      1  O         0.0000    0.0000    0.0000  O.3   1    HOH1       0.0000  

@<TRIPOS>BOND
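
As a rough sketch, the two placeholder files above could be generated for any PDB ID with a small script like this. The record lines are copied from the examples above. `dummy_water_texts` and `write_dummy_water` are hypothetical names, not part of SuperWater, and using the PDB file name as the mol2 molecule name and leaving the output directory to the caller are both assumptions.

```python
from pathlib import Path

# Record lines copied from the example files above (trailing spaces dropped).
PDB_TEMPLATE = (
    "HETATM    1  O   HOH A   1      0.000   0.000   0.000  1.00 0.00           O\n"
    "TER       2      HOH A   1\n"
    "END\n"
)

MOL2_TEMPLATE = (
    "@<TRIPOS>MOLECULE\n"
    "{name}\n"
    " 1 0 0 0 0\n"
    "SMALL\n"
    "GASTEIGER\n"
    "\n"
    "@<TRIPOS>ATOM\n"
    "      1  O         0.0000    0.0000    0.0000  O.3   1    HOH1       0.0000\n"
    "\n"
    "@<TRIPOS>BOND\n"
)


def dummy_water_texts(pdb_id):
    """Return (pdb_text, mol2_text) for a single water oxygen at the origin."""
    # The molecule name line is free text; the PDB file name is used here.
    mol2_text = MOL2_TEMPLATE.format(name=f"{pdb_id}_water.pdb")
    return PDB_TEMPLATE, mol2_text


def write_dummy_water(pdb_id, out_dir):
    """Write <pdb_id>_water.pdb and <pdb_id>_water.mol2 into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pdb_text, mol2_text = dummy_water_texts(pdb_id)
    pdb_path = out / f"{pdb_id}_water.pdb"
    mol2_path = out / f"{pdb_id}_water.mol2"
    pdb_path.write_text(pdb_text)
    mol2_path.write_text(mol2_text)
    return pdb_path, mol2_path
```

For example, `write_dummy_water("5CGC", "data/test_dataset/5CGC")` would write both files, assuming that is where the dataset loader expects them.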

Thank you again for your interest. Please let me know if you have any further questions.

MKCarter commented 6 days ago

I see, thanks for the heads up on the water files. I have re-run with your suggestion and it doesn't affect the output, which is good to know for future runs.

Yeah, for me, pip install rdkit-pypi installs rdkit-pypi 2022.9.5, which was likely built against an old NumPy version (1.x, judging by the error message). Reinstalling with pip install rdkit installs rdkit 2024.3.6, which seems to work just fine. Thanks, Mike

kuangxh9 / SuperWater

Update to install instructions & suggestion for generating dummy water sites #3