john-bradshaw / synthesis-dags

Code for our paper "Barking up the right tree: an approach to search over molecule synthesis DAGs"
GNU General Public License v3.0

Molecular Transformer failed to run #4

Open yuxuanou623 opened 5 months ago

yuxuanou623 commented 5 months ago

Hi John, thanks for updating the Readme. I really appreciate it!

I have encountered a new issue and I'm wondering whether you could help me with it. When I run functional_tests/reaction_predictor_server_checker.py, an error occurs. The return value of the request is:

  {'error': 'Runtime Error: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch_1533672544752/work/aten/src/THC/THCBlas.cu:249', 'status': 'error'}

The error is:

  Traceback (most recent call last):
    File "functional_tests/reaction_predictor_server_checker.py", line 48, in <module>
      main()
    File "functional_tests/reaction_predictor_server_checker.py", line 42, in main
      print(rp(reactants))
    File "/rds/user/yo279/hpc-work/project/synthesis-dags/syn_dags/model/reaction_predictors.py", line 55, in __call__
      self._run_list_of_reactant_sets(reactant_sets_new_needed)))
    File "/rds/user/yo279/hpc-work/project/synthesis-dags/syn_dags/model/reaction_predictors.py", line 134, in _run_list_of_reactant_sets
      raise ex
    File "/rds/user/yo279/hpc-work/project/synthesis-dags/syn_dags/model/reaction_predictors.py", line 131, in _run_list_of_reactant_sets
      op_back = return_list[0]
  KeyError: 0

Besides, the training code for synthesis-dags runs successfully, while the sampling code also hits the error above. Thanks for your assistance!

john-bradshaw commented 5 months ago

Hi @yuxuanou623!

From your message it seems that the Molecular Transformer server is not running correctly. (As the ground-truth products are available during training, the server is only used at inference.)

Using your Python environment, are you able to run the Molecular Transformer outside of serving mode, e.g., in training or batch translation mode (using the train.py or translate.py script)?
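For reference, the batch translation call would look roughly like the following (the paths are placeholders for your own checkpoint and tokenised source file, and the exact flags can differ between OpenNMT-py versions, so check python translate.py -h in your checkout):

  # Hypothetical batch-translation run with the Molecular Transformer's OpenNMT-py code;
  # <model>.pt and <src>.txt are placeholders, and -gpu 0 selects the first GPU.
  python translate.py -model <model>.pt -src <src>.txt -output predictions.txt -gpu 0

If that runs on the GPU without the cublas error, then the problem is more likely in the serving path.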

yuxuanou623 commented 5 months ago

Hi John, thanks for the response. I can run translate.py successfully. The results are properly generated.

john-bradshaw commented 5 months ago

That's good, but it seems strange that serving does not work!

What do the server logs say (from where you ran server.py) when you run functional_tests/reaction_predictor_server_checker.py?

john-bradshaw commented 5 months ago

And just to confirm: when you ran translate.py, was this on exactly the same machine (same GPU, same CUDA, same Python environment) as when you run server.py?

yuxuanou623 commented 5 months ago

Yes, I ran both the server and translate.py on the same machine and within the same Python environment. Here is my log, but I can't identify what's wrong:

  nohup: ignoring input
  [2024-04-29 16:43:13,754 INFO] Loading model 0
  Pre-loading model 0

john-bradshaw commented 5 months ago

So it's probably worth double-checking that the error really is in the server code (based on the line numbers in the traceback you shared in your first message, it seems you may have edited reaction_predictors.py?). You can do this by setting up the server and then running (in another shell):

  curl --header "Content-Type: application/json" \
  --request POST \
  --data '[{"src": "C C O C ( = O ) C 1 C C N ( C ( = O ) O C ( C ) ( C ) C ) C C 1 . C C ( C ) ( C ) O C ( = O ) N 1 C C N C C 1", "id": 0}, {"src": "C [S-] . [Mg+] c 1 c c c ( Cl ) c c 1", "id": 0}]' \
  http://127.0.0.1:5000/translator/translate

(Note that you'll need to change 127.0.0.1 to the IP address you're using for the server, which based on your message above seems to be 10.43.74.41.)

When working, this should return:

[[{"n_best":1,"pred_score":-0.004589080810546875,"src":"C C O C ( = O ) C 1 C C N ( C ( = O ) O C ( C ) ( C ) C ) C C 1 . C C ( C ) ( C ) O C ( = O ) N 1 C C N C C 1","tgt":"C C ( C ) ( C ) O C ( = O ) N 1 C C N ( C ( = O ) C 2 C C N ( C ( = O ) O C ( C ) ( C ) C ) C C 2 ) C C 1"},{"n_best":1,"pred_score":-0.0002288818359375,"src":"C [S-] . [Mg+] c 1 c c c ( Cl ) c c 1","tgt":"C S c 1 c c c ( Cl ) c c 1"}]]

If this still doesn't work, it's maybe worth searching the Molecular Transformer/OpenNMT-py GitHub repos to see if anyone else has had the same error from using this code. There is a similar issue on the PyTorch GitHub suggesting that the problem might be from trying to run an incompatible version of CUDA with the GPU you are using. What kind of GPU are you trying to run this on?
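One quick sanity check (a minimal sketch; run it inside the same Python environment you use for server.py) is to see what PyTorch itself reports:

  # Prints the PyTorch version, the CUDA version it was built against,
  # and whether this build can see a GPU at all.
  python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"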

yuxuanou623 commented 5 months ago

Hi! Thanks for the response. I am using an NVIDIA A100-SXM4-80GB with CUDA Version 12.2. I used this command and the error is still the same:

  {"error":"Runtime Error: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch_1533672544752/work/aten/src/THC/THCBlas.cu:249","status":"error"}

I will check the issue on the PyTorch GitHub page.

yuxuanou623 commented 5 months ago

I found out that I didn't actually run the Molecular Transformer on GPU using translate.py; it was running on CPU. May I ask which CUDA, Python, and PyTorch versions you used when you last ran the code? Thanks!

john-bradshaw commented 5 months ago

Thanks for clarifying!

Last time I ran the Molecular Transformer I used Python 3.6, pytorch=0.4.1, and cuda92=1.0 (you can see the full environment here), and ran it on a Tesla V100 GPU.

yuxuanou623 commented 5 months ago

Hi John, I tried to build the conda environment using conda env create -f conda_mtransformer_gpu.yml, but there are several package conflicts. Could you help me with this? Thanks so much.

(base) [yo279@login-p-2 synthesis-dags]$ conda env create -f conda_mtransformer_gpu.yml
Collecting package metadata (repodata.json): done
Solving environment: \ Found conflicts! Looking for incompatible packages. This can take several minutes. Press CTRL-C to abort. failed
Solving environment: | Found conflicts! Looking for incompatible packages. This can take several minutes. Press CTRL-C to abort. failed

UnsatisfiableError: The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

Package _libgcc_mutex conflicts for: pytorch=0.4.1 -> libgcc-ng[version='>=5.4.0'] -> _libgcc_mutex[version='|0.1',build=main] numpy=1.18.1 -> libgcc-ng[version='>=7.3.0'] -> _libgcc_mutex[version='|0.1',build=main] python=3.6 -> libgcc-ng[version='>=7.5.0'] -> _libgcc_mutex[version='|0.1',build=main] sqlite=3.31 -> libgcc-ng[version='>=7.3.0'] -> _libgcc_mutex[version='|0.1',build=main]

Package sqlite conflicts for: sqlite=3.31 pytorch=0.4.1 -> python[version='>=3.7,<3.8.0a0'] -> sqlite[version='>=3.20.1,<4.0a0|>=3.22.0,<4.0a0|>=3.23.1,<4.0a0|>=3.24.0,<4.0a0|>=3.25.2,<4.0a0|>=3.25.3,<4.0a0|>=3.26.0,<4.0a0|>=3.27.2,<4.0a0|>=3.29.0,<4.0a0|>=3.30.1,<4.0a0|>=3.31.1,<4.0a0|>=3.33.0,<4.0a0|>=3.35.4,<4.0a0|>=3.36.0,<4.0a0|>=3.38.0,<4.0a0|>=3.39.3,<4.0a0|>=3.40.0,<4.0a0|>=3.40.1,<4.0a0|>=3.30.0,<4.0a0'] pip=20.0.2 -> python[version='>=3.6,<3.7.0a0'] -> sqlite[version='>=3.20.1,<4.0a0|>=3.22.0,<4.0a0|>=3.23.1,<4.0a0|>=3.24.0,<4.0a0|>=3.25.2,<4.0a0|>=3.26.0,<4.0a0|>=3.29.0,<4.0a0|>=3.30.1,<4.0a0|>=3.31.1,<4.0a0|>=3.33.0,<4.0a0|>=3.35.4,<4.0a0|>=3.41.2,<4.0a0|>=3.40.1,<4.0a0|>=3.40.0,<4.0a0|>=3.39.3,<4.0a0|>=3.38.0,<4.0a0|>=3.36.0,<4.0a0|>=3.32.3,<4.0a0|>=3.30.0,<4.0a0|>=3.27.2,<4.0a0|>=3.25.3,<4.0a0'] six=1.14 -> python[version='>=3.9,<3.10.0a0'] -> sqlite[version='>=3.20.1,<4.0a0|>=3.22.0,<4.0a0|>=3.23.1,<4.0a0|>=3.24.0,<4.0a0|>=3.25.2,<4.0a0|>=3.25.3,<4.0a0|>=3.26.0,<4.0a0|>=3.27.2,<4.0a0|>=3.29.0,<4.0a0|>=3.30.1,<4.0a0|>=3.31.1,<4.0a0|>=3.33.0,<4.0a0|>=3.35.4,<4.0a0|>=3.36.0,<4.0a0|>=3.38.0,<4.0a0|>=3.38.2,<4.0a0|>=3.38.3,<4.0a0|>=3.39.2,<4.0a0|>=3.39.3,<4.0a0|>=3.40.0,<4.0a0|>=3.40.1,<4.0a0|>=3.41.2,<4.0a0|>=3.32.3,<4.0a0|>=3.30.0,<4.0a0'] python=3.6 -> sqlite[version='>=3.20.1,<4.0a0|>=3.22.0,<4.0a0|>=3.23.1,<4.0a0|>=3.24.0,<4.0a0|>=3.25.2,<4.0a0|>=3.26.0,<4.0a0|>=3.29.0,<4.0a0|>=3.30.1,<4.0a0|>=3.31.1,<4.0a0|>=3.33.0,<4.0a0|>=3.35.4,<4.0a0']

Package python conflicts for: six=1.14 -> python[version='>=3.6,<3.7.0a0|>=3.7,<3.8.0a0|>=3.9,<3.10.0a0|>=3.8,<3.9.0a0'] pytorch=0.4.1 -> python[version='>=2.7,<2.8.0a0|>=3.5,<3.6.0a0|>=3.7,<3.8.0a0|>=3.6,<3.7.0a0'] pip=20.0.2 -> setuptools -> python[version='>=2.7,<2.8.0a0|>=3.10,<3.11.0a0|>=3.12,<3.13.0a0|>=3.9,<3.10.0a0|>=3.11,<3.12.0a0|>=3.5,<3.6.0a0'] pip=20.0.2 -> python[version='>=3.6,<3.7.0a0|>=3.8,<3.9.0a0|>=3.7,<3.8.0a0'] python=3.6 pytorch=0.4.1 -> cffi -> python[version='>=3.10,<3.11.0a0|>=3.12,<3.13.0a0|>=3.8,<3.9.0a0|>=3.9,<3.10.0a0|>=3.11,<3.12.0a0']

Package libedit conflicts for: python=3.6 -> sqlite[version='>=3.33.0,<4.0a0'] -> libedit[version='>=3.1.20170329,<3.2.0a0|>=3.1.20181209,<3.2.0a0|>=3.1.20191231,<3.2.0a0'] sqlite=3.31 -> libedit[version='>=3.1.20181209,<3.2.0a0']

Package ca-certificates conflicts for: pytorch=0.4.1 -> python[version='>=2.7,<2.8.0a0'] -> ca-certificates python=3.6 -> openssl[version='>=1.1.1k,<1.1.2a'] -> ca-certificates

Package numpy conflicts for: pytorch=0.4.1 -> numpy[version='>=1.11.3,<2.0a0'] numpy=1.18.1

Package wheel conflicts for: pip=20.0.2 -> wheel python=3.6 -> pip -> wheel

The following specifications were found to be incompatible with your system:

Your installed version is: 2.17

john-bradshaw commented 5 months ago

Hey @yuxuanou623!

Sorry you're having trouble using my provided environment!

It's a little hard for me to be fully sure what's going wrong here without access to your machine. I think it might be related to the fact that the pytorch channel no longer hosts version 0.4.1. There is now a version of pytorch 0.4.1 on the anaconda channel (pkgs/main), but it seems to be more limited in terms of which other packages it works with.

You could try:

  1. Adding conda-forge to the list of channels in the environment file I provided (this, along with being able to use pkgs/main through the defaults, seems to allow Conda to solve the environment on a machine I have access to).
  2. Working out how to install pytorch=0.4.1 using another method (e.g., pip), following the instructions here.
  3. Using the Docker image I provided to run the Molecular Transformer (you can expose the port that the Molecular Transformer runs on so you can send requests to it from outside Docker; see the sketch below).
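For option 3, the port mapping would look roughly like this (the image name is a placeholder for whatever you built or pulled, 5000 matches the port used in the curl example above, and --gpus all only works if you have the NVIDIA container toolkit set up):

  # Expose the Molecular Transformer server's port so the synthesis-dags code
  # running outside Docker can send requests to it.
  docker run --rm -it --gpus all -p 5000:5000 <mtransformer-image>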

AustinT commented 5 months ago

Hi @john-bradshaw, @yuxuanou623 is an MPhil student working with Miguel and me. We discussed this issue in our meeting, so I thought I would comment.

@yuxuanou623, PyTorch versions change very quickly and packages go out of date just as fast. CUDA and PyTorch tend to be backwards compatible, so I recommend trying to get the code to run in a more modern environment instead of installing old versions of packages.

A simple idea would be to remove the version restrictions in the environment file and see if that works. Try using a file like this:

name: dogae_py310
channels:
  - rdkit
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - python=3.10
  - pip
  - numpy
  - ignite
  - ipython
  - jsonschema
  - jupyter
  - jupyterlab
  - networkx
  - pytorch
  - rdkit
  - tqdm
  - matplotlib
  - pytest
  - scikit-learn
  - pip:
    - docopt
    - keras
    - fcd
    - h5py
    - guacamol
    - ipdb
    - multiset
    - tabulate
    - tensorboard
    - tensorflow
    - jug
    - lazy
    - git+https://github.com/PatWalters/rd_filters.git@451d5cf92ac630df11851bce2dde98609967e5b4

john-bradshaw commented 5 months ago

Hi @AustinT, thanks for commenting!

I'm actually not too sure whether running the Transformer code with a more recent version of PyTorch will work; their code was originally written for version 0.4.1, and I remember PyTorch 1.0 introduced a load of breaking changes.

I think your suggestion of removing the version restrictions for the other packages (e.g., numpy, sqlite, etc.) makes a lot of sense though, especially if Conda still cannot solve the environment even after adding conda-forge as a channel, as I suggested above.

Also note that this project requires the installation of two Python environments 🙈: the yml you edited above is for running most of the code in this repo, while conda_mtransformer_gpu.yml defines a separate environment for running the Molecular Transformer server.
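Concretely, that means creating each environment separately, roughly along these lines (the second file name is a placeholder; use the actual yml files in the repo):

  # Environment for the Molecular Transformer server
  conda env create -f conda_mtransformer_gpu.yml
  # Environment for the rest of the synthesis-dags code (the yml edited above);
  # <other-environment>.yml is a placeholder for that file's real name.
  conda env create -f <other-environment>.yml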