open-reaction-database / ord-schema

Schema for the Open Reaction Database
https://open-reaction-database.org
Apache License 2.0

`add_rdkit` errors out for a particular dataset #672

Closed by qai222 1 year ago

qai222 commented 1 year ago

Describe the bug

`ord_schema.orm.database.add_rdkit` errors out for a particular dataset.

To Reproduce

from ord_schema.message_helpers import fetch_dataset
from ord_schema.orm.database import add_dataset, add_rdkit
from sqlalchemy import create_engine
from sqlalchemy.orm import Session

connection_string = "postgresql://127.0.0.1:5432/ord"

dataset_id = "ord_dataset-b195433d5c354ddfb6cde0d53c41910f"
dataset = fetch_dataset(dataset_id)
engine = create_engine(connection_string, future=True)

with Session(engine) as session:
    add_dataset(dataset, session)
    session.flush()
    session.commit()
# INFO 2023-04-14 19:55:36,833 database.py:70: Adding dataset ord_dataset-b195433d5c354ddfb6cde0d53c41910f
# INFO 2023-04-14 19:56:51,821 database.py:73: from_proto() took 74.98789548873901s
# INFO 2023-04-14 19:57:05,244 database.py:76: session.add() took 13.422787189483643s

with Session(engine) as session:
    add_rdkit(session)
    session.commit()
# INFO 2023-04-14 20:00:17,986 database.py:89: Populating RDKit reaction columns
# INFO 2023-04-14 20:00:38,573 database.py:98: Adding reaction took 20.587525844573975s
# INFO 2023-04-14 20:00:38,574 database.py:99: Populating RDKit mol columns
# Traceback (most recent call last):
#   File "/home/qai/local/miniconda3/envs/LLM_organic_synthesis__ord_data/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1964, in _exec_single_context
#     self.dialect.do_execute(
#   File "/home/qai/local/miniconda3/envs/LLM_organic_synthesis__ord_data/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 748, in do_execute
#     cursor.execute(statement, parameters)
# psycopg2.errors.InternalError_: molFromPickle: invalid value in pickle
# 
# 
# The above exception was the direct cause of the following exception:
# 
# Traceback (most recent call last):
#   File "/home/qai/workplace/LLM_organic_synthesis/setup_db.py", line 66, in <module>
#     ord_add_rdkit()
#   File "/home/qai/workplace/LLM_organic_synthesis/setup_db.py", line 44, in ord_add_rdkit
#     add_rdkit(session)
#   File "/home/qai/local/miniconda3/envs/LLM_organic_synthesis__ord_data/lib/python3.10/site-packages/ord_schema/orm/database.py", line 103, in add_rdkit
#     session.execute(
#   File "/home/qai/local/miniconda3/envs/LLM_organic_synthesis__ord_data/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 2229, in execute
#     return self._execute_internal(
#   File "/home/qai/local/miniconda3/envs/LLM_organic_synthesis__ord_data/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 2133, in _execute_internal
#     result = conn.execute(
#   File "/home/qai/local/miniconda3/envs/LLM_organic_synthesis__ord_data/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1414, in execute
#     return meth(
#   File "/home/qai/local/miniconda3/envs/LLM_organic_synthesis__ord_data/lib/python3.10/site-packages/sqlalchemy/sql/elements.py", line 486, in _execute_on_connection
#     return connection._execute_clauseelement(
#   File "/home/qai/local/miniconda3/envs/LLM_organic_synthesis__ord_data/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1638, in _execute_clauseelement
#     ret = self._execute_context(
#   File "/home/qai/local/miniconda3/envs/LLM_organic_synthesis__ord_data/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1842, in _execute_context
#     return self._exec_single_context(
#   File "/home/qai/local/miniconda3/envs/LLM_organic_synthesis__ord_data/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1983, in _exec_single_context
#     self._handle_dbapi_exception(
#   File "/home/qai/local/miniconda3/envs/LLM_organic_synthesis__ord_data/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 2326, in _handle_dbapi_exception
#     raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
#   File "/home/qai/local/miniconda3/envs/LLM_organic_synthesis__ord_data/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1964, in _exec_single_context
#     self.dialect.do_execute(
#   File "/home/qai/local/miniconda3/envs/LLM_organic_synthesis__ord_data/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 748, in do_execute
#     cursor.execute(statement, parameters)
# sqlalchemy.exc.InternalError: (psycopg2.errors.InternalError_) molFromPickle: invalid value in pickle
# 
# [SQL: UPDATE rdkit.mols SET mol=rdkit.mol_from_smiles(CAST(rdkit.mols.smiles AS cstring)) WHERE rdkit.mols.mol IS NULL]
# (Background on this error at: https://sqlalche.me/e/20/2j85)

skearnes commented 1 year ago

Thanks; I'll take a look at that dataset.

skearnes commented 1 year ago

Hi @qai222, looks like this is related to a known issue in the rdkit cartridge; I'm trying to come up with a workaround: https://github.com/rdkit/rdkit/discussions/4431#discussioncomment-5711515

skearnes commented 1 year ago

I've narrowed this down to ord-b9026cb387d2437ca7b0c276c5ec3713, specifically the reaction input `[O-]CC.[Ti+5].[O-]CC.[O-]CC.[O-]CC.[O-]CC`.
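
For anyone who needs to isolate a similar failure in another dataset: the thread does not say how this input was found, but one generic approach is to bisect the list of SMILES against a check that reports whether a batch fails. The sketch below is only an illustration. `find_failing_item` and `batch_fails` are hypothetical names, not part of ord-schema, and the lambda stands in for a real check such as running the cartridge update on a subset inside a transaction that is rolled back. It assumes a batch fails whenever it contains a bad entry.

```python
from typing import Callable, Optional, Sequence


def find_failing_item(
    items: Sequence[str],
    batch_fails: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Returns one item that makes batch_fails() true, or None if the full batch passes."""
    if not items or not batch_fails(items):
        return None
    candidates = list(items)
    while len(candidates) > 1:
        mid = len(candidates) // 2
        left = candidates[:mid]
        # If the left half fails on its own, a culprit is in it;
        # otherwise a culprit must be in the right half.
        candidates = left if batch_fails(left) else candidates[mid:]
    return candidates[0]


# Stand-in check (assumption): a batch "fails" if it contains the input from this issue.
BAD = "[O-]CC.[Ti+5].[O-]CC.[O-]CC.[O-]CC.[O-]CC"
smiles = ["CCO", "c1ccccc1", BAD, "CC(=O)O"]
culprit = find_failing_item(smiles, lambda batch: BAD in batch)
print(culprit)
```

With N inputs this needs roughly log2(N) batch checks instead of N single-row checks, which matters when each check is a database round trip.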

skearnes commented 1 year ago

Also worth noting that this works on AWS Aurora, where the rdkit extension version is 3.8 (rather than the 4.0 available from conda).

skearnes commented 2 months ago

Note that there is an explicit skip for this case here: https://github.com/open-reaction-database/ord-schema/blob/main/ord_schema/orm/database.py#L162
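
As a rough illustration of what such a skip can look like (a sketch only; the code at the linked line may be written differently), the failing UPDATE from the traceback can be given an extra `NOT IN` filter so the known-bad SMILES never reaches `mol_from_smiles`. `SKIP_SMILES` and `build_mol_update` are hypothetical names introduced here, not ord-schema APIs.

```python
from typing import Dict, Sequence, Tuple

# Assumption: the only input that needs skipping is the one identified in this issue.
SKIP_SMILES = ("[O-]CC.[Ti+5].[O-]CC.[O-]CC.[O-]CC.[O-]CC",)


def build_mol_update(skip: Sequence[str]) -> Tuple[str, Dict[str, str]]:
    """Builds the mol-population UPDATE with named bind parameters for skipped SMILES."""
    sql = (
        "UPDATE rdkit.mols "
        "SET mol=rdkit.mol_from_smiles(CAST(rdkit.mols.smiles AS cstring)) "
        "WHERE rdkit.mols.mol IS NULL"
    )
    # Bind the SMILES as parameters rather than interpolating them into the SQL string.
    params = {f"skip_{i}": value for i, value in enumerate(skip)}
    if params:
        placeholders = ", ".join(f":{name}" for name in params)
        sql += f" AND rdkit.mols.smiles NOT IN ({placeholders})"
    return sql, params


sql, params = build_mol_update(SKIP_SMILES)
print(sql)
print(params)
```

The `:name` placeholders match the style accepted by SQLAlchemy's `text()` construct, so the pair could be run as `session.execute(text(sql), params)`. Skipped rows keep a NULL `mol`, so they will not show up in cartridge-based searches.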