rdkit / mmpdb

A package to identify matched molecular pairs and use them to predict property changes.

Failure with mmpdb fragment for some specific smiles #30

Open chengthefang opened 3 years ago

chengthefang commented 3 years ago

Hi all,

I am using mmpdb fragment to process a subset of the SureChEMBL database, and I found that mmpdb fragment fails on some specific SMILES. I wonder if we could add some error handling to deal with such problematic structures.

Here is an example, test.smi:

C[C@]12CCC3c4c5cc(O)cc4[C@@]4(CC[C@@]1(C4)C3CC5)[C@@H]2O SCHEMBL9251776
Oc1ccccc1 phenol
Oc1ccccc1O catechol
Oc1ccccc1N 2-aminophenol
Oc1ccccc1Cl 2-chlorophenol
Nc1ccccc1N o-phenylenediamine
Nc1cc(O)ccc1N amidol
Oc1cc(O)ccc1O hydroxyquinol
Nc1ccccc1 phenylamine
C1CCCC1N cyclopentanol

I ran "python mmpdb/mmpdb fragment test.smi -o test_data.fragments". It failed while parsing the first SMILES and did not skip it to continue with the rest. The error is shown below:

Failure: file 'test.smi', line 1, record #1: first line starts 'C[C@]12CCC3c4c5cc(O)cc4[C@@]4(CC[C@@]1(C ...'
Traceback (most recent call last):
  File "mmpdb/mmpdb", line 11, in <module>
    commandline.main()
  File "/mmpdb/mmpdblib/commandline.py", line 1054, in main
    parsed_args.command(parsed_args.subparser, parsed_args)
  File "/mmpdb/mmpdblib/commandline.py", line 181, in fragment_command
    do_fragment.fragment_command(parser, args)
  File "/mmpdb/mmpdblib/do_fragment.py", line 581, in fragment_command
    writer.write_records(records)
  File "/mmpdb/mmpdblib/fragment_io.py", line 404, in write_records
    for rec in fragment_records:
  File "/mmpdb/mmpdblib/do_fragment.py", line 475, in make_fragment_records
    fragments = result.get()
  File "anaconda2/lib/python2.7/multiprocessing/pool.py", line 572, in get
    raise self._value
ValueError: need more than 1 value to unpack
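As a rough illustration of the kind of per-record error handling I have in mind (this is not mmpdb's actual code), the sketch below reads .smi lines, calls a per-record function, and records failures instead of stopping. The names parse_smiles_line, fragment_all, and fragment_one are hypothetical; fragment_one stands in for whatever does the real fragmentation.

```python
def parse_smiles_line(line):
    """Split one .smi line into (smiles, identifier); identifier may be None."""
    fields = line.split(None, 1)
    if not fields:
        return None  # blank line
    smiles = fields[0]
    identifier = fields[1].strip() if len(fields) > 1 else None
    return smiles, identifier


def fragment_all(lines, fragment_one):
    """Apply fragment_one to every record, skipping records that raise.

    fragment_one is a hypothetical callable taking a SMILES string.
    Returns (results, failures) so that bad records can be reported later.
    """
    results, failures = [], []
    for lineno, line in enumerate(lines, 1):
        record = parse_smiles_line(line)
        if record is None:
            continue
        smiles, identifier = record
        try:
            results.append((identifier, fragment_one(smiles)))
        except Exception as err:
            # Keep enough context to find the offending record afterwards.
            failures.append((lineno, identifier, str(err)))
    return results, failures
```

With something like this, the first record above would end up in failures with its line number and identifier, and the remaining phenol-like records would still be processed.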

Appreciate any suggestions or ideas.

Thanks, Cheng

KramerChristian commented 3 years ago

Hi Cheng,

Thanks for pointing out this issue.

mmpdb does have functionality to skip erroneous SMILES, but this seems to be a different problem: the SMILES is complicated but chemically correct. My most likely explanation so far is an issue with RDKit's ring perception for bonds. I will run some further tests to make sure I am on the right track and, if I am right, file a bug report with RDKit to get the issue solved.

Will keep you posted as this continues.

Bests, Christian

chengthefang commented 3 years ago

Hi Christian,

Thank you so much for looking into this issue. I agree that it might have something to do with the complicated ring system.

Thanks, Cheng

PARODBE commented 1 year ago

Hi Christian,

I can't convert my .smi file to fragments because of a UTF-8 problem, but I don't understand why, since I specify the encoding in the code:

[image]

And the error:

[image]

Could you help me, please?

KramerChristian commented 1 year ago

Hi Pablo,

I no longer develop mmpdb myself. It is now in the hands of @adalke and Jerome Hert. Maybe they can comment?

Bests, Christian

adalke commented 1 year ago

For @chengthefang, I cannot reproduce the problem using mmpdb3, available from https://github.com/adalke/mmpdb . Perhaps some of the changes I made for version 3 resolve your issue?

For @PARODBE, your comment is not connected to this issue. Please open a new issue instead.

It doesn't appear your problem is connected to mmpdb. It appears to be a general RDKit question. At the very least, you don't describe how "cdk2.fragdb" is generated, or the step you did which generates that error message.

My guess is you're showing me how you exported the SDF to SMILES format, which you then converted to a "fragdb" using mmpdb v3.

Version 2 used a text format to store the fragmentations; version 3 switched to SQLite3. You cannot use text processing to read an SQLite3 file, because it is a binary format that includes non-UTF-8 byte sequences.
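One way to tell the two apart before trying to read a file as text is to check the header. An SQLite3 database file starts with the 16 bytes "SQLite format 3" followed by a NUL byte. The helper name looks_like_sqlite3 below is made up for this sketch, and the check only inspects the header; it does not validate the rest of the file.

```python
# The documented 16-byte header at the start of every SQLite3 database file.
SQLITE_MAGIC = b"SQLite format 3\x00"


def looks_like_sqlite3(path):
    """Return True if the file at path starts with the SQLite3 header."""
    # Open in binary mode so no text decoding is attempted.
    with open(path, "rb") as f:
        return f.read(len(SQLITE_MAGIC)) == SQLITE_MAGIC
```

If this returns True for something like "cdk2.fragdb", opening the file with a text encoding such as UTF-8 can be expected to fail, and a database library is the right tool instead.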

PARODBE commented 1 year ago

Thanks, @adalke! So, in what format are the saved SMILES provided?

adalke commented 1 year ago

It's an SQLite3 file. This is the format specified by the SQLite embedded relational database, and accessible from Python via the sqlite3 module.

The specific schema is at https://github.com/adalke/mmpdb/blob/v3-dev/mmpdblib/fragment_schema.sql .
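As a minimal, generic sketch of what "accessible via the sqlite3 module" means, the function below lists the table names in any SQLite3 file by querying the built-in sqlite_master catalog. It deliberately assumes nothing about the mmpdb schema; the function name list_tables is made up for this example.

```python
import sqlite3


def list_tables(path):
    """Return the names of all tables in the SQLite3 database at path."""
    # Note: sqlite3.connect() creates an empty database if path does not exist.
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Calling list_tables("cdk2.fragdb") should then show which tables exist, and those can be compared against the schema file linked above.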

Your question is not related to issue #30, so please do not continue asking questions in this thread. Also, I am not willing to provide additional support on how to use SQL or SQLite. There are many existing teaching resources for those topics.