rdkit / mmpdb

A package to identify matched molecular pairs and use them to predict property changes.

Support for double-cut linkers with 0 atoms #64

Open baoilleach opened 1 month ago

baoilleach commented 1 month ago

I would like these two molecules to be identified as double-cut matched pairs (e.g. with exocyclic fragmentation): [image]

The larger molecule has [*:1]C[*:2] as the linker, which is fine:

$ mmpdb smifrag "c1ccccc1OC2CCCNC2" --cut-smarts "exocyclic"
                   |------------  variable  ------------|       |-----------------------  constant  ---------------------
#cuts | enum.label | #heavies | symm.class | smiles     | order | #heavies | symm.class | smiles              | with-H
------+------------+----------+------------+------------+-------+----------+------------+---------------------+----------
  1   |     N      |    6     |      1     | *C1CCCNC1  |    0  |     7    |      1     | *Oc1ccccc1          | Oc1ccccc1
  1   |     N      |    7     |      1     | *Oc1ccccc1 |    0  |     6    |      1     | *C1CCCNC1           | C1CCNCC1
  2   |     N      |    1     |     11     | *O*        |   01  |    12    |     12     | *C1CCCNC1.*c1ccccc1 | -
  1   |     N      |    7     |      1     | *OC1CCCNC1 |    0  |     6    |      1     | *c1ccccc1           | c1ccccc1
  1   |     N      |    6     |      1     | *c1ccccc1  |    0  |     7    |      1     | *OC1CCCNC1          | OC1CCCNC1

The smaller molecule should have [*:1][*:2] as the linker, but nothing is found:

$ mmpdb smifrag "c1ccccc1C2CCCNC2" --cut-smarts "exocyclic"
                   |------------  variable  -----------|       |-----------------  constant  ----------------
#cuts | enum.label | #heavies | symm.class | smiles    | order | #heavies | symm.class | smiles    | with-H
------+------------+----------+------------+-----------+-------+----------+------------+-----------+---------
  1   |     N      |    6     |     1      | *C1CCCNC1 |   0   |    6     |     1      | *c1ccccc1 | c1ccccc1
  1   |     N      |    6     |     1      | *c1ccccc1 |   0   |    6     |     1      | *C1CCCNC1 | C1CCNCC1

I'm hoping there's a secret minsize somewhere I can change from 1 to 0, or is support for this likely to require much greater changes?

adalke commented 1 month ago

On Oct 17, 2024, at 19:16, baoilleach wrote:

I would like these two molecules to be identified as double-cut matched pairs (e.g. with exocyclic fragmentation) ... I'm hoping there's a secret minsize somewhere I can change from 1 to 0, or is support for this likely to require much greater changes?

What a neat case I never considered! No, there's nothing like that option in mmpdb.

I think the quickest way to get something to work is to process the fragdb directly, find all of the 1-cuts, and synthesize the corresponding 2-cut versions. You'll have to ensure the constant_smiles is canonical, which can be approximated by checking if they are different, then testing if X.Y or Y.X is in the database already, and if not picking a preferred order.

This sort of local canonicalization wouldn't work to merge multiple fragdbs, nor as a quick fix to fragment_algorithm.py, but it might get you what you want.
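
A minimal sketch of the approach described above, in plain Python. It does not touch the real fragdb schema, which isn't shown in this thread: `OneCut`, `TwoCut`, `pick_constant`, `synthesize_two_cuts` and `EMPTY_LINKER` are hypothetical names, the 1-cut records are assumed to have already been read out as (variable, constant) SMILES pairs, and the "preferred order" fallback is assumed to be a plain string sort. Treat it as an illustration of the local canonicalization idea, not as mmpdb code.

```python
from typing import Iterable, List, NamedTuple, Set

# Assumed representation of a zero-atom linker, as written in the issue.
EMPTY_LINKER = "[*:1][*:2]"


class OneCut(NamedTuple):
    """Hypothetical record for one row of 1-cut fragmentation output."""
    variable_smiles: str
    constant_smiles: str


class TwoCut(NamedTuple):
    """Hypothetical record for a synthesized 2-cut fragmentation."""
    variable_smiles: str
    constant_smiles: str


def pick_constant(x: str, y: str, known: Set[str]) -> str:
    """Choose between "X.Y" and "Y.X" using only local information.

    If the two parts are identical the order does not matter. Otherwise
    reuse whichever order is already known, and fall back to a plain
    string sort (an assumed "preferred order") if neither is known.
    """
    if x == y:
        return f"{x}.{y}"
    xy, yx = f"{x}.{y}", f"{y}.{x}"
    if xy in known:
        return xy
    if yx in known:
        return yx
    first, second = sorted((x, y))
    return f"{first}.{second}"


def synthesize_two_cuts(one_cuts: Iterable[OneCut],
                        known_constants: Iterable[str]) -> List[TwoCut]:
    """Turn each distinct 1-cut into a 2-cut with an empty linker."""
    known = set(known_constants)
    seen = set()
    results = []
    for frag in one_cuts:
        # Each single cut appears twice (A|B and B|A); handle it once.
        key = frozenset((frag.variable_smiles, frag.constant_smiles))
        if key in seen:
            continue
        seen.add(key)
        constant = pick_constant(frag.variable_smiles,
                                 frag.constant_smiles, known)
        known.add(constant)  # keep later choices consistent with this one
        results.append(TwoCut(EMPTY_LINKER, constant))
    return results


# The two 1-cut rows reported for c1ccccc1C2CCCNC2 in the issue.
one_cuts = [
    OneCut("*C1CCCNC1", "*c1ccccc1"),
    OneCut("*c1ccccc1", "*C1CCCNC1"),
]
# The 2-cut constant already produced for c1ccccc1OC2CCCNC2.
known = {"*C1CCCNC1.*c1ccccc1"}

two_cuts = synthesize_two_cuts(one_cuts, known)
print(two_cuts[0].constant_smiles)  # *C1CCCNC1.*c1ccccc1
```

Because the order is decided only against what this one database already contains, two fragdbs built separately could pick opposite orders for the same pair, which is why this would not work for merging.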

-- Andrew