rdkit / mmpdb

A package to identify matched molecular pairs and use them to predict property changes.
Other
214 stars 58 forks source link

How to get the chemical information for each rule environment with a given rule? #26

Closed chemcomp-Jen closed 3 years ago

chemcomp-Jen commented 3 years ago

Hi all,

I built a mmpDB using the example "test_data.smi" shown on the github page. Then, I run a query to grep all the rules from the schema as follows: (please correct me if I did something wrong with the query)

c = cursor.execute(
            "SELECT rule_environment.rule_id, from_smiles.smiles, from_smiles.num_heavies, to_smiles.smiles, to_smiles.num_heavies, "
            "          rule_environment.radius, "
            "          rule_environment_statistics.id, property_name_id, count, avg, std, kurtosis, skewness, min, q1, median, q3, max, paired_t, p_value "
            "  FROM rule, rule_environment, rule_environment_statistics, "
            "          rule_smiles as from_smiles, rule_smiles as to_smiles "
            " WHERE rule_environment.id = rule_environment_id "
            "   AND rule_environment_statistics.rule_environment_id = rule_environment_id "
            "   AND rule_environment.rule_id = rule.id "
            "   AND rule.from_smiles_id = from_smiles.id "
            "   AND rule.to_smiles_id = to_smiles.id ")

After that, I took a look at all rules. I found it is difficult to understand the rule environments for the same rule.

For example below, for the rule id 0, there are 11 environments with different rule-environment-id. But how could I get the local chemical information for each environment? That will help me understand the difference between those environments.

rule_id from_smiles from_smiles_nHeavies to_smiles to_smiles_nHeavies environ_radius rule_environ_id prop_id count avg
0 [*:1]c1ccccc1N 7 [*:1]c1ccccc1O 7 0 1 0 2 1
0 [*:1]c1ccccc1N 7 [*:1]c1ccccc1O 7 1 3 0 1 1
0 [*:1]c1ccccc1N 7 [*:1]c1ccccc1O 7 2 5 0 1 1
0 [*:1]c1ccccc1N 7 [*:1]c1ccccc1O 7 3 7 0 1 1
0 [*:1]c1ccccc1N 7 [*:1]c1ccccc1O 7 4 9 0 1 1
0 [*:1]c1ccccc1N 7 [*:1]c1ccccc1O 7 5 11 0 1 1
0 [*:1]c1ccccc1N 7 [*:1]c1ccccc1O 7 1 214 0 1 1
0 [*:1]c1ccccc1N 7 [*:1]c1ccccc1O 7 2 216 0 1 1
0 [*:1]c1ccccc1N 7 [*:1]c1ccccc1O 7 3 218 0 1 1
0 [*:1]c1ccccc1N 7 [*:1]c1ccccc1O 7 4 220 0 1 1
0 [*:1]c1ccccc1N 7 [*:1]c1ccccc1O 7 5 222 0 1 1

Thanks, Jen

KramerChristian commented 3 years ago

Dear Jen,

If I understand you correctly, you would like to see something like a SMILES or SMARTS for a given environment. This is currently not stored in the DB, so there is no direct way to access it.

A workaround could be that for a given rule_environment, you extract one sample pair and display this, together with the radius (as integer). If the from_smiles and to_smiles are known, the environment for a given radius can then be inferred from looking at the pair.

Please note that the environment is grouped via the Morgan fingerprint, rooted at the attachment atom. The Morgan Fingerprint for each atom amongst others features the number of neighboring atoms. Therefore, a radius 1 Fingerprint in reality is more a kind of radius 1.5 fingerprint. For this reason, if you wanted to depict the Morgan Fingerprint environment as a SMILES, this would get be rather complicated.

Does this answer your question?

chemcomp-Jen commented 3 years ago

Hi Christian,

Thanks for the reply. Yes, I would like to see the SMILES or SMARTS for a given environment. Good to know there is a workaround. I think it should work.

I will give a try to extract one example pair/or all pairs (if possible) from a given rule_environment. Would you mind give an example of how to do that? Is there any existing functions in the package I can use?

Thanks, Jen

KramerChristian commented 3 years ago

Hi Jen,

there is no existing function that does exactly what you ask. But if you are coding already anyway, and you know the rule_environment_id for which you want to see a pair, you could try something like this:

sql = """ SELECT pair.id, lhs_compound.clean_smiles, rhs_compound.clean_smiles, lhs_compound.public_id, rhs_compound.public_id FROM pair, compound as lhs_compound, compound as rhs_compound WHERE pair.rule_environment_id = ? AND pair.compound1_id = lhs_compound.id AND pair.compound2_id = rhs_compound.id LIMIT 1 """ c = cursor.execute(sql, (environment_id, ))

Disclaimer: The concept of the statement should work, but I have not tested it and my python has gotten a little rusty lately, you may need to fix some of the commas etc. Please let me know if this works.

chemcomp-Jen commented 3 years ago

Hi Christians,

Thank you so much for the details. I will code the query and try to grab the pair information. I will keep you posted how it works.

Best, Jen

chemcomp-Jen commented 3 years ago

Hi Christian, It works as expected!

Thanks, Jen.