SMIRKY scoring problem caused by repeat molecules

bannanc commented 7 years ago

Last week I noticed a problem with scoring smirky simulations on MiniDrugBank.

For example, in MiniDrugBank there are 363 molecules with 15,564 atoms If I type with smirnoff99Frosst, there are 3,354 atoms that match n15 ([*:1]).

Here are the scores for the first iteration of a MiniDrugBank simulation with

PROPOSED:
VdW type matches:
0: [*:1]                                                         matches                                                      n15: [#6:1]:     3291 VdW        types matched
3291 / 15259 total VdWs match (21.568 %)
INDEX          VDWS  MOLECULES   TYPE NAME: SMIRKS                                  REF TYPE: SMIRKS                                   FRACTION OF REF TYPED MOLECULES MATCHED
    1 :      15259        355 | 0: [*:1]                                           n15: [#6:1]                                           3291 /    3354 ( 98.122%)
TOTAL :      15259        363 |                                                       3291 /    15259 match (21.568 %)

Some key things to notice:

355 Molecules that are matched with [*:1] which should have been 363 molecules!
There are 3,291 VdW matches to [*:1] which should have been 3,354 atoms since the generic VdW should match ALL the VdW types in ALL the molecules.

I did some investigating today, which included running a lot of the methods in smirky in a jupyter notebook to see where the numbers worked and when they didn't.

The conclusion: MiniDrugBank has repeated molecules! and SMIRKY uses smiles strings as keys in a dictionary to track how molecules are typed with the current list.

Per a discussion on FreeSolv, I know that OECreateIsoSmiStrings don't take into account stereochemistry so I tried replacing all occurrences of those with OEMolToSmiles so see what happened. With that switch, I get a key error which we already "fixed" with issue#223

I did a check outside of smirky with OEMolToSmiles and found it matched 360 of 363 molecules which means there repeating molecules in the set, even considering stereochemistry.

I've been thinking about fixes here,

Should we change MiniDrugBank again to remove these repeated molecules?
SMIRKY should probably do something else to track molecules, maybe use the index for the molecule as a key in the dictionaries instead?

bannanc commented 7 years ago

heads up @davidlmobley and @camizanette

davidlmobley commented 7 years ago

Seems like we (ugh) want MiniDrugBank not to have repeating molecules, @bannanc . I hate to have to change it again at this stage, but it's probably not a huge loss (unless I'm missing something) since none of our data is final yet...

Relating to this:

SMIRKY should probably do something else to track molecules, maybe use the index for the molecule as a key in the dictionaries instead?

Remind me what the key is now??

jchodera commented 7 years ago

This is only tangentially related, but MiniDrugBank seems so useful on its own that I wonder if it makes sense for it to graduate to its own versioned repo!

bannanc commented 7 years ago

This is only tangentially related, but MiniDrugBank seems so useful on its own that I wonder if it makes sense for it to graduate to its own versioned repo!

@jchodera I like this idea. It could be particularly useful if we want to remake MiniDrugBank when we make changes to smirnoff99Frosst since I haven't rebuilt it since the last time we updated the force field.

Remind me what the key is now??

@davidlmobley It since you can't label OEMolecules with data on more than one atom, smirky uses a dictionary with the form: { moleculeSMILES: { (tuple of indices): 'pid/typename'... } ... } To track "typed" molecules with current set of parameters, but the reference numbers are stored at the beginning in a dictionary with just the totals: {'pid': total_count...} where it includes the counts for the repeated molecules.

Seems like we (ugh) want MiniDrugBank not to have repeating molecules, @bannanc . I hate to have to change it again at this stage, but it's probably not a huge loss (unless I'm missing something) since none of our data is final yet...

Yes I think you're right, but if we're going to use SMILES as keys I think we also have to remove repeated molecules from the input OR we use something else as a key in that dictionary.

davidlmobley commented 7 years ago

@bannanc -- John's point on MiniDrugBank is actually a good one. Probably worth having when you have spare time. That way we can get it a DOI, etc. It could effectively serve as a "minipublication" -- I can advertise on Twitter, etc., once you have a suitable README.md in place, and we can get a DOI via Zenodo.

davidlmobley commented 7 years ago

(I can also post on eScholarship which is indexed by Google Scholar).

bannanc commented 7 years ago

John's point on MiniDrugBank is actually a good one. Probably worth having when you have spare time. That way we can get it a DOI, etc. It could effectively serve as a "minipublication" -- I can advertise on Twitter, etc., once you have a suitable README.md in place, and we can get a DOI via Zenodo.

@davidlmobley I can migrate it next week a couple of questions

Should I also move the filter script/ipynb (thats the only thing we have officially used it for so far).
Where should I put it MobleyGroup or open-forcefield-group?

jchodera commented 7 years ago

I'd move the filter scripts as well and set up integration testing. Tag us if you need help with the devtools stuff.

If you use releases + Zenodo, we can change the SMIRNOFF testing to pull down a specific version of MiniDrugBank just before testing.

jchodera commented 7 years ago

I can see a lot of other projects getting mileage out of MiniDrugBank too!

bannanc commented 7 years ago

Ok, so we're making the MiniDrugBank repo and removing duplicates there.

However, we didn't actually establish how to fix SMIRKY, I see two ways forward:

Check for repeating molecules inside smirky, then remove them.
Change dictionary keys to something besides SMILES strings (this would require substantially more work)

@davidlmobley and I discussed on slack and agree that the best solution for the smirky tool is to just check for repeating molecules. I think future move proposal tools should not index with SMILES strings.

bannanc commented 7 years ago

This was fixed by PR #254

openforcefield / smarty

SMIRKY scoring problem caused by repeat molecules #253