Closed bannanc closed 7 years ago
heads up @davidlmobley and @camizanette
Seems like we (ugh) want MiniDrugBank not to have repeating molecules, @bannanc . I hate to have to change it again at this stage, but it's probably not a huge loss (unless I'm missing something) since none of our data is final yet...
Relating to this:
SMIRKY should probably do something else to track molecules, maybe use the index for the molecule as a key in the dictionaries instead?
Remind me what the key is now??
This is only tangentially related, but MiniDrugBank seems so useful on its own that I wonder if it makes sense for it to graduate to its own versioned repo!
@jchodera I like this idea. It could be particularly useful if we want to remake MiniDrugBank when we make changes to smirnoff99Frosst since I haven't rebuilt it since the last time we updated the force field.
@davidlmobley Since you can't label OEMolecules with data on more than one atom, smirky tracks "typed" molecules under the current set of parameters with a dictionary of the form:
{ moleculeSMILES: { (tuple of indices): 'pid/typename'... } ... }
The reference numbers, however, are stored at the beginning in a dictionary with just the totals:
{'pid': total_count...}
and those totals include the counts for the repeated molecules.
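A minimal sketch of why this layout miscounts, assuming plain strings stand in for real SMILES and a single hypothetical parameter id `n1`. The names `type_atoms`, `typed`, and `reference_totals` are illustrative only and are not taken from smirky.

```python
# Hypothetical input: three molecules, two of which share a SMILES string.
molecules = ["CCO", "CC", "CCO"]


def type_atoms(smiles):
    """Stand-in for typing: label one 'atom' per character with 'n1'."""
    return {(i,): "n1" for i in range(len(smiles))}


typed = {}
reference_totals = {"n1": 0}
for smi in molecules:
    labels = type_atoms(smi)
    reference_totals["n1"] += len(labels)  # counts every input molecule
    typed[smi] = labels                    # a repeated SMILES reuses the same key

tracked_molecules = len(typed)
tracked_atoms = sum(len(labels) for labels in typed.values())
print(tracked_molecules, tracked_atoms, reference_totals["n1"])
```

The reference total counts all three inputs, while the SMILES-keyed dictionary only holds two entries, so the two numbers can never agree when the input has repeats.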
> Seems like we (ugh) want MiniDrugBank not to have repeating molecules, @bannanc. I hate to have to change it again at this stage, but it's probably not a huge loss (unless I'm missing something) since none of our data is final yet...
Yes I think you're right, but if we're going to use SMILES as keys I think we also have to remove repeated molecules from the input OR we use something else as a key in that dictionary.
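As a rough illustration of the "something else as a key" option, here is a sketch that keys on the molecule's position in the input list and keeps the SMILES as data. It assumes placeholder strings instead of real molecules, and `type_atoms` and `typed_by_index` are hypothetical names, not smirky's.

```python
# Hypothetical input with a repeated molecule.
molecules = ["CCO", "CC", "CCO"]


def type_atoms(smiles):
    """Stand-in for typing: label one 'atom' per character with 'n1'."""
    return {(i,): "n1" for i in range(len(smiles))}


# Key on the input index so repeated molecules keep separate entries.
typed_by_index = {
    idx: {"smiles": smi, "labels": type_atoms(smi)}
    for idx, smi in enumerate(molecules)
}

tracked_molecules = len(typed_by_index)
tracked_atoms = sum(len(entry["labels"]) for entry in typed_by_index.values())
print(tracked_molecules, tracked_atoms)
```

With index keys the tracked counts match the input even when molecules repeat; the trade-off is that the key no longer identifies the molecule by itself, so the SMILES has to be carried along in the value.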
@bannanc -- John's point on MiniDrugBank is actually a good one. Probably worth having when you have spare time. That way we can get it a DOI, etc. It could effectively serve as a "minipublication" -- I can advertise on Twitter, etc., once you have a suitable README.md in place, and we can get a DOI via Zenodo.
(I can also post on eScholarship which is indexed by Google Scholar).
@davidlmobley I can migrate it next week. A couple of questions:
Should I also move the filter script/ipynb? (That's the only thing we have officially used it for so far.)
Where should I put it: MobleyGroup or open-forcefield-group?
I'd move the filter scripts as well and set up integration testing. Tag us if you need help with the devtools stuff.
If you use releases + Zenodo, we can change the SMIRNOFF testing to pull down a specific version of MiniDrugBank just before testing.
I can see a lot of other projects getting mileage out of MiniDrugBank too!
Ok, so we're making the MiniDrugBank repo and removing duplicates there.
However, we didn't actually establish how to fix SMIRKY. I see two ways forward:
@davidlmobley and I discussed this on Slack and agree that the best solution for the smirky tool is to just check for repeating molecules. I think future move proposal tools should not index with SMILES strings.
This was fixed by PR #254
Last week I noticed a problem with scoring smirky simulations on MiniDrugBank.
For example, in MiniDrugBank there are 363 molecules with 15,564 atoms. If I type with smirnoff99Frosst, there are 3,354 atoms that match n15 ([*:1]).
Here are the scores for the first iteration of a MiniDrugBank simulation with
Some key things to notice:
[*:1] which should have been 363 molecules!
[*:1] which should have been 3,354 atoms, since the generic VdW should match ALL the VdW types in ALL the molecules.
I did some investigating today, which included running a lot of the methods in smirky in a Jupyter notebook to see where the numbers worked and where they didn't.
The conclusion: MiniDrugBank has repeated molecules, and SMIRKY uses SMILES strings as keys in a dictionary to track how molecules are typed with the current list.
Per a discussion on FreeSolv, I know that OECreateIsoSmiStrings don't take stereochemistry into account, so I tried replacing all occurrences of those with OEMolToSmiles to see what happened. With that switch, I get a key error, which we already "fixed" with issue #223.
I did a check outside of smirky with OEMolToSmiles and found it matched 360 of 363 molecules, which means there are repeating molecules in the set, even considering stereochemistry.
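A check along these lines can be sketched in plain Python. This assumes the SMILES strings have already been generated elsewhere (for example with OEMolToSmiles, as above); the strings below are placeholders and `find_repeats` is a hypothetical helper, not part of smirky.

```python
from collections import Counter


def find_repeats(smiles_list):
    """Return {smiles: count} for every SMILES that occurs more than once."""
    counts = Counter(smiles_list)
    return {smi: n for smi, n in counts.items() if n > 1}


# Placeholder SMILES standing in for the molecule set.
smiles_list = ["CCO", "CC", "CCO", "c1ccccc1"]

repeats = find_repeats(smiles_list)
unique_count = len(set(smiles_list))
print(f"{unique_count} unique of {len(smiles_list)} molecules; repeats: {repeats}")
```

If the unique count is lower than the number of input molecules, any dictionary keyed on those SMILES will silently drop entries.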
I've been thinking about fixes here,