ocmadin opened this issue 4 years ago
Not sure the question of "what data points should we use" is dramatically different from "how should we divide them"... what am I missing?
How many data points can we afford, given likely compute resources, @SimonBoothroyd?
In terms of how to divide them, I would want to build a diverse training set that includes adequate representation of all the parameters we're fitting, in diverse chemical contexts. (I have some code we used for selecting Parsley benchmarking sets that does some clustering and could perhaps be adapted to help with this.) However, you also want to ensure the set doesn't dramatically overrepresent certain things -- e.g., one gripe I had with the initial attempt prior to Parsley (which was probably largely driven by data availability) is that it was overenriched in halogens, which could have introduced biases.
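(Not the actual code I mentioned, but a minimal sketch of the kind of clustering-based selection I have in mind, assuming RDKit Morgan fingerprints and Butina clustering; the fingerprint settings and the 0.6 distance cutoff are just placeholders.)

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina


def pick_diverse_subset(smiles_list, cutoff=0.6):
    """Cluster molecules by Tanimoto distance on Morgan fingerprints and
    return one representative (the cluster centroid) per cluster."""
    mols = [(s, Chem.MolFromSmiles(s)) for s in smiles_list]
    mols = [(s, m) for s, m in mols if m is not None]  # drop unparseable SMILES
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for _, m in mols]

    # Butina clustering expects the lower-triangle distance matrix as a flat list.
    distances = []
    for i in range(1, len(fps)):
        similarities = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
        distances.extend(1.0 - s for s in similarities)

    clusters = Butina.ClusterData(distances, len(fps), cutoff, isDistData=True)

    # The first index in each cluster is its centroid.
    return [mols[cluster[0]][0] for cluster in clusters]
```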
We might want to deal with this by checking parameter usage in a large drug-like set (e.g., Enamine? a cleaned DrugBank?) and then trying to ensure that the set we pick for training (and also testing/validation) does not dramatically overrepresent certain parameters relative to that.
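Something along these lines could do that check with the OpenFF toolkit's `label_molecules`; which force field and reference set we'd actually use, and how exactly we'd compare the distributions, are assumptions on my part.

```python
from collections import Counter

from openff.toolkit.topology import Molecule
from openff.toolkit.typing.engines.smirnoff import ForceField


def count_parameter_usage(smiles_list, forcefield_path="openff-1.0.0.offxml"):
    """Count how often each SMIRNOFF parameter is exercised by a set of molecules."""
    force_field = ForceField(forcefield_path)
    counts = Counter()

    for smiles in smiles_list:
        molecule = Molecule.from_smiles(smiles, allow_undefined_stereo=True)
        # label_molecules returns, per molecule, a dict of handler name ->
        # {atom index tuple: assigned parameter}.
        labels = force_field.label_molecules(molecule.to_topology())[0]

        for handler_name, assignments in labels.items():
            for parameter in assignments.values():
                counts[(handler_name, parameter.id)] += 1

    return counts


# Normalize these counts for the candidate training set and for the large
# drug-like reference set, then flag parameters whose relative frequency
# differs by more than some chosen factor.
```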
Pure Compound Data Selection: Our sources for pure data include ThermoML, the DIPPR free set, and (hopefully) the curated DIPPR subset that we are currently talking to them about. Broadly, which data points should we use from these sources, and how should we divide them into test/train/validation sets?
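For concreteness, one mechanical option for the splitting part (a sketch only, not a proposal) would be a grouped random split so that all data points for a given substance land in the same partition; the 60/20/20 fractions and the `smiles` column name are placeholders.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit


def split_by_substance(data: pd.DataFrame, seed: int = 42):
    """Split data points ~60/20/20 into train/validation/test, keeping every
    measurement of a given substance in a single partition so the held-out
    sets only contain unseen compounds."""
    # Hold out 20% of substances for the test set.
    outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    train_val_idx, test_idx = next(outer.split(data, groups=data["smiles"]))
    train_val, test = data.iloc[train_val_idx], data.iloc[test_idx]

    # Hold out 25% of the remaining substances (20% overall) for validation.
    inner = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=seed)
    train_idx, val_idx = next(inner.split(train_val, groups=train_val["smiles"]))

    return train_val.iloc[train_idx], train_val.iloc[val_idx], test
```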