This PR makes the following changes to the data set selection algorithm:
An option has been added for choosing the desired number of data points of each property type which should exercise each vdW smirks pattern. The default for this option is 2, i.e. we would ideally have at least two data points per property type for each smirks pattern to be exercised.
Dielectric data is no longer included. It is unclear how much dielectric constants will improve the optimised parameters relative to the increased simulation cost. Further, the gradient of the dielectric constant with respect to the parameters being optimised appears to be significantly noisier than the gradients of densities and enthalpies of mixing, which could be detrimental to the optimisation procedure.
The set of allowed elements has been reduced to just those for which there is a reasonable amount of data, namely 'H', 'N', 'C', 'O', 'S', 'F', 'Cl', 'Br', 'I'. In particular, 'P' was removed owing to a lack of data.
Molecules with a net charge are now properly filtered out.
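The element and net-charge filters described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Atom` record and the `is_allowed` helper are hypothetical names, and molecules are reduced to lists of heavy atoms with formal charges purely for brevity.

```python
from dataclasses import dataclass

# Elements retained by this PR; 'P' is excluded for lack of data.
ALLOWED_ELEMENTS = {"H", "N", "C", "O", "S", "F", "Cl", "Br", "I"}


@dataclass(frozen=True)
class Atom:
    """Hypothetical minimal atom record (not the repo's own type)."""

    element: str
    formal_charge: int = 0


def is_allowed(atoms):
    """Return True if every element is allowed and the net charge is zero."""
    only_allowed_elements = all(atom.element in ALLOWED_ELEMENTS for atom in atoms)
    net_charge = sum(atom.formal_charge for atom in atoms)
    return only_allowed_elements and net_charge == 0


# Illustrative molecules (heavy atoms only, assumed for this sketch).
methanol = [Atom("C"), Atom("O")]
acetate = [Atom("C"), Atom("C"), Atom("O"), Atom("O", -1)]
phosphine = [Atom("P")]
# Charged atoms, but the net charge is zero, so this one is kept.
zwitterion = [Atom("N", +1), Atom("C"), Atom("C"), Atom("O"), Atom("O", -1)]

for name, molecule in [
    ("methanol", methanol),
    ("acetate", acetate),
    ("phosphine", phosphine),
    ("zwitterion", zwitterion),
]:
    print(f"{name}: {'kept' if is_allowed(molecule) else 'filtered out'}")
```

Note that the sketch filters on the net charge of the whole molecule, as the description states, rather than on the presence of any charged atom.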
Additionally, this PR:
Refactors the reporting methods into a separate utility, which now includes a table indicating how many data points exercise each smirks pattern.
Swaps the repo to use pint.
Prints out which smirks are 'sufficiently exercised' to be optimised against.
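As a rough sketch of the reporting described above, the snippet below counts how many data points of each property type exercise each smirks pattern, and lists the patterns that meet the desired number (default 2) for every property type. The function names, the property type labels, the example smirks patterns and the input layout are all assumptions made for illustration; they are not taken from this repo.

```python
from collections import Counter


def count_exercised(data_points):
    """Count data points per (smirks, property type) pair.

    `data_points` is an iterable of (property_type, smirks_patterns) tuples,
    where `smirks_patterns` is the set of patterns that data point exercises.
    """
    counts = Counter()
    for property_type, smirks_patterns in data_points:
        for smirks in smirks_patterns:
            counts[(smirks, property_type)] += 1
    return counts


def sufficiently_exercised(counts, property_types, desired=2):
    """Return the smirks with at least `desired` points for every property type."""
    all_smirks = sorted({smirks for smirks, _ in counts})
    return [
        smirks
        for smirks in all_smirks
        if all(counts[(smirks, p)] >= desired for p in property_types)
    ]


# Hypothetical data set: each entry is one data point.
property_types = ["Density", "EnthalpyOfMixing"]
data_points = [
    ("Density", {"[#6X4:1]", "[#8X2H1+0:1]"}),
    ("Density", {"[#6X4:1]"}),
    ("EnthalpyOfMixing", {"[#6X4:1]", "[#8X2H1+0:1]"}),
    ("EnthalpyOfMixing", {"[#6X4:1]"}),
    ("EnthalpyOfMixing", {"[#8X2H1+0:1]"}),
]

counts = count_exercised(data_points)

# Plain-text table of counts per smirks pattern and property type.
print(f"{'smirks':<16}" + "".join(f"{p:>18}" for p in property_types))
for smirks in sorted({s for s, _ in counts}):
    print(f"{smirks:<16}" + "".join(f"{counts[(smirks, p)]:>18}" for p in property_types))

print("Sufficiently exercised:", sufficiently_exercised(counts, property_types))
```

In this made-up example only `[#6X4:1]` reaches two data points for both property types, so it is the only pattern reported as sufficiently exercised.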
Status