Closed barmoral closed 1 month ago
I can reproduce this. Something about this dataset must differ from the other supported properties in a way that makes parsing fail, or the property may not be loaded correctly as a plugin.
Whatever's going wrong is surfacing from here: https://github.com/openforcefield/openff-evaluator/blob/ca084dfa9f1d6531f1dac5d92124b15429429449/openff/evaluator/datasets/thermoml/thermoml.py#L2188
It can't be that all identifiers are missed; otherwise it wouldn't think everything was pure water.
Here's the script I'm using to test, based on what you shared:
from openff.units import unit

from openff.evaluator.datasets import PhysicalProperty, PropertyPhase
from openff.evaluator.datasets.thermoml import thermoml_property
from openff.evaluator.datasets.thermoml.thermoml import ThermoMLDataSet
from openff.evaluator.plugins import register_default_plugins, register_external_plugins


# Register a custom ThermoML property so the parser recognizes it.
@thermoml_property(
    "Osmotic coefficient",
    supported_phases=PropertyPhase.Liquid | PropertyPhase.Gas,
)
class OsmoticCoefficient(PhysicalProperty):
    @classmethod
    def default_unit(cls):
        return unit.dimensionless


register_default_plugins()
register_external_plugins()

# Load the data set directly from the ThermoML archive.
ds = ThermoMLDataSet._from_url(
    "https://trc.nist.gov/ThermoML/10.1016/j.fluid.2006.09.025.xml"
)
I ultimately traced this back to an incorrect molecular weight (MW) calculation. Because of it, the non-O (non-water) compound gets dropped at the lines below: its apparent mole fraction comes out around 1e-27. Raising #569 to fix.
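For anyone following along, here is a rough sketch of how a wrong molecular weight can make a component effectively vanish. It is not the evaluator code. The function name, the molality value, and in particular the failure mode (using the mass of a single molecule in place of the molar mass) are assumptions chosen only to show how a mole fraction of this magnitude could arise.

```python
# Illustrative only: this is NOT the evaluator implementation.
AVOGADRO = 6.02214076e23  # 1/mol
WATER_MW_KG_PER_MOL = 18.015e-3  # approximate molar mass of water


def solute_mole_fraction(molality, solvent_mw_kg_per_mol):
    """Mole fraction of the solute in a binary solution, given its molality (mol/kg)."""
    # Per kilogram of solvent there are `molality` moles of solute
    # and 1 / M_solvent moles of solvent.
    n_solute = molality
    n_solvent = 1.0 / solvent_mw_kg_per_mol
    return n_solute / (n_solute + n_solvent)


molality = 0.05  # mol/kg, an arbitrary dilute value (assumption)

x_correct = solute_mole_fraction(molality, WATER_MW_KG_PER_MOL)

# Assumed failure mode: the "molecular weight" is the mass of a single
# molecule in kg rather than the mass of one mole.
x_wrong = solute_mole_fraction(molality, WATER_MW_KG_PER_MOL / AVOGADRO)

print(f"correct MW: x_solute = {x_correct:.6f}")
print(f"wrong MW:   x_solute = {x_wrong:.3e}")
```

With these inputs the first value is close to 9e-4 and the second is of order 1e-27, small enough that any filter on negligible components would plausibly discard the solute and leave pure water.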
With the current development head (which would most likely land in 0.4.10), including @lilyminium's recent fix, I think this is doing what one would expect? I reused my earlier code snippet unchanged:
In [26]: df = ds.to_pandas()
In [27]: df.describe()
Out[27]:
       Temperature (K)  N Components  Mole Fraction 1  Mole Fraction 2  OsmoticCoefficient Value ()  OsmoticCoefficient Uncertainty ()
count           241.00         241.0       241.000000       241.000000                   241.000000                         241.000000
mean            298.15           2.0         0.011742         0.988258                     0.651477                           0.008793
std               0.00           0.0         0.010405         0.010405                     0.211759                           0.005274
min             298.15           2.0         0.000855         0.948725                     0.219100                           0.000550
25%             298.15           2.0         0.003139         0.982043                     0.530000                           0.004300
50%             298.15           2.0         0.008380         0.991620                     0.662500                           0.008450
75%             298.15           2.0         0.017957         0.996861                     0.833900                           0.011900
max             298.15           2.0         0.051275         0.999145                     0.977700                           0.019500
In [28]: df.head()
Out[28]:
Id Temperature (K) Pressure (kPa) Phase N Components ... Mole Fraction 2 Exact Amount 2 OsmoticCoefficient Value () OsmoticCoefficient Uncertainty () Source
0 c2e7b442254f4541b41b0869241d66b1 298.15 None Liquid + Gas 2 ... 0.999140 None 0.7389 0.00655 10.1016/j.fluid.2006.09.025
1 befcc793e1054dd38b5df717d6603b95 298.15 None Liquid + Gas 2 ... 0.998963 None 0.7142 0.00715 10.1016/j.fluid.2006.09.025
2 8768e8a84b6d4267b4f884d95fbece95 298.15 None Liquid + Gas 2 ... 0.998622 None 0.6730 0.00820 10.1016/j.fluid.2006.09.025
3 e2acf2ede41b444ea445e66b5ebb5f83 298.15 None Liquid + Gas 2 ... 0.998378 None 0.6485 0.00880 10.1016/j.fluid.2006.09.025
4 d8c3e030b0ff49baad2ebcb2c62444a6 298.15 None Liquid + Gas 2 ... 0.998211 None 0.6324 0.00925 10.1016/j.fluid.2006.09.025
[5 rows x 16 columns]
In [29]: df['Component 1']
Out[29]:
0 CC[N+](C)(CC)CC.[I-]
1 CC[N+](C)(CC)CC.[I-]
2 CC[N+](C)(CC)CC.[I-]
3 CC[N+](C)(CC)CC.[I-]
4 CC[N+](C)(CC)CC.[I-]
...
236 CCCCCCC[N+](CC)(CC)CC.[I-]
237 CCCCCCC[N+](CC)(CC)CC.[I-]
238 CCCCCCC[N+](CC)(CC)CC.[I-]
239 CCCCCCC[N+](CC)(CC)CC.[I-]
240 CCCCCCC[N+](CC)(CC)CC.[I-]
Name: Component 1, Length: 241, dtype: object
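A quick consistency check on the summary statistics above: in a two-component mixture the mole fractions of each row should sum to one, so the mean of one column plus the mean of the other, and the minimum of one plus the maximum of the other, should each be 1. The sketch below only re-adds numbers copied from the df.describe() output; the variable names are mine.

```python
import math

# Values copied from the df.describe() output above.
mole_fraction_1 = {"mean": 0.011742, "min": 0.000855, "max": 0.051275}
mole_fraction_2 = {"mean": 0.988258, "min": 0.948725, "max": 0.999145}

# If x1 + x2 = 1 on every row, these pairs must also sum to one.
sums = {
    "mean1 + mean2": mole_fraction_1["mean"] + mole_fraction_2["mean"],
    "min1 + max2": mole_fraction_1["min"] + mole_fraction_2["max"],
    "max1 + min2": mole_fraction_1["max"] + mole_fraction_2["min"],
}

all_consistent = all(math.isclose(total, 1.0, abs_tol=1e-6) for total in sums.values())

for label, total in sums.items():
    print(f"{label}: {total:.6f}")
print("consistent with a binary mixture:", all_consistent)
```

This does not validate the parsed compositions against the paper. It only confirms that both components are now present and that their fractions are mutually consistent.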
I haven't worked with this data, but I see
Describe the bug
I'm trying to use evaluator to filter papers with osmotic coefficient values from ThermoML. I've successfully created the property type, filtered out the DOIs with osmotic coefficients, converted them to a pandas dataframe, and written the dataframe to a CSV file. However, evaluator does not recognize or read all of the substances reported in the papers. It recognizes only one component, even though the ThermoML .xml data reports other identifiers (StandardInChI, CommonName).
To Reproduce
Register Custom ThermoML Property:
Load ThermoML Data Set:
ds = ThermoMLDataSet.from_doi('10.1016/j.fluid.2006.09.025')
Write to csv:
Check involved compounds:
ds.substances
Attached file: filt_ds_osmcoeff.csv
Output
The command "ds.substances" outputs "{<Substance O{solv}{x=1.000000}>}". Here is a link to the ThermoML report for this example paper, which shows that more compounds are involved: https://trc.nist.gov/ThermoML/10.1016/j.fluid.2006.09.025.html
Computing environment (please complete the following information):
conda list
Additional context
I believe the problem is that the classmethod "from_xml_node" in thermoml.py does not correctly identify the XML identifiers, so it cannot, for example, convert StandardInChI to SMILES.
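One way to test that hypothesis independently of evaluator is to list the compound identifiers straight from the XML. The sketch below uses only the standard library. The XML fragment is hand-written and heavily simplified, not copied from the paper's file, and the namespace URI and element names (Compound, RegNum/nOrgNum, sStandardInChI, sCommonName) are my assumption of what the ThermoML schema uses, so check them against the real file. The helper name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed ThermoML namespace; verify against the actual .xml file.
THERMOML_NS = {"t": "http://www.iupac.org/namespaces/ThermoML"}

# Hand-written, simplified stand-in for a ThermoML report (illustrative data).
xml_text = """<DataReport xmlns="http://www.iupac.org/namespaces/ThermoML">
  <Compound>
    <RegNum><nOrgNum>1</nOrgNum></RegNum>
    <sStandardInChI>InChI=1S/H2O/h1H2</sStandardInChI>
    <sCommonName>water</sCommonName>
  </Compound>
  <Compound>
    <RegNum><nOrgNum>2</nOrgNum></RegNum>
    <sStandardInChI>InChI=1S/ClH.Na/h1H;/q;+1/p-1</sStandardInChI>
    <sCommonName>sodium chloride</sCommonName>
  </Compound>
</DataReport>"""


def list_compound_identifiers(text):
    """Return {compound index: {"inchi": ..., "name": ...}} for each Compound node."""
    root = ET.fromstring(text)
    compounds = {}
    for node in root.findall("t:Compound", THERMOML_NS):
        index = int(node.findtext("t:RegNum/t:nOrgNum", namespaces=THERMOML_NS))
        compounds[index] = {
            # findtext returns None when the element is absent.
            "inchi": node.findtext("t:sStandardInChI", namespaces=THERMOML_NS),
            "name": node.findtext("t:sCommonName", namespaces=THERMOML_NS),
        }
    return compounds


identifiers = list_compound_identifiers(xml_text)
for index, entry in sorted(identifiers.items()):
    print(index, entry["name"], entry["inchi"])
```

If every compound in the real file shows up with an InChI here, the identifiers themselves are readable and the component is being lost at a later step.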