lucpaoli / SAFT_ML

Generate large dataset of molecules #11

Closed lucpaoli closed 10 months ago

lucpaoli commented 11 months ago

One possibility:

longemen3000 commented 11 months ago

https://doi.org/10.1016/j.molliq.2023.122480 has in its SA a database with about 1600 components: SMILES, CAS numbers, critical properties, translation and Twu parameters and, more importantly, UNIFAC group fragmentations (all of those, except the group fragmentations, are in the latest Clapeyron release). My idea is to use those as a test bed for GCIdentifier.jl. Our main problem there right now is the lack of adequate SMARTS queries to match some groups, but the algorithm seems fine at the moment.

lucpaoli commented 11 months ago

@longemen3000 First of all, that's a fantastic resource - thank you very much. In a related effort, I'm trying to load in the data from this paper: https://pubs.acs.org/doi/10.1021/acs.iecr.3c02255?goto=supporting-info into Clapeyron-compatible database files so I can use it to generate synthetic data for a wide range of molecules.

In particular, I'm not sure where to map the kappa_ab and epsilon_k_ab to, as the example db file in Clapeyron seems to require explicit specification of which sites are interacting, as well as species1 & species2 - while we're just trying to model pures. Do you have any ideas @longemen3000, @pw0908 ? Thank you!

This is the SI file: SI_pcp-saft_parameters.csv. It has the following columns:

common_name, iupac_name, inchi, canonical_smiles, isomeric_smiles, cas, family, molarweight, m, sigma, epsilon_k, mu, kappa_ab, epsilon_k_ab, na, nb, mard_psat, t_min_psat, t_max_psat, points_psat, mard_psat_incl_outlier, points_psat_incl_outlier, mard_density, t_min_density, t_max_density, points_density_liquid, mard_density_liquid_single_phase, points_density_liquid_single_phase, mard_density_equi, points_density_equi, mard_density_vapor, points_density_vapor, opt, bounds_violation

The code I'm currently using is here:

using CSV, DataFrames, DelimitedFiles

# Read the SI parameter table; the first row holds the column names.
df = CSV.read("SI_pcp-saft_parameters.csv", DataFrame; header=1)

# Columns of the Clapeyron database file:
# species Mw segment sigma epsilon dipole n_H n_e source
num_rows = nrow(df)

# Build the target table directly from the SI columns.
df2 = DataFrame(
    species = df.common_name,
    Mw = df.molarweight,
    segment = df.m,
    sigma = df.sigma,
    epsilon = df.epsilon_k,
    dipole = df.mu,
    n_H = fill(missing, num_rows),  # not mapped yet
    n_e = fill(missing, num_rows),  # not mapped yet
    source = fill("10.1021/acs.iecr.3c02255", num_rows),
)
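
On the kappa_ab / epsilon_k_ab question, one possible mapping for pure components is a self-association row, where species1 and species2 are the same species and the two sites are the donor and acceptor. The sketch below is only an illustration under assumptions: the column names epsilon_assoc and bondvol, the site labels "H" and "e", and the rule that non-associating species have missing values are all guesses, not confirmed against the Clapeyron database format or the SI file. The two-row table is a hypothetical stand-in with made-up values.

```julia
using DataFrames

# Hypothetical stand-in for the SI table; values are illustrative, not from the paper.
si = DataFrame(
    common_name = ["alcohol_a", "alkane_b"],
    kappa_ab = [0.03, missing],
    epsilon_k_ab = [2500.0, missing],
)

# Assumption: species without association parameters carry missing values.
assoc_rows = dropmissing(si, [:kappa_ab, :epsilon_k_ab])

# Self-association: the same species appears as species1 and species2,
# interacting through an assumed "H" (donor) and "e" (acceptor) site pair.
assoc = DataFrame(
    species1 = assoc_rows.common_name,
    site1 = fill("H", nrow(assoc_rows)),
    species2 = assoc_rows.common_name,
    site2 = fill("e", nrow(assoc_rows)),
    epsilon_assoc = assoc_rows.epsilon_k_ab,  # assumed column name
    bondvol = assoc_rows.kappa_ab,            # assumed column name
)
```

If the SI file encodes non-associating species with zeros rather than missing values, the filter would need to be a comparison against zero instead of dropmissing.
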
pw0908 commented 11 months ago

I believe Andrés is extracting that database for Clapeyron as we speak. Maybe wait a few days?

longemen3000 commented 11 months ago

I have the db already integrated in a branch, ready to merge and release a new version

lucpaoli commented 11 months ago

Yup, I see that's released! Thank you so much @longemen3000

MichaelGadaloff commented 10 months ago

Generated 500 saturation points (temperature, pressure, and vapour and liquid volumes) for 1839 compounds using PCP-SAFT. Stored the data in a CSV with polarity, functional group family, SMILES, CAS, molar weight, critical conditions (predicted), number of source experimental data points, and AAD from experimental data.

longemen3000 commented 10 months ago

If you use tcPR (with estimate_alpha = false and estimate_translation = false), the saturated liquid volume at Tr = 0.8 is the experimental one.
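
A minimal sketch of that check, assuming Clapeyron.jl is installed and the chosen component has tcPR parameters in its database ("water" is only an illustrative choice, not a compound named in this thread):

```julia
using Clapeyron

# Turn off the estimation fallbacks so only database parameters are used.
model = tcPR(["water"]; estimate_alpha = false, estimate_translation = false)

# Critical point of the pure component: temperature, pressure, volume.
Tc, Pc, Vc = crit_pure(model)

# Saturation state at a reduced temperature of 0.8.
T = 0.8 * Tc
p_sat, v_liq, v_vap = saturation_pressure(model, T)

println("T = ", T, " K, saturated liquid volume = ", v_liq, " m^3/mol")
```

The v_liq value here is what could be compared against the liquid volume from the PCP-SAFT dataset at the same temperature.
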