hille721 / solvatum

Solv@TUM - The Solvation Free Energy Database
http://doi.org/10.14459/2018mp1452571.001
Creative Commons Attribution Share Alike 4.0 International

CSV File Version of Database #2

Open JacksonBurns opened 3 months ago

JacksonBurns commented 3 months ago

Thank you for this excellent resource! I work with machine learning, and most often we only need SMILES and target values for datasets like this. To that end, I have created a CSV file that contains the SMILES strings and free energy of solvation values for this dataset. I thought it might be useful to others, so I have attached it to this comment: solvatum.csv

I will also post the code used to produce the data below, in case others find it useful (I used Python 3.8 and the version of solvatum from this GitHub repository: https://github.com/wwang2/solvatum):

from solvatum.ui import Database
from py2opsin import py2opsin

d = Database()
with open("solvatum.csv", "w") as file:
    file.write("solute_smiles,solvent_smiles,dG_solv\n")
    for solvent_name in d.solvents:
        try:
            solvent_smiles = d.get_molecule_properties(solvent_name)["SMILES"]
        except Exception as e:
            print(str(e))
            solvent_smiles = py2opsin(solvent_name)
            if not solvent_smiles:
                solvent_smiles = input(f"Could not find SMILES for {solvent_name}, please provide it:")
        for solute_name in d.solutes:
            try:
                result = d.filtering(solvent=solvent_name, solute=solute_name)["deltaG_solv"]
            except KeyError:
                print("Missing", solute_name, "in", solvent_name)
                continue
            try:
                solute_smiles = d.get_molecule_properties(solute_name)["SMILES"]
            except Exception as e:
                print(str(e))
                solute_smiles = py2opsin(solute_name)
                if not solute_smiles:
                    solute_smiles = input(f"Could not find SMILES for {solute_name}, please provide it:")
            file.write(f"{solute_smiles},{solvent_smiles},{result}\n")

hille721 commented 3 months ago

Hi @JacksonBurns,

Great to see that this is still useful for others! I would be happy if you also cited our background paper: https://doi.org/10.1063/1.5050938.

I guess you are using @wwang2's version because the original code base lacks Python 3 support. Unfortunately, he never came back to his PR https://github.com/hille721/solvatum/pull/1. If you could verify that the changes in this PR work for you, I can merge it and we will have Python 3 support here as well. (I no longer work in this area and can only do basic validation.)

If you want, you can also create a PR that adds your code as an additional function (e.g. to_csv), so that everyone benefits directly.
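As a rough illustration of what such a function could look like, here is a minimal sketch. It is not part of solvatum: the name `to_csv` comes from the suggestion above, the `name_to_smiles` parameter is hypothetical, and the database calls (`solvents`, `solutes`, `filtering`, `get_molecule_properties`) are assumed to behave the way they are used in the script posted earlier in this issue.

```python
import csv
from typing import Callable, Optional


def to_csv(db, path: str,
           name_to_smiles: Optional[Callable[[str], Optional[str]]] = None) -> int:
    """Write one row per (solute, solvent) pair that has a value; return the row count.

    `db` is assumed to expose `solvents`, `solutes`, `filtering(...)` and
    `get_molecule_properties(...)` as used in the script above (assumption).
    `name_to_smiles` is a hypothetical fallback for names without a stored SMILES.
    """

    def smiles_for(name: str) -> str:
        try:
            return db.get_molecule_properties(name)["SMILES"]
        except Exception:
            # No stored SMILES: try the optional fallback, else leave the field empty.
            smiles = name_to_smiles(name) if name_to_smiles else None
            return smiles or ""

    rows_written = 0
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)  # handles quoting if a field ever needs it
        writer.writerow(["solute_smiles", "solvent_smiles", "dG_solv"])
        for solvent_name in db.solvents:
            solvent_smiles = smiles_for(solvent_name)
            for solute_name in db.solutes:
                try:
                    value = db.filtering(solvent=solvent_name,
                                         solute=solute_name)["deltaG_solv"]
                except KeyError:
                    # No entry for this solute/solvent combination: skip it.
                    continue
                writer.writerow([smiles_for(solute_name), solvent_smiles, value])
                rows_written += 1
    return rows_written
```

With the setup from the script above, a call might look like `to_csv(Database(), "solvatum.csv", name_to_smiles=py2opsin)`. Unlike the original script, this sketch never prompts for input. It writes an empty SMILES field when neither the database nor the fallback yields one, so such rows are easy to find and fix afterwards.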

Anyway, many thanks for sharing. I will leave this issue open so that anyone interested will see it.

JacksonBurns commented 3 months ago

The version in the linked PR worked fine for me, so you can probably just go ahead and merge it. The generated CSV file is included in my initial comment, so I won't add it to the codebase.

hille721 commented 3 months ago

Thanks for the verification. PR #1 is now merged, so Python 3 compatibility should be in place.