uiocompcat / tmcinvdes

Public repository for preprint "Deep Generative Model for the Dual-Objective Inverse Design of Metal Complexes"
MIT License
2 stars 0 forks source link

Cleanup xyz to SMILES functionality #11

Closed Strandgaard96 closed 4 months ago

Strandgaard96 commented 4 months ago

These changes should all make it easier to read and understand the workflow. RDKit version disclaimer to be transparent about reproducibility issues.

tlinjordet commented 4 months ago

Hi @Strandgaard96,

Thanks for this. I think all the docstring improvements are very good and helpful, as well as the clarification about sensitivity to RDkit version.

Note that it would be easier to review smaller, independently mergeable PRs, especially for the all the minor edits.

This PR required perhaps surprisingly a fair bit of testing, hence the detailed message here and a fairly extensive Jupyter notebook to go along with this message. I hope you will read this message carefully and also look carefully (read and run cells) through the notebook on the private repo: tmcinvdes_PR11_reproduction_explorative_analysis_and_repair_attempt.ipynb now placed under test_notebooks in the private repo. It should run as-is from the updated private repo, except updating the local path to a clone of the tmQMg-L repo.

I hope some of the things I tried in that notebook can be helpful or inspiring, and I hope you push any relevant further work, e.g., modifying or extending that notebook as a sibling notebook in the private repo. There are some kekulize errors and other things that you know better than me, and if you see solutions I would be very interested.

All this said, improving the reproduction accuracy of (0->1) is not that high of a priority, but we should make sure not to lose accuracy either.

1) Renaming select_initial_ligands.py into get_substituted_ligands.py:

For renaming the Python script or any of the directory/filenames explicitly laid out in the plans, I would prefer to discuss it, e.g., by email, before a PR, since it will require an update to the plan and possible downstream renaming tasks, etc.

2) tmcinvdes/ligand_generation/README.md and RDkit version:

Great! We should probably include the denticity in the example, like so:

python get_substituted_smiles.py -d monodentate -i ~/git/tmQMg-L -o output_dir

Based on reproduction testing, it looks like RDkit version and build (from conda list | grep rdkit)

rdkit                     2023.09.1        py39hce5ca95_0    conda-forge

is the one originally used for the bidentates, while 2023.03.3 was presumably the one used for the monodentates.

Thus it seems RDkit 2023.09.1 is right for reproducing the bidentate ligand selection (and partly the encoding/enrichment), and that RDkit 2023.09.1 is also closer than RDkit 2023.03.3 to being the best choice for reproducing tmQMg-L_mono.txt.

For now I think it is better to keep monodentate_training.csv, and I think both bidentate_training.csv and tmQMg-L_bi_with_ids.csv are superseded by the tmQMg-L_bi.csv that is produced by running select_initial_ligands.py/get_substituted_ligands.py with version RDkit 2023.09.1.

3) Further issues with tmQMg-L_mono_with_ids.csv

One quick way to see this (from Jupyter notebook tmcinvdes_PR11_reproduction_explorative_analysis_and_repair_attempt.ipynb now placed under test_notebooks in the private repo):

df_orig_mono_csv = pd.read_csv("../datasets/01_tmQMg-L-training_sets/tmQMg-L_mono_with_ids.csv", low_memory=False,
                     usecols=lambda c: not c.startswith('Unnamed:'))

for ligand_id in df_orig_mono_csv.name.values.tolist():
    df_temp = df_orig_mono_csv[df_orig_mono_csv.name == ligand_id]
    if df_temp.shape[0] > 1:
        print(ligand_id)
        print(df_temp.enriched_smiles.values.tolist())
        print("")

Make sure to check for this type of thing!

I think re-capturing the last missing monodentate ligands is a good thing, and I suspect that the solution will lie in using these two facts:

It could also be worth noting:

Ultimately, the old monodentate_training.csv was closer to tmQMg-L_mono.txt than tmQMg-L_mono_with_ids.csv.

The jupyter notebook goes further than this, but I didn't find anything conclusive yet.

4) tmQMg-L ligand IDs

One note on the docstring(s) mentioning "label" of tmQMg-L ligands. The plan is to use "tmQMg-L ligand IDs" as a reserved key term for the ID of ligands used in tmQMg-L instead of "label" since we have already chosen to used "labeled" as a short form designation that homoleptic TMCs and by extension their constituent ligands being labeled in the supervised machine learning sense with the DFT/ORCA-calculated target properties.

Strandgaard96 commented 4 months ago

@tlinjordet Perhaps adding the files could have been split into a separate PR, good suggestion, noted for future PRs :)

1) Renaming select_initial_ligands.py into get_substituted_ligands.py: I agree that perhaps the name could be a bit more clear. I thought the term "substituted" was used in the preprint. Coming from the preprint i think "get_encoded_ligands.py" should be very clear for the reader as this is what we use in the preprint. Also whatever we name the columns referring to the encoded smiles, I think the reader would be able to figure out that this is the encoded ligands given the presence of Be-Li-Ir elements.

When it comes to renaming things, i think it is a too slow process if we need to confer everytime we make a small change like this. It should be clear to the reader, that is all that matters. This is also quite dynamic and I think having a set plan is not the way to go as we will probably rename things many times through iterations and it will take too much time.

2) tmcinvdes/ligand_generation/README.md and RDkit version:

I am positive that the same version was used for both training sets. And it should be 2023.03.3 for both. I think spending more time on investigating this is not the way to go. I looked into the notebook and can see the issues, but this is a timesink we can spend so much time on. It is an issue with RDKit and reproducibility and we can just make the reader aware of this.

3) Further issues with tmQMg-L_mono_with_ids.csv

Thanks for spotting this! I think the way i retrieved the ligand IDs had a bug. Removed these csv files from the PR and will investigate. Again, this is not of paramount importance as the previous csv files monodentate_training.csv and bidentate_training.csv there were very close to the original training sets, bar 15 monodentate ligands or so. I just rolled back the datasets and can return to solving this at a later stage as we should be fine with the monodentate_training.csv and bidentate_training.csv files for now.

4) tmQMg-L ligand IDs

I think it is quite clear as of now what the ligand label refers to. We can change this at a later time if needed.