Cleanup xyz to SMILES functionality

Strandgaard96 commented 4 months ago

Added csv files with ligand id for each ligand
Changed function name and added disclaimer in README.md
Added/modified documentation and removed unnecessary for-loop
Minor documentation adjustments
Renamed function to make it more clear. More documentation and cleanup.

These changes should all make it easier to read and understand the workflow. RDKit version disclaimer to be transparent about reproducibility issues.

tlinjordet commented 4 months ago

Hi @Strandgaard96,

Thanks for this. I think all the docstring improvements are very good and helpful, as well as the clarification about sensitivity to RDkit version.

Note that it would be easier to review smaller, independently mergeable PRs, especially for the all the minor edits.

This PR required perhaps surprisingly a fair bit of testing, hence the detailed message here and a fairly extensive Jupyter notebook to go along with this message. I hope you will read this message carefully and also look carefully (read and run cells) through the notebook on the private repo: tmcinvdes_PR11_reproduction_explorative_analysis_and_repair_attempt.ipynb now placed under test_notebooks in the private repo. It should run as-is from the updated private repo, except updating the local path to a clone of the tmQMg-L repo.

I hope some of the things I tried in that notebook can be helpful or inspiring, and I hope you push any relevant further work, e.g., modifying or extending that notebook as a sibling notebook in the private repo. There are some kekulize errors and other things that you know better than me, and if you see solutions I would be very interested.

All this said, improving the reproduction accuracy of (0->1) is not that high of a priority, but we should make sure not to lose accuracy either.

1) Renaming `select_initial_ligands.py` into `get_substituted_ligands.py`:

For renaming the Python script or any of the directory/filenames explicitly laid out in the plans, I would prefer to discuss it, e.g., by email, before a PR, since it will require an update to the plan and possible downstream renaming tasks, etc.

I would say that I partly agree with the point of this change, i.e., this (0->1) script does select the initial ligands used in the training of unconditional JT-VAE models, but not only; it also encodes them in the enriched SMILES format. Also, this encoding is a bit of a filter since not all ligands from tmQMg-L with the right denticity, hapticity, and charge criteria, survive the enrichment part of the process.
I think it would be confusing to use the term "substituted" here since in the code and plan for filenames and data file .csv column names we mostly refer to the encoding of SMILES as "Enriched SMILES" (or enriched_smiles). I know there are utility functions referencing substitution, but I think it would be clearer to have consistent terminology at least on the top level, i.e., filenames, .csv column names, etc.
One of the reasons I prefer "enriched" over "substituted" is that the former has a clearly additive meaning, i.e., we add the Li, Be, or Ir, whereas substitution implies replacing a thing with another thing. This may make sense for us too, but in the encoded form of SMILES, the additive nature of the encoding is manifest and evident, and the reversal of the encoding is clearly subtractive.
If we instead go with "substituted" here I think it should for consistency and clarity be rolled out into the .csv column headers as well, i.e., "Substituted SMILES" instead of "Enriched SMILES", but I am not convinced and it would be extra work. Another reason to keep "Enriched SMILES" as a fixed key term. If so, we could also consider a more sober term than "Canonical SMILES".
If we do want to change select_initial_ligands.py---and I appreciate the idea of imbuing the script filename with a meaning that gives a more complete picture of the process---I would prefer alternatives like get_enriched_smiles.py, get_enriched_ligands.py, select_and_enrich_ligands.py, or something else over get_substituted_ligands.py.

2) `tmcinvdes/ligand_generation/README.md` and RDkit version:

Great! We should probably include the denticity in the example, like so:

python get_substituted_smiles.py -d monodentate -i ~/git/tmQMg-L -o output_dir

Based on reproduction testing, it looks like RDkit version and build (from conda list | grep rdkit)

rdkit                     2023.09.1        py39hce5ca95_0    conda-forge

is the one originally used for the bidentates, while 2023.03.3 was presumably the one used for the monodentates.

With 2023.03.3, the selected and enriched set of bidentate ligands is 36 less than with 2023.09.1. Unclear why I missed this during review of PR #9 , but switching back and forth between that environment and the one I still had when reviewing PR #3 now, both for the code on main and the dev branch PR'ed here, I am getting the same bidentate ligand IDs not being selected repeatedly with RDkit 2023.03.3:

{'ligand10439-0', 'ligand11665-0', 'ligand12086-1', 'ligand12629-0', 'ligand13127-0', 'ligand13779-0',
'ligand1409-0', 'ligand14551-0', 'ligand17680-0', 'ligand18632-0', 'ligand18706-0', 'ligand19170-0',
'ligand20323-1', 'ligand20343-0', 'ligand2041-0', 'ligand20946-0', 'ligand2163-0', 'ligand22578-0',
'ligand25308-0', 'ligand27495-0', 'ligand27563-0', 'ligand29586-1', 'ligand29750-0', 'ligand29762-0',
'ligand31076-0', 'ligand31177-0', 'ligand31608-0', 'ligand31838-0', 'ligand4170-0', 'ligand4216-0',
'ligand4332-0', 'ligand4560-0', 'ligand4927-0', 'ligand5388-0', 'ligand6156-0', 'ligand7054-0'}

There are no extraneous ligand IDs being selected, only 36 missing.

With 2023.03.3, the selected and enriched set of monodentate ligands is the same size as the set used during training when reproducing with current code, but 81 ligand IDs differ between tmQMg-L_mono_with_ids.csv and the tmQMg-L_mono.csv that is output with get_substituted_smiles.py in this PR's dev branch. In contrast, only one ligand ID differs between the current output for monodentates and monodentate_training.csv when using RDkit version 2023.09.1.

Thus it seems RDkit 2023.09.1 is right for reproducing the bidentate ligand selection (and partly the encoding/enrichment), and that RDkit 2023.09.1 is also closer than RDkit 2023.03.3 to being the best choice for reproducing tmQMg-L_mono.txt.

For now I think it is better to keep monodentate_training.csv, and I think both bidentate_training.csv and tmQMg-L_bi_with_ids.csv are superseded by the tmQMg-L_bi.csv that is produced by running select_initial_ligands.py/get_substituted_ligands.py with version RDkit 2023.09.1.

3) Further issues with `tmQMg-L_mono_with_ids.csv`

I note that tmQMg-L_mono_with_ids.csv is presumably an attempt to identify the closest tmQMg-L ligand IDs, where the old monodentate_training.csv differed from the ground truth tmQMg-L_mono.txt by some 15 unidentified ligands.
However, in the new tmQMg-L_mono_with_ids.csv file in this PR, it seems that the same tmQMg-L ligand ID has been repeatedly selected as the origin of various enriched SMILES in tmQMg-L_mono.txt. Thus we have fewer unique tmQMg-L ligand IDs in tmQMg-L_mono_with_ids.csv (3076) than there are enriched SMILES in tmQMg-L_mono.txt (4511).

One quick way to see this (from Jupyter notebook tmcinvdes_PR11_reproduction_explorative_analysis_and_repair_attempt.ipynb now placed under test_notebooks in the private repo):

df_orig_mono_csv = pd.read_csv("../datasets/01_tmQMg-L-training_sets/tmQMg-L_mono_with_ids.csv", low_memory=False,
                     usecols=lambda c: not c.startswith('Unnamed:'))

for ligand_id in df_orig_mono_csv.name.values.tolist():
    df_temp = df_orig_mono_csv[df_orig_mono_csv.name == ligand_id]
    if df_temp.shape[0] > 1:
        print(ligand_id)
        print(df_temp.enriched_smiles.values.tolist())
        print("")

Make sure to check for this type of thing!

I think re-capturing the last missing monodentate ligands is a good thing, and I suspect that the solution will lie in using these two facts:

The selection and enrichment loop is orderly, so each tmQMg-L ligand ID is used in increasing order to potentially add to the selected and enriched subset.
Each tmQMg-L ligand ID is tried only once to obtain an enriched SMILES string.

It could also be worth noting:

Of the 4511 enriched SMILES in tmQMg-L_mono.txt, there are 3 duplicates, so only 4508 unique strings are in tmQMg-L_mono.txt. The duplicated enriched SMILES are [Li]<-n1cc([N+](=O)[O-])n(CCO)c1C, [Li]<-n1cc([N+](=O)[O-])n(CCS(=O)(=O)CC)c1C, [Li]<-P1(c2c(C(F)(F)F)cccc2C(F)(F)F)OCCO1.

Ultimately, the old monodentate_training.csv was closer to tmQMg-L_mono.txt than tmQMg-L_mono_with_ids.csv.

The jupyter notebook goes further than this, but I didn't find anything conclusive yet.

4) tmQMg-L ligand IDs

One note on the docstring(s) mentioning "label" of tmQMg-L ligands. The plan is to use "tmQMg-L ligand IDs" as a reserved key term for the ID of ligands used in tmQMg-L instead of "label" since we have already chosen to used "labeled" as a short form designation that homoleptic TMCs and by extension their constituent ligands being labeled in the supervised machine learning sense with the DFT/ORCA-calculated target properties.

Of course the code that interacts with the tmQMg-L data must use the string "label" as a an access key of the native column headers from that dataset, but that can probably be the extent of it.

Strandgaard96 commented 4 months ago

@tlinjordet Perhaps adding the files could have been split into a separate PR, good suggestion, noted for future PRs :)

1) Renaming select_initial_ligands.py into get_substituted_ligands.py: I agree that perhaps the name could be a bit more clear. I thought the term "substituted" was used in the preprint. Coming from the preprint i think "get_encoded_ligands.py" should be very clear for the reader as this is what we use in the preprint. Also whatever we name the columns referring to the encoded smiles, I think the reader would be able to figure out that this is the encoded ligands given the presence of Be-Li-Ir elements.

When it comes to renaming things, i think it is a too slow process if we need to confer everytime we make a small change like this. It should be clear to the reader, that is all that matters. This is also quite dynamic and I think having a set plan is not the way to go as we will probably rename things many times through iterations and it will take too much time.

2) tmcinvdes/ligand_generation/README.md and RDkit version:

I am positive that the same version was used for both training sets. And it should be 2023.03.3 for both. I think spending more time on investigating this is not the way to go. I looked into the notebook and can see the issues, but this is a timesink we can spend so much time on. It is an issue with RDKit and reproducibility and we can just make the reader aware of this.

3) Further issues with tmQMg-L_mono_with_ids.csv

Thanks for spotting this! I think the way i retrieved the ligand IDs had a bug. Removed these csv files from the PR and will investigate. Again, this is not of paramount importance as the previous csv files monodentate_training.csv and bidentate_training.csv there were very close to the original training sets, bar 15 monodentate ligands or so. I just rolled back the datasets and can return to solving this at a later stage as we should be fine with the monodentate_training.csv and bidentate_training.csv files for now.

4) tmQMg-L ligand IDs

I think it is quite clear as of now what the ligand label refers to. We can change this at a later time if needed.

uiocompcat / tmcinvdes