Closed Strandgaard96 closed 4 months ago
Hi @Strandgaard96,
Thanks for this. I think all the docstring improvements are very good and helpful, as well as the clarification about sensitivity to RDkit version.
Note that it would be easier to review smaller, independently mergeable PRs, especially for the all the minor edits.
This PR required perhaps surprisingly a fair bit of testing, hence the detailed message here and a fairly extensive Jupyter notebook to go along with this message.
I hope you will read this message carefully and also look carefully (read and run cells) through the notebook on the private repo:
tmcinvdes_PR11_reproduction_explorative_analysis_and_repair_attempt.ipynb
now placed under test_notebooks
in the private repo.
It should run as-is from the updated private repo, except updating the local path to a clone of the tmQMg-L repo.
I hope some of the things I tried in that notebook can be helpful or inspiring, and I hope you push any relevant further work, e.g., modifying or extending that notebook as a sibling notebook in the private repo. There are some kekulize errors and other things that you know better than me, and if you see solutions I would be very interested.
All this said, improving the reproduction accuracy of (0->1) is not that high of a priority, but we should make sure not to lose accuracy either.
select_initial_ligands.py
into get_substituted_ligands.py
:For renaming the Python script or any of the directory/filenames explicitly laid out in the plans, I would prefer to discuss it, e.g., by email, before a PR, since it will require an update to the plan and possible downstream renaming tasks, etc.
"Enriched SMILES"
(or enriched_smiles
). I know there are utility functions referencing substitution, but I think it would be clearer to have consistent terminology at least on the top level, i.e., filenames, .csv column names, etc."Substituted SMILES"
instead of "Enriched SMILES"
, but I am not convinced and it would be extra work. Another reason to keep "Enriched SMILES"
as a fixed key term. If so, we could also consider a more sober term than "Canonical SMILES"
. select_initial_ligands.py
---and I appreciate the idea of imbuing the script filename with a meaning that gives a more complete picture of the process---I would prefer alternatives like get_enriched_smiles.py
, get_enriched_ligands.py
, select_and_enrich_ligands.py
, or something else over get_substituted_ligands.py
.tmcinvdes/ligand_generation/README.md
and RDkit version:Great! We should probably include the denticity in the example, like so:
python get_substituted_smiles.py -d monodentate -i ~/git/tmQMg-L -o output_dir
Based on reproduction testing, it looks like RDkit version and build (from conda list | grep rdkit
)
rdkit 2023.09.1 py39hce5ca95_0 conda-forge
is the one originally used for the bidentates, while 2023.03.3
was presumably the one used for the monodentates.
With 2023.03.3
, the selected and enriched set of bidentate ligands is 36 less than with 2023.09.1
.
Unclear why I missed this during review of PR #9 , but switching back and forth between that environment and the one I still had when reviewing PR #3 now, both for the code on main and the dev branch PR'ed here, I am getting the same bidentate ligand IDs not being selected repeatedly with RDkit 2023.03.3
:
{'ligand10439-0', 'ligand11665-0', 'ligand12086-1', 'ligand12629-0', 'ligand13127-0', 'ligand13779-0',
'ligand1409-0', 'ligand14551-0', 'ligand17680-0', 'ligand18632-0', 'ligand18706-0', 'ligand19170-0',
'ligand20323-1', 'ligand20343-0', 'ligand2041-0', 'ligand20946-0', 'ligand2163-0', 'ligand22578-0',
'ligand25308-0', 'ligand27495-0', 'ligand27563-0', 'ligand29586-1', 'ligand29750-0', 'ligand29762-0',
'ligand31076-0', 'ligand31177-0', 'ligand31608-0', 'ligand31838-0', 'ligand4170-0', 'ligand4216-0',
'ligand4332-0', 'ligand4560-0', 'ligand4927-0', 'ligand5388-0', 'ligand6156-0', 'ligand7054-0'}
There are no extraneous ligand IDs being selected, only 36 missing.
With 2023.03.3
, the selected and enriched set of monodentate ligands is the same size as the set used during training when reproducing with current code, but 81 ligand IDs differ between tmQMg-L_mono_with_ids.csv
and the tmQMg-L_mono.csv
that is output with get_substituted_smiles.py
in this PR's dev branch. In contrast, only one ligand ID differs between the current output for monodentates and monodentate_training.csv
when using RDkit version 2023.09.1
.
Thus it seems RDkit 2023.09.1
is right for reproducing the bidentate ligand selection (and partly the encoding/enrichment),
and that RDkit 2023.09.1
is also closer than RDkit 2023.03.3
to being the best choice for reproducing tmQMg-L_mono.txt
.
For now I think it is better to keep monodentate_training.csv
, and I think both bidentate_training.csv
and tmQMg-L_bi_with_ids.csv
are superseded by the tmQMg-L_bi.csv
that is produced by running select_initial_ligands.py
/get_substituted_ligands.py
with version RDkit 2023.09.1
.
tmQMg-L_mono_with_ids.csv
tmQMg-L_mono_with_ids.csv
is presumably an attempt to identify the closest tmQMg-L ligand IDs, where the old monodentate_training.csv
differed from the ground truth tmQMg-L_mono.txt
by some 15 unidentified ligands.tmQMg-L_mono_with_ids.csv
file in this PR, it seems that the same tmQMg-L ligand ID has been repeatedly selected as the origin of various enriched SMILES in tmQMg-L_mono.txt
. Thus we have fewer unique tmQMg-L ligand IDs in tmQMg-L_mono_with_ids.csv
(3076) than there are enriched SMILES in tmQMg-L_mono.txt
(4511).One quick way to see this (from Jupyter notebook tmcinvdes_PR11_reproduction_explorative_analysis_and_repair_attempt.ipynb
now placed under test_notebooks
in the private repo):
df_orig_mono_csv = pd.read_csv("../datasets/01_tmQMg-L-training_sets/tmQMg-L_mono_with_ids.csv", low_memory=False,
usecols=lambda c: not c.startswith('Unnamed:'))
for ligand_id in df_orig_mono_csv.name.values.tolist():
df_temp = df_orig_mono_csv[df_orig_mono_csv.name == ligand_id]
if df_temp.shape[0] > 1:
print(ligand_id)
print(df_temp.enriched_smiles.values.tolist())
print("")
Make sure to check for this type of thing!
I think re-capturing the last missing monodentate ligands is a good thing, and I suspect that the solution will lie in using these two facts:
It could also be worth noting:
tmQMg-L_mono.txt
, there are 3 duplicates, so only 4508 unique strings are in tmQMg-L_mono.txt
. The duplicated enriched SMILES are [Li]<-n1cc([N+](=O)[O-])n(CCO)c1C
, [Li]<-n1cc([N+](=O)[O-])n(CCS(=O)(=O)CC)c1C
, [Li]<-P1(c2c(C(F)(F)F)cccc2C(F)(F)F)OCCO1
.Ultimately, the old monodentate_training.csv
was closer to tmQMg-L_mono.txt
than tmQMg-L_mono_with_ids.csv
.
The jupyter notebook goes further than this, but I didn't find anything conclusive yet.
One note on the docstring(s) mentioning "label" of tmQMg-L ligands. The plan is to use "tmQMg-L ligand IDs"
as a reserved key term for the ID of ligands used in tmQMg-L instead of "label" since we have already chosen to used "labeled" as a short form designation that homoleptic TMCs and by extension their constituent ligands being labeled in the supervised machine learning sense with the DFT/ORCA-calculated target properties.
"label"
as a an access key of the native column headers from that dataset, but that can probably be the extent of it.@tlinjordet Perhaps adding the files could have been split into a separate PR, good suggestion, noted for future PRs :)
1) Renaming select_initial_ligands.py into get_substituted_ligands.py: I agree that perhaps the name could be a bit more clear. I thought the term "substituted" was used in the preprint. Coming from the preprint i think "get_encoded_ligands.py" should be very clear for the reader as this is what we use in the preprint. Also whatever we name the columns referring to the encoded smiles, I think the reader would be able to figure out that this is the encoded ligands given the presence of Be-Li-Ir elements.
When it comes to renaming things, i think it is a too slow process if we need to confer everytime we make a small change like this. It should be clear to the reader, that is all that matters. This is also quite dynamic and I think having a set plan is not the way to go as we will probably rename things many times through iterations and it will take too much time.
2) tmcinvdes/ligand_generation/README.md and RDkit version:
I am positive that the same version was used for both training sets. And it should be 2023.03.3 for both. I think spending more time on investigating this is not the way to go. I looked into the notebook and can see the issues, but this is a timesink we can spend so much time on. It is an issue with RDKit and reproducibility and we can just make the reader aware of this.
3) Further issues with tmQMg-L_mono_with_ids.csv
Thanks for spotting this! I think the way i retrieved the ligand IDs had a bug. Removed these csv files from the PR and will investigate. Again, this is not of paramount importance as the previous csv files monodentate_training.csv and bidentate_training.csv there were very close to the original training sets, bar 15 monodentate ligands or so. I just rolled back the datasets and can return to solving this at a later stage as we should be fine with the monodentate_training.csv and bidentate_training.csv files for now.
4) tmQMg-L ligand IDs
I think it is quite clear as of now what the ligand label refers to. We can change this at a later time if needed.
These changes should all make it easier to read and understand the workflow. RDKit version disclaimer to be transparent about reproducibility issues.