Bare minimum repo - Githubissues

Strandgaard96 commented 4 months ago

I added the relevant data files used to create all the plots in the preprint main text. Also added the functionality to create the training sets and a link to the JT-VAE forked repo. Added conditional optimization results for the monodentate generator. All this should be available to people now and then we can always build on this. Also simplified README.md

tlinjordet commented 4 months ago

Thanks for your efforts @Strandgaard96!

Note: I was not able to run the code as-is with the present instructions. I have investigated how to make the code run and have included solutions (and recommendations where appropriate) in the list of fixes below.

Please make and push the following fixes on the current feature branch+PR, then I will test and review again:

Revert subdirectory and filename changes under datasets back to the planned naming system, instead add bullet points with a brief explanation+pointer about the files for (respectively) training sets, trained models, and generated ligands as needed in datasets/README.md. Also, remember to link to datasets/DETAILS.md (please rename detailed_info.md) from datasets/README.md.
Make sure environment.yml here matches the common environment from the private development repo. Update the latter environment.yml if needed by loading that environment, installing extra packages, and then exporting to file with package versions (but not builds) to the updated environment.yml both here and there. Let us keep the environment name tmcinvdes.
- Note that xyz2mol installation works with conda installation from conda-forge (conda install -c conda-forge xyz2mol), perhaps it would be sufficient to add xyz2mol to the common environment.yml? If not, see how the uxtbpy package is added by the common environment.yml by pointing at the uxtbpy GitHub repository within the environment.yml pip clause. This could even be extended to point to a specific commit of the repository to be pip install'ed. Relying on conda would be preferable, however.
Remember to pre-commit run --all-files until all hooks pass with our common configuration before pushing changes.
Avoid hard-coded file paths in the Python files, especially to missing non-public files. For files under tmcinvdes/ligand_generation:
- select_initial_ligands.py, utils.py, and xyz2mol_functionality.py have hard-coded paths or have placeholders for hard-coded paths that require user edits.
- In utils.py hard-coded paths include one pointing to a local clone of the private repo xyz2mol_tm, specifically the file ligands_dict_xyz.json. Also, paths to both tmQMg-L and tmqmg-l, where the latter might be a modified version of former. Unclear what differences are expected here, but all necessary file paths can be set as function arguments and passed from argparse.input_dir-derived paths when the utility functions are called by running select_initial_ligands.py as __main__.
- As is, select_initial_ligands.py and its local dependencies do not run as described. Possibly something like the following version of the function load_ligand_xyz in ligand_generation/utils.py would work:

def load_ligand_xyz(ligand_xyzs_path):
    # load ligand xyz dict
    xyzs_dict = {}
    with open(ligands_xyzs_path, "r") as f_in:
        for xyz in f_in.read().split("\n\n"):
            key = xyz.split("\n")[1]
            xyzs_dict[key]=xyz
    return xyzs_dict

(continuing:)
- Here ligand_xyzs_path could be derived from argparse.input_dir, assuming the necessary file could be read from the local clone of the tmQMg-L repository.
- If the xyzs file to be parsed into a dictionary is not among the .xyz files in the tmQMg-L repository or easily derivable from those, then ligand_xyzs_path should be given via another argparse flag in select_initial_ligands.py.
- Please make sure to use the filenames and schema as planned in the detailed datasets plan, and format-wise please use the "plot-ready," capitalized and spaced format for column names and dictionary keys when creating these as literal strings. Remember to write the non-training-data (i.e., not .txt files) dataframe to file with df.to_csv(header=True, index=False) as described at the bottom in datasets/DETAILS.md.
- In tmcinvdes/quantum_chemistry/orca_parsers.py please change line 13 from .cm5 import get_cm5 into from cm5 import get_cm5. This makes orca_parsers.py run from its subdirectory as described under Usage in the docstring. Please add such a usage example to tmcinvdes/quantum_chemistry/README.md in the same way as tmcinvdes/ligand_generation/README.md.

For now we can defer making the planned Bash approach work with the Python scripts, but these may later affect some of the import statements to let us run Python scripts tmcinvdes/subdir/script.py by the pattern python -m tmcinvdes.subdir.script -i input_file ....

Strandgaard96 commented 4 months ago

@tlinjordet Appreciate the detailed review!

I cleaned the paths a bit. I still think that hardcoded paths are fine if the user knows what to fill in instead. We just need to get the code out there ASAP.

Fixed most of these things and select_initial_ligands.py should be runnable as the ligand_dict is created on the fly.

I think the naming of the folders is more clear in this way. Keeping the README.md as clean as possible is important. With these names the user can immediately identify where the training sets, models, and generated ligands are, which is better IMO.

The column labels should be mostly self-explanatory. Plot readiness is something we can implement in the next steps and should be low priority. As long as it is clear what a column is then we can always build on that.

I dont see anywhere where i write to .txt files?

Will check a new user install tomorrow morning and then we should merge.

Strandgaard96 commented 4 months ago

Can confirm that installing this version from scratch works if path tmqmg-l is set correct.

uiocompcat / tmcinvdes

Bare minimum repo #3