Introduction of a `sample_seed` in addition to a `train_seed` (previously just referred to as `seed`). `train_seed`s indicate the separate models that are trained for each fold, and `sample_seed`s indicate the separate instances of sampling that are done for each such model. Each instance of `structural_prior`/`formula_prior` gets its data from its specific `sample_seed`. See https://raw.githubusercontent.com/vineetbansal/CLM/4c0c762b76fca7ac422014b62da4c1d0d588c3bc/snakemake/dag.png
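To make the seed hierarchy concrete, here is a minimal sketch of how the runs fan out. The function names and the path layout are hypothetical and only illustrate the nesting (fold, then `train_seed`, then `sample_seed`); they are not the actual snakemake rules or output paths.

```python
from itertools import product


def enumerate_sampling_jobs(folds, train_seeds, sample_seeds):
    """Return one (fold, train_seed, sample_seed) tuple per sampling run.

    Each (fold, train_seed) pair corresponds to one trained model, and each
    sample_seed corresponds to one independent sampling run from that model.
    """
    return list(product(folds, train_seeds, sample_seeds))


def prior_input_path(fold, train_seed, sample_seed):
    """Hypothetical path of the sampled data that one prior evaluation reads.

    The directory layout is an assumption made for illustration only.
    """
    return f"fold{fold}/train{train_seed}/sample{sample_seed}/sampled.csv"


jobs = enumerate_sampling_jobs(folds=range(2), train_seeds=[0, 1], sample_seeds=[0, 1, 2])
models = {(fold, train_seed) for fold, train_seed, _ in jobs}
print(f"{len(models)} models, {len(jobs)} sampling runs")
```

Under these assumptions, every `structural_prior`/`formula_prior` instance maps to exactly one tuple in `jobs`, so its input depends on a single `sample_seed`.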
InChI keys are used in the pre-processing step to generate unique SMILES from the input dataset for the train/test dataset generation. InChI keys are used nowhere else currently, so this possibly needs more work.
EDIT: `tabulate_molecules` (tabulate frequencies) uses InChI keys for grouping in Michael's scripts.
EDIT: remove training-set SMILES based on InChI key rather than canonical SMILES.
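The InChI-key handling described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the helper names are hypothetical, and skipping unparseable SMILES is a choice made for the example. RDKit is imported lazily so that the generic helpers can be exercised with any key function.

```python
def inchikey_from_smiles(smiles):
    """Return the InChI key for a SMILES string, or None if it cannot be parsed."""
    # Lazy import: the helpers below also work with a custom key function.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToInchiKey(mol)


def unique_by_key(smiles_list, key_fn=inchikey_from_smiles):
    """Keep the first SMILES seen for each key; skip entries without a key."""
    seen = set()
    unique = []
    for smiles in smiles_list:
        key = key_fn(smiles)
        if key is None or key in seen:
            continue
        seen.add(key)
        unique.append(smiles)
    return unique


def remove_training_set(sampled, training, key_fn=inchikey_from_smiles):
    """Drop sampled SMILES whose key matches any training-set molecule."""
    train_keys = {key_fn(smiles) for smiles in training}
    train_keys.discard(None)
    kept = []
    for smiles in sampled:
        key = key_fn(smiles)
        # Assumption for this sketch: unparseable samples are dropped too.
        if key is not None and key not in train_keys:
            kept.append(smiles)
    return kept
```

Comparing on InChI keys rather than canonical SMILES means two different SMILES strings for the same structure are treated as one molecule, provided they yield the same key.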
Support for `ecfp6` fingerprints in the `write_nn_Tc` command, through a flag. Uses `AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=1024)` for fingerprint generation if enabled, or `Chem.RDKFingerprint(mol)` if disabled. Disabled by default.
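A rough sketch of the fingerprint switch, assuming `write_nn_Tc` reports a nearest-neighbor Tanimoto coefficient as its name suggests. The parameter name `use_ecfp6` and the helper functions are hypothetical; only the two RDKit fingerprint calls come from the description above. A Morgan radius of 3 corresponds to a diameter of 6, hence ECFP6. The similarity is computed on sets of on-bits in plain Python to keep the example easy to follow.

```python
def fingerprint_bits(smiles, use_ecfp6=False):
    """Return the set of on-bits for a molecule, or None if parsing fails.

    use_ecfp6 is a hypothetical name standing in for the new flag.
    """
    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    if use_ecfp6:
        # Morgan fingerprint, radius 3, folded to 1024 bits
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=1024)
    else:
        # Default: RDKit topological fingerprint
        fp = Chem.RDKFingerprint(mol)
    return frozenset(fp.GetOnBits())


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two sets of on-bits."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0
    return len(bits_a & bits_b) / union


def nearest_neighbor_tc(query_bits, reference_bits):
    """Highest Tanimoto coefficient between the query and any reference."""
    return max((tanimoto(query_bits, ref) for ref in reference_bits), default=0.0)
```

For large reference sets, RDKit's own bulk similarity functions on bit vectors would likely be faster than this set-based version.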
Closing this here - I'll open separate PRs for each of the 3 issues addressed here, so we can better keep track of the changes w.r.t. the baseline snakemake output files for our tests.
These changes were suggested by @skinnider.