This PR has several major improvements:
- Adds a `LazyRegressionDataset` and `model_type=regression_lazy`, which is like the existing `regression` / MTR pretraining mode, except that it takes a .smi file and computes RDKit descriptors on-the-fly as part of the dataset's `preprocess()` method.
- Adds a script `compute_norms.py` that takes in a .smi file and returns a .json with the mean and std of the descriptors for use in MTR pretraining.
- Fixes a nasty bug affecting `RawTextDataset` where the model inputs included additional non-SMILES characters. See below.
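
To make the lazy path concrete, here is a minimal sketch of how per-descriptor normalization statistics could be computed once and then applied on-the-fly inside a `preprocess()` call. It is not the code in this PR: `toy_featurize`, `compute_norms_sketch`, and `LazyRegressionSketch` are hypothetical names, the toy featurizer stands in for the RDKit descriptors so the sketch runs without RDKit, and the `{"mean": [...], "std": [...]}` layout and the use of population std are assumptions.

```python
import json
import math
from typing import Callable, Dict, List, Sequence

Featurizer = Callable[[str], List[float]]


def toy_featurize(smiles: str) -> List[float]:
    """Hypothetical stand-in for RDKit descriptors: string length and 'C' count."""
    return [float(len(smiles)), float(smiles.count("C"))]


def compute_norms_sketch(
    smiles_list: Sequence[str], featurize: Featurizer
) -> Dict[str, List[float]]:
    """Return per-descriptor mean and population std (assumed layout)."""
    rows = [featurize(s) for s in smiles_list]
    n = len(rows)
    n_feats = len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(n_feats)]
    std = [
        math.sqrt(sum((r[j] - mean[j]) ** 2 for r in rows) / n)
        for j in range(n_feats)
    ]
    return {"mean": mean, "std": std}


class LazyRegressionSketch:
    """Holds raw SMILES and computes normalized labels only when an item is read."""

    def __init__(
        self,
        smiles_list: Sequence[str],
        norms: Dict[str, List[float]],
        featurize: Featurizer,
    ) -> None:
        self.smiles_list = list(smiles_list)
        self.norms = norms
        self.featurize = featurize

    def preprocess(self, smiles: str) -> Dict[str, object]:
        # Descriptors are computed here, per item, rather than read from a file.
        feats = self.featurize(smiles)
        labels = [
            (x - m) / s if s > 0 else 0.0  # guard against constant descriptors
            for x, m, s in zip(feats, self.norms["mean"], self.norms["std"])
        ]
        return {"smiles": smiles, "labels": labels}

    def __len__(self) -> int:
        return len(self.smiles_list)

    def __getitem__(self, idx: int) -> Dict[str, object]:
        return self.preprocess(self.smiles_list[idx])
```

In this sketch the statistics would be serialized with `json.dumps(compute_norms_sketch(smiles, toy_featurize))` and loaded back before constructing the dataset, which mirrors the two-step flow described above: compute norms once, then normalize lazily per item.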