This is unfortunately a mess of changes, but this is what allowed us to get the 16k fingerprint model running. I've tried to summarize here:
Improve the sweep setup, though I haven't used it yet
Figure out an optimal setup for training throughput and performance: A10G GPU on lightning.ai, compute fingerprints in the first epoch and cache them to disk, prefetch 3 batches, and use LR decay on plateau with a factor of 0.1-0.25 and a patience of 5 epochs.
Fix some bugs in the training script, including jsonifying metrics for serialization
Add solvent, agent and overall accuracy metrics for validation.
Make sure train_fraction is applied to the training set only
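
For reviewers, here is a rough sketch of the idea behind the last two fixes (metric jsonification and train_fraction). It is not the code in this PR: `jsonify_metrics`, `subsample_train`, the `seed` argument, and the assumption that array-like metric values expose `.tolist()` are all hypothetical and only meant to illustrate the intent.

```python
import json
import random


def jsonify_metrics(metrics):
    """Return a copy of `metrics` whose values json.dumps can serialize.

    Assumption: non-serializable values (e.g. NumPy scalars or arrays)
    expose a .tolist() method; anything else is passed through unchanged.
    """
    clean = {}
    for name, value in metrics.items():
        if hasattr(value, "tolist"):
            value = value.tolist()
        clean[name] = value
    return clean


def subsample_train(train_indices, train_fraction, seed=0):
    """Keep a random fraction of the training indices only.

    Validation and test indices never pass through this function, so
    their size does not depend on train_fraction.
    """
    if not 0.0 < train_fraction <= 1.0:
        raise ValueError("train_fraction must be in (0, 1]")
    n_keep = max(1, int(len(train_indices) * train_fraction))
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    return sorted(rng.sample(list(train_indices), n_keep))
```

The point of the second helper is that the split into train/validation/test happens first and the fraction is taken afterwards, so evaluation sets stay the same across runs with different train_fraction values.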