Use OR-Tools to generate a training set that serves as a base for the policy to imitate. OR-Tools is set to solve the problem using the averagistic strategy.
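For reference, here is a minimal sketch of how a job-shop-style instance can be solved with OR-Tools' CP-SAT solver to obtain a schedule for the policy to imitate. The toy instance, variable names, and objective are illustrative; the project's actual generation code and the averagistic handling of durations are not shown.

```python
import collections

from ortools.sat.python import cp_model

# Toy instance: jobs[j] is an ordered list of (machine, duration) tasks.
jobs = [[(0, 3), (1, 2)], [(1, 2), (0, 4)]]
horizon = sum(duration for job in jobs for _, duration in job)

model = cp_model.CpModel()
machine_to_intervals = collections.defaultdict(list)
starts, job_ends = {}, []

for j, job in enumerate(jobs):
    previous_end = None
    for t, (machine, duration) in enumerate(job):
        start = model.NewIntVar(0, horizon, f"start_{j}_{t}")
        end = model.NewIntVar(0, horizon, f"end_{j}_{t}")
        interval = model.NewIntervalVar(start, duration, end, f"interval_{j}_{t}")
        machine_to_intervals[machine].append(interval)
        starts[j, t] = start
        if previous_end is not None:
            model.Add(start >= previous_end)  # tasks of a job run in order
        previous_end = end
    job_ends.append(previous_end)

for intervals in machine_to_intervals.values():
    model.AddNoOverlap(intervals)  # a machine processes one task at a time

makespan = model.NewIntVar(0, horizon, "makespan")
model.AddMaxEquality(makespan, job_ends)
model.Minimize(makespan)

solver = cp_model.CpSolver()
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    # The resulting start times define the schedule used as imitation targets.
    schedule = {key: solver.Value(var) for key, var in starts.items()}
```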
Actor train/eval losses can be followed on Visdom during pretraining.
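As a rough illustration (the window name and call site are assumptions, not the project's actual logging code), losses can be pushed to a running Visdom server like this:

```python
import numpy as np
import visdom

viz = visdom.Visdom()  # assumes `python -m visdom.server` is running on localhost:8097
win = "pretrain_actor_loss"  # arbitrary window name

def log_losses(epoch, train_loss, eval_loss):
    """Append one (train, eval) point to a single Visdom line plot."""
    viz.line(
        X=np.array([[epoch, epoch]]),
        Y=np.array([[train_loss, eval_loss]]),
        win=win,
        update="append",  # recent Visdom versions create the window on the first call
        opts=dict(title="Actor pretraining loss",
                  legend=["train", "eval"],
                  xlabel="epoch", ylabel="loss"),
    )
```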
Able to pretrain on specific instances and then fine-tune the model with PPO on those same instances. To do this, make sure to mark the instances as fixed for both training and validation.
A new training set can be generated "online" at each new epoch or "offline" at the start of the pretraining phase. For fixed problems, the generation should be offline, since the solutions would be the same every time they are solved. Either way, the generation strategy has to be specified in the arguments.
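A sketch of the timing difference between the two strategies; every name here is a placeholder rather than the repository's actual API:

```python
import random

def sample_instance(seed=None):
    """Placeholder instance sampler; fixed problems reuse the same seeds."""
    rng = random.Random(seed)
    return [rng.randint(1, 9) for _ in range(5)]

def solve_with_ortools(instance):
    """Placeholder for the OR-Tools solve that produces the imitation targets."""
    return sorted(instance)

def make_dataset(n_instances, fixed):
    seeds = range(n_instances) if fixed else [None] * n_instances
    return [solve_with_ortools(sample_instance(s)) for s in seeds]

def pretrain(n_epochs, generation="offline", fixed=True):
    if generation == "offline":
        dataset = make_dataset(100, fixed)  # solved once, reused for every epoch
    for epoch in range(n_epochs):
        if generation == "online":
            dataset = make_dataset(100, fixed)  # re-sampled and re-solved each epoch
        # ... run one imitation-learning epoch on `dataset` ...
```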
Best pretrained model (based on its evaluation loss) is saved locally to a file named pretrain.pkl in the training directory path.
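The usual pattern, sketched here with PyTorch and a hypothetical run_dir variable (the project's actual saving code may differ):

```python
import os

import torch

def save_if_best(model, eval_loss, best_loss, run_dir):
    """Overwrite pretrain.pkl in the training directory whenever the eval loss improves."""
    if eval_loss < best_loss:
        torch.save(model.state_dict(), os.path.join(run_dir, "pretrain.pkl"))
        return eval_loss
    return best_loss
```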
Added a --retrain PATH argument that loads the weights of the model pointed to by the PATH value. This is slightly different from the existing --resume, which simply loads the weights of the agent.pkl in the current training directory path.
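A hedged sketch of how the two options could differ; the argparse wiring, the store_true form of --resume, and the placeholder model and run_dir are assumptions, not the project's actual code:

```python
import argparse
import os

import torch

parser = argparse.ArgumentParser()
parser.add_argument("--retrain", type=str, default=None,
                    help="path of a checkpoint to load weights from before training")
parser.add_argument("--resume", action="store_true",
                    help="reload agent.pkl from the current training directory")
args = parser.parse_args()

run_dir = "."                   # placeholder for the current training directory
model = torch.nn.Linear(8, 4)   # placeholder for the real agent network

if args.retrain is not None:
    model.load_state_dict(torch.load(args.retrain))  # any checkpoint path
elif args.resume:
    model.load_state_dict(torch.load(os.path.join(run_dir, "agent.pkl")))
```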
Added weight decay for pretraining and for PPO training. The two corresponding args are --pretrain_weight_decay and --weight_decay.
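For context, both arguments typically end up as the weight_decay parameter of the optimizer; a sketch with Adam, a stand-in network, and arbitrary values:

```python
import torch

net = torch.nn.Linear(8, 4)  # stand-in for the actual actor/critic network

# Hypothetical values: each phase builds its optimizer with its own decay argument.
pretrain_optimizer = torch.optim.Adam(net.parameters(), lr=1e-4,
                                      weight_decay=1e-2)  # from --pretrain_weight_decay
ppo_optimizer = torch.optim.Adam(net.parameters(), lr=1e-4,
                                 weight_decay=1e-3)       # from --weight_decay
```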
Code for training the critic is provided but deactivated when the critic loss coefficient is set to $0$. This part of the pretraining has not been validated and should be tested further to make sure it is useful.
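A sketch of the gating described above; the loss terms and names are illustrative, not the project's actual implementation:

```python
import torch
import torch.nn.functional as F

def pretraining_loss(actor_logits, expert_actions, values, returns, critic_loss_coef=0.0):
    """Imitation loss on the actor, plus an optional value-regression term for the critic."""
    loss = F.cross_entropy(actor_logits, expert_actions)
    if critic_loss_coef != 0.0:
        # The critic branch is only exercised when the coefficient is non-zero.
        loss = loss + critic_loss_coef * F.mse_loss(values, returns)
    return loss
```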
Example of args to test: