Use OR-Tools to generate a training set that serves as a base for the policy to imitate. OR-Tools is set to solve the problem using the averagistic strategy.
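For reference, here is a minimal sketch of how a job-shop-style instance can be solved with OR-Tools' CP-SAT solver to obtain a schedule for the policy to imitate. The toy instance, variable names, and objective are illustrative; the project's actual generation code and the averagistic handling of durations are not shown.

```python
import collections

from ortools.sat.python import cp_model

# Toy instance: jobs[j] is an ordered list of (machine, duration) tasks.
jobs = [[(0, 3), (1, 2)], [(1, 2), (0, 4)]]
horizon = sum(duration for job in jobs for _, duration in job)

model = cp_model.CpModel()
machine_to_intervals = collections.defaultdict(list)
starts, job_ends = {}, []

for j, job in enumerate(jobs):
    previous_end = None
    for t, (machine, duration) in enumerate(job):
        start = model.NewIntVar(0, horizon, f"start_{j}_{t}")
        end = model.NewIntVar(0, horizon, f"end_{j}_{t}")
        interval = model.NewIntervalVar(start, duration, end, f"interval_{j}_{t}")
        machine_to_intervals[machine].append(interval)
        starts[j, t] = start
        if previous_end is not None:
            model.Add(start >= previous_end)  # tasks of a job run in order
        previous_end = end
    job_ends.append(previous_end)

for intervals in machine_to_intervals.values():
    model.AddNoOverlap(intervals)  # a machine processes one task at a time

makespan = model.NewIntVar(0, horizon, "makespan")
model.AddMaxEquality(makespan, job_ends)
model.Minimize(makespan)

solver = cp_model.CpSolver()
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    # The resulting start times define the schedule used as imitation targets.
    schedule = {key: solver.Value(var) for key, var in starts.items()}
```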
Actor train/eval losses can be followed on Visdom during pretraining.
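As a rough illustration (the window name and call site are assumptions, not the project's actual logging code), losses can be pushed to a running Visdom server like this:

```python
import numpy as np
import visdom

viz = visdom.Visdom()  # assumes `python -m visdom.server` is running on localhost:8097
win = "pretrain_actor_loss"  # arbitrary window name

def log_losses(epoch, train_loss, eval_loss):
    """Append one (train, eval) point to a single Visdom line plot."""
    viz.line(
        X=np.array([[epoch, epoch]]),
        Y=np.array([[train_loss, eval_loss]]),
        win=win,
        update="append",  # recent Visdom versions create the window on the first call
        opts=dict(title="Actor pretraining loss",
                  legend=["train", "eval"],
                  xlabel="epoch", ylabel="loss"),
    )
```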
Able to pretrain on specific instances and then fine-tune the model with PPO on those same instances. To do this, make sure to mark the instances as fixed for both training and validation.
A new training set can be generated "online" at each new epoch or "offline" at the start of the pretraining phase. For fixed problems, the generation should be offline, since the solutions would be the same every time they are solved. Either way, the generation strategy has to be specified in the arguments.
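A sketch of the timing difference between the two strategies; every name here is a placeholder rather than the repository's actual API:

```python
import random

def sample_instance(seed=None):
    """Placeholder instance sampler; fixed problems reuse the same seeds."""
    rng = random.Random(seed)
    return [rng.randint(1, 9) for _ in range(5)]

def solve_with_ortools(instance):
    """Placeholder for the OR-Tools solve that produces the imitation targets."""
    return sorted(instance)

def make_dataset(n_instances, fixed):
    seeds = range(n_instances) if fixed else [None] * n_instances
    return [solve_with_ortools(sample_instance(s)) for s in seeds]

def pretrain(n_epochs, generation="offline", fixed=True):
    if generation == "offline":
        dataset = make_dataset(100, fixed)  # solved once, reused for every epoch
    for epoch in range(n_epochs):
        if generation == "online":
            dataset = make_dataset(100, fixed)  # re-sampled and re-solved each epoch
        # ... run one imitation-learning epoch on `dataset` ...
```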
Best pretrained model (based on its evaluation loss) is saved locally to a file named pretrain.pkl in the training directory path.
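The usual pattern, sketched here with PyTorch and a hypothetical run_dir variable (the project's actual saving code may differ):

```python
import os

import torch

def save_if_best(model, eval_loss, best_loss, run_dir):
    """Overwrite pretrain.pkl in the training directory whenever the eval loss improves."""
    if eval_loss < best_loss:
        torch.save(model.state_dict(), os.path.join(run_dir, "pretrain.pkl"))
        return eval_loss
    return best_loss
```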
Added a --retrain PATH argument that loads the weights of the model pointed to by the PATH value. This is slightly different from the existing --resume, which simply loads the weights of the agent.pkl in the current training directory path.
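A hedged sketch of how the two options could differ; the argparse wiring, the store_true form of --resume, and the placeholder model and run_dir are assumptions, not the project's actual code:

```python
import argparse
import os

import torch

parser = argparse.ArgumentParser()
parser.add_argument("--retrain", type=str, default=None,
                    help="path of a checkpoint to load weights from before training")
parser.add_argument("--resume", action="store_true",
                    help="reload agent.pkl from the current training directory")
args = parser.parse_args()

run_dir = "."                   # placeholder for the current training directory
model = torch.nn.Linear(8, 4)   # placeholder for the real agent network

if args.retrain is not None:
    model.load_state_dict(torch.load(args.retrain))  # any checkpoint path
elif args.resume:
    model.load_state_dict(torch.load(os.path.join(run_dir, "agent.pkl")))
```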
Added weight decay for pretraining and for PPO training. The two corresponding args are --pretrain_weight_decay and --weight_decay.
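For context, both arguments typically end up as the weight_decay parameter of the optimizer; a sketch with Adam, a stand-in network, and arbitrary values:

```python
import torch

net = torch.nn.Linear(8, 4)  # stand-in for the actual actor/critic network

# Hypothetical values: each phase builds its optimizer with its own decay argument.
pretrain_optimizer = torch.optim.Adam(net.parameters(), lr=1e-4,
                                      weight_decay=1e-2)  # from --pretrain_weight_decay
ppo_optimizer = torch.optim.Adam(net.parameters(), lr=1e-4,
                                 weight_decay=1e-3)       # from --weight_decay
```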
Code for training the critic is provided but deactivated when the critic loss coefficient is set to $0$. This part of the pretraining has not been validated and should be tested further to make sure it is useful.
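A sketch of the gating described above; the loss terms and names are illustrative, not the project's actual implementation:

```python
import torch
import torch.nn.functional as F

def pretraining_loss(actor_logits, expert_actions, values, returns, critic_loss_coef=0.0):
    """Imitation loss on the actor, plus an optional value-regression term for the critic."""
    loss = F.cross_entropy(actor_logits, expert_actions)
    if critic_loss_coef != 0.0:
        # The critic branch is only exercised when the coefficient is non-zero.
        loss = loss + critic_loss_coef * F.mse_loss(values, returns)
    return loss
```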
Example of args to test: