The train_trainsformer.py (hyperdrive) script evaluates the model on F1 score, but AutoML does not currently support this metric. A quick fix is to change the hyperdrive evaluation metric to AUC.
No evaluation on the test set is possible at this time; a subsequent step is needed that evaluates the model against the test set.
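
The missing test-set step could look roughly like the sketch below, which computes AUC from the model's predicted scores on held-out data. It is only an illustration under assumptions: the function names `roc_auc` and `evaluate_on_test` and the `"AUC"` result key are hypothetical and not taken from the existing scripts, and the task is assumed to be binary classification. AUC is implemented here in plain Python (rank-sum formulation) so the snippet runs without extra dependencies; in the real script a library call such as `sklearn.metrics.roc_auc_score` would likely be used instead.

```python
from typing import Dict, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Binary ROC AUC via the rank-sum (Mann-Whitney U) formulation.

    labels: 1 for the positive class, 0 for the negative class.
    scores: predicted probability (or score) of the positive class.
    Tied scores receive the average of the ranks they span.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined unless both classes are present")

    # Sort sample indices by score, then assign 1-based ranks with tie averaging.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # U statistic of the positive class, normalised by the number of pos/neg pairs.
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = pos_rank_sum - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


def evaluate_on_test(labels: Sequence[int], scores: Sequence[float]) -> Dict[str, float]:
    """Hypothetical post-training step: score the model on the held-out test set."""
    return {"AUC": roc_auc(labels, scores)}


if __name__ == "__main__":
    # Toy data standing in for test-set labels and model scores.
    print(evaluate_on_test([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Unlike F1, AUC is computed from the predicted scores rather than thresholded class labels, so the evaluation step needs the model's probabilities for the positive class.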