In the MOABB benchmarks, the train.py script logs the evaluation metrics for the averaged model (as defined in the hparams) for both the test and validation datasets at the end of the train_log.txt file (and outputs them to the console): https://github.com/speechbrain/benchmarks/blob/ccc0d63a0a3275bd40fc603ccbb962fbfdaff260/benchmarks/MOABB/train.py#L297-L302
However, due to the implementation of sb.Brain and the train_logger used, both sets of metrics are labeled as test metrics. Notice how the two outputs below are distinguishable only by their numeric values and are otherwise identical, despite representing very different things:
...
epoch loaded: 264 - test loss: 1.30, test f1: 5.41e-01, test acc: 5.42e-01, test cm: [[36 8 13 15]
[10 44 10 8]
[ 9 18 34 11]
[ 5 20 5 42]]
epoch loaded: 264 - test loss: 1.97e-01, test f1: 5.44e-01, test acc: 5.54e-01, test cm: [[11 2 0 1]
[ 2 7 2 3]
[ 3 2 5 4]
[ 2 2 2 8]]
This makes it confusing when reviewing the log. Should we fix this, or else add a note about it in the README?
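One possible fix would be to pass the dataset split into the log line so the two evaluations are labeled distinctly. The sketch below is hypothetical and does not use the actual SpeechBrain train_logger API; `format_stats` and its parameters are illustrative names, just to show the kind of output that would remove the ambiguity:

```python
# Hypothetical sketch, NOT the real SpeechBrain API: prefix each metric
# with the split it was computed on ("valid" or "test") instead of
# hard-coding "test" for both evaluations.
def format_stats(stats_meta, stats, split="test"):
    """Render one log line, prefixing every metric with its split name."""
    meta = " - ".join(f"{k}: {v}" for k, v in stats_meta.items())
    body = ", ".join(f"{split} {k}: {v}" for k, v in stats.items())
    return f"{meta} - {body}"

# With an explicit split, the two lines are no longer ambiguous:
print(format_stats({"epoch loaded": 264}, {"loss": "1.30", "f1": "5.41e-01"}, split="valid"))
print(format_stats({"epoch loaded": 264}, {"loss": "1.97e-01", "f1": "5.44e-01"}, split="test"))
```

This would make it immediately clear in train_log.txt which line is the validation result and which is the held-out test result.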