In the MOABB benchmarks, the train.py script logs the evaluation metrics for the averaged model (as defined in the hparams) for both the test and validation datasets at the end of the train_log.txt file (and outputs them to the console): https://github.com/speechbrain/benchmarks/blob/ccc0d63a0a3275bd40fc603ccbb962fbfdaff260/benchmarks/MOABB/train.py#L297-L302
However, due to the implementation of sb.Brain and the train_logger used, both sets of metrics are labeled as test metrics. Notice how the two outputs below are distinguishable only by their numeric values and are otherwise identical, despite representing very different things:
...
epoch loaded: 264 - test loss: 1.30, test f1: 5.41e-01, test acc: 5.42e-01, test cm: [[36 8 13 15]
[10 44 10 8]
[ 9 18 34 11]
[ 5 20 5 42]]
epoch loaded: 264 - test loss: 1.97e-01, test f1: 5.44e-01, test acc: 5.54e-01, test cm: [[11 2 0 1]
[ 2 7 2 3]
[ 3 2 5 4]
[ 2 2 2 8]]
This makes it confusing when reviewing the log. Should we fix this, or else add a note about it in the README?
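One possible fix would be to pass the dataset split into the log line so the two evaluations are labeled distinctly. The sketch below is hypothetical and does not use the actual SpeechBrain train_logger API; `format_stats` and its parameters are illustrative names, just to show the kind of output that would remove the ambiguity:

```python
# Hypothetical sketch, NOT the real SpeechBrain API: prefix each metric
# with the split it was computed on ("valid" or "test") instead of
# hard-coding "test" for both evaluations.
def format_stats(stats_meta, stats, split="test"):
    """Render one log line, prefixing every metric with its split name."""
    meta = " - ".join(f"{k}: {v}" for k, v in stats_meta.items())
    body = ", ".join(f"{split} {k}: {v}" for k, v in stats.items())
    return f"{meta} - {body}"

# With an explicit split, the two lines are no longer ambiguous:
print(format_stats({"epoch loaded": 264}, {"loss": "1.30", "f1": "5.41e-01"}, split="valid"))
print(format_stats({"epoch loaded": 264}, {"loss": "1.97e-01", "f1": "5.44e-01"}, split="test"))
```

This would make it immediately clear in train_log.txt which line is the validation result and which is the held-out test result.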