speechbrain / benchmarks

This repository contains the SpeechBrain Benchmarks

MOABB - train.py logs averaged model validation metrics as test metrics #24

Closed Drew-Wagner closed 5 months ago

Drew-Wagner commented 6 months ago

In the MOABB benchmarks, the train.py script evaluates the averaged model (as defined in the hparams) on both the validation and test datasets and appends the resulting metrics to the end of train_log.txt (and prints them to the console).

https://github.com/speechbrain/benchmarks/blob/ccc0d63a0a3275bd40fc603ccbb962fbfdaff260/benchmarks/MOABB/train.py#L297-L302

However, due to the implementation of sb.Brain and the train_logger used, both sets of metrics are labelled as test metrics (a sketch of the mechanism follows the excerpt). Notice how the two outputs below are distinguishable only by their numbers and are otherwise identical, despite representing very different things:

...
epoch loaded: 264 - test loss: 1.30, test f1: 5.41e-01, test acc: 5.42e-01, test cm: [[36  8 13 15]
 [10 44 10  8]
 [ 9 18 34 11]
 [ 5 20  5 42]]
epoch loaded: 264 - test loss: 1.97e-01, test f1: 5.44e-01, test acc: 5.54e-01, test cm: [[11  2  0  1]
 [ 2  7  2  3]
 [ 3  2  5  4]
 [ 2  2  2  8]]
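For context, the relabelling comes from how sb.Brain.evaluate drives the stage hooks. The sketch below is illustrative (class and attribute names are placeholders, not the exact MOABB code), but it shows the mechanism: evaluate() always runs on_stage_end with sb.Stage.TEST, regardless of which dataset it was given, so the same test_stats branch of the logger fires for the validation pass and the test pass alike.

```python
# Illustrative sketch of the mechanism (placeholder names, not the exact
# MOABB train.py code): sb.Brain.evaluate() always calls on_stage_end()
# with sb.Stage.TEST, so both passes hit the test_stats branch below and
# get the "test" prefix in train_log.txt.
import speechbrain as sb


class MOABBBrain(sb.Brain):
    def on_stage_end(self, stage, stage_loss, epoch=None):
        if stage == sb.Stage.TEST:
            # Reached for BOTH evaluate() calls mentioned below; the
            # FileTrainLogger then prints
            # "epoch loaded: ... - test loss: ..., test f1: ..." each time.
            self.hparams.train_logger.log_stats(
                stats_meta={"epoch loaded": self.hparams.epoch_counter.current},
                test_stats={"loss": stage_loss},
            )


# After training, the checkpoint-averaged model is scored on both splits:
#   brain.evaluate(valid_set)  # validation metrics, but logged as "test ..."
#   brain.evaluate(test_set)   # actual test metrics, logged identically
```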

This makes the log confusing to review. Should we fix this, or add a note about it in the README?
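If the decision is to fix it in train.py rather than just documenting it, one lightweight option (only a sketch, with an assumed attribute name, not a committed patch) would be to tell on_stage_end which split is being evaluated and route the validation pass through valid_stats=, so the FileTrainLogger prefixes that line with "valid" instead of "test":

```python
# Sketch of one possible fix (the "evaluating_valid_set" attribute is an
# assumed name, not existing code): flag which split evaluate() is running
# on, and log the validation pass via valid_stats= to get a "valid" prefix.
import speechbrain as sb


class MOABBBrain(sb.Brain):
    def on_stage_end(self, stage, stage_loss, epoch=None):
        if stage == sb.Stage.TEST:
            meta = {"epoch loaded": self.hparams.epoch_counter.current}
            stats = {"loss": stage_loss}
            if getattr(self, "evaluating_valid_set", False):
                # Averaged model scored on the validation split -> "valid" prefix.
                self.hparams.train_logger.log_stats(stats_meta=meta, valid_stats=stats)
            else:
                # Averaged model scored on the test split -> "test" prefix.
                self.hparams.train_logger.log_stats(stats_meta=meta, test_stats=stats)


# Hypothetical usage after training:
#   brain.evaluating_valid_set = True
#   brain.evaluate(valid_set)
#   brain.evaluating_valid_set = False
#   brain.evaluate(test_set)
```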