Closed cwmeijer closed 3 years ago
I'll have a look at the conflicts now, so I'll turn this into a draft again.
All experiment functions now return sensible results that can be useful to the user. These consist of performance metrics for every step. Because the tests usually perform only a single update step with minimal input data, the performance metrics are often trivial; for instance, rank.10 is always 1 because the tests never contain more than 10 instances. I therefore added the training loss to the results, so that they include a measure that is very sensitive to logic/code changes. Because different machines produced slightly different roundings, I had to use an approximate checker instead of an exact one, which is why I chose to include pandas in the test environment.
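To illustrate, here is a minimal sketch of what such an approximate check could look like in a test. The `run_experiment` helper, the metric names, and the expected values are assumptions made for the example; `pandas.testing.assert_frame_equal` with `check_exact=False` is the standard pandas mechanism for tolerant comparisons:

```python
import pandas as pd
import pandas.testing as pdt

def test_single_step_experiment_metrics():
    # Hypothetical helper: runs one update step on minimal input data
    # and returns the per-step performance metrics as a DataFrame.
    result = run_experiment(max_steps=1)

    expected = pd.DataFrame({
        "rank.10": [1.0],     # trivial, since the test data has fewer than 10 instances
        "step_loss": [2.31],  # illustrative value only
    })

    # Compare approximately, because roundings differ slightly between machines.
    pdt.assert_frame_equal(result, expected, check_exact=False, atol=1e-3)
```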
The step_loss is the training loss at that time step, so it is just another performance metric for the current step. It is not the most useful performance metric ever, but it is somewhat informative for a user and, of course, useful for the test. If you don't agree, we can look for other solutions.
No, I agree. I didn't really think about the degenerate scores we get when testing with just one batch. With that in mind, it makes sense to add the last training loss.
Is there a specific reason why you don't add it in asr.py though?
Checking the code again, I think step_loss is a bit misleading. It is actually the mean training loss over the whole epoch, right?
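A small, self-contained sketch of the distinction being raised (the values are made up, purely for illustration):

```python
# Hypothetical per-step training losses collected during one epoch.
losses = [2.31, 1.98, 1.75]

# A true "step loss": the loss of the most recent update step.
last_step_loss = losses[-1]

# What the reported value appears to be: the mean loss over the whole epoch.
mean_epoch_loss = sum(losses) / len(losses)

print(last_step_loss, mean_epoch_loss)  # 1.75 vs. ~2.01
```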
This PR adds an assert statement to every existing test aimed at experiments. The assert checks the result, for instance the loss or the rank. I didn't discuss this kind of assertion with anyone, so it's definitely worth looking at these specific assertions and seeing whether you agree with what I did. I also had to change the experiments and scripts to be able to read the results from the tests.
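As a rough sketch of the kind of change meant here (the function name, parameters, and metric keys are assumptions for the example, not the project's actual API): the experiment entry point returns its metrics instead of only logging them, so the test can assert on them directly.

```python
from typing import Dict

def run_single_experiment(num_steps: int = 1) -> Dict[str, float]:
    """Hypothetical experiment entry point that returns its metrics
    instead of only writing them to a log, so tests can inspect them."""
    # ... a real implementation would train/evaluate here ...
    return {"rank.10": 1.0, "step_loss": 2.31}  # illustrative values

def test_experiment_result():
    result = run_single_experiment(num_steps=1)
    assert result["rank.10"] == 1
    # Approximate check, since roundings differ slightly between machines.
    assert abs(result["step_loss"] - 2.31) < 1e-3
```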