nyu-mll / jiant-v1-legacy

The jiant toolkit for general-purpose text understanding models
MIT License

clear task scorers before evaluating / validating #185

Open jeswan opened 4 years ago

jeswan commented 4 years ago

Issue by W4ngatang Wednesday Jul 18, 2018 at 03:15 GMT Originally opened as https://github.com/nyu-mll/jiant/issues/185


Before validating or evaluating, we should reset each task's scorers to ensure we're validating on just the data from the split. This will likely cause division-by-zero errors we'll need to catch.
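The fix being proposed can be sketched roughly as follows. This is a minimal illustration, not jiant's actual API: the `AccuracyScorer` class and `validate` helper are hypothetical stand-ins for jiant's task scorers, which follow a similar accumulate/`get_metric(reset=...)` pattern.

```python
# Minimal sketch (hypothetical scorer, not jiant's actual API): reset
# accumulated counts before each validation pass so the reported metric
# reflects only the current split, and guard the division so a scorer
# that has seen no batches doesn't crash with ZeroDivisionError.

class AccuracyScorer:
    """Accumulates correct/total counts across batches."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def __call__(self, predictions, labels):
        self.correct += sum(p == l for p, l in zip(predictions, labels))
        self.total += len(labels)

    def get_metric(self, reset=False):
        # Division-by-zero guard: a freshly reset scorer has total == 0.
        accuracy = self.correct / self.total if self.total > 0 else 0.0
        if reset:
            self.reset()
        return accuracy

    def reset(self):
        self.correct = 0
        self.total = 0


def validate(model_predict, batches, scorer):
    # Clear any counts left over from training batches, so the metric
    # is computed on this split's data only.
    scorer.reset()
    for inputs, labels in batches:
        scorer(model_predict(inputs), labels)
    return scorer.get_metric(reset=True)
```

Without the `scorer.reset()` call, counts accumulated during training would leak into the validation number, which is the contamination the issue is about.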

jeswan commented 4 years ago

Comment by iftenney Wednesday Jul 18, 2018 at 03:25 GMT


Oh dear. Wait, we're not doing this already? Does this invalidate results?

jeswan commented 4 years ago

Comment by W4ngatang Wednesday Jul 18, 2018 at 03:33 GMT


I'm pretty sure we do it before validating, but outside the function.


jeswan commented 4 years ago

Comment by iftenney Wednesday Jul 18, 2018 at 03:37 GMT


Hrm, I think we should look into this, at least to rule out a potentially result-invalidating bug.

jeswan commented 4 years ago

Comment by sleepinyourhat Wednesday Jul 18, 2018 at 03:50 GMT


https://xkcd.com/1574/


jeswan commented 4 years ago

Comment by sleepinyourhat Friday Jul 27, 2018 at 16:47 GMT


@W4ngatang - Mind checking on this? I didn't see a problem myself. Should resolve one way or another soon.

jeswan commented 4 years ago

Comment by sleepinyourhat Thursday Aug 30, 2018 at 16:27 GMT


In other experiments, I confirmed that the final validation metrics are correct across a few datasets—I wrote predictions to disk, manually computed the metric, and compared with the reported number.

However, if anything is wrong with the mid-training validation metrics, that'd be a bit harder to catch, and could impact early stopping. @W4ngatang - Mind giving this a quick look?
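The offline sanity check described in the previous comment (write predictions to disk, recompute the metric independently, compare with the reported number) could be sketched like this. The JSON-lines file format and function names here are illustrative assumptions, not jiant's actual output format.

```python
# Hedged sketch of an offline metric sanity check: dump per-example
# predictions during evaluation, then recompute accuracy from the file
# and compare against the number the training loop reported. The file
# format is an assumption for illustration.
import json


def dump_predictions(path, predictions, labels):
    """Write one JSON object per example: {"prediction": ..., "label": ...}."""
    with open(path, "w") as f:
        for p, l in zip(predictions, labels):
            f.write(json.dumps({"prediction": p, "label": l}) + "\n")


def recompute_accuracy(path):
    """Independently recompute accuracy from the dumped predictions."""
    correct = total = 0
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            correct += row["prediction"] == row["label"]
            total += 1
    return correct / total if total else 0.0
```

If the recomputed number matches the reported final-validation metric, the end-of-training path is clean; a mismatch on a mid-training checkpoint would point at the stale-scorer bug affecting early stopping.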