nyu-mll / jiant-v1-legacy

The jiant toolkit for general-purpose text understanding models
MIT License
21 stars 9 forks source link

[CLOSED] Update load_tsv to stop if "bad lines" are found #1083

Closed jeswan closed 4 years ago

jeswan commented 4 years ago

Issue by pyeres Thursday Apr 30, 2020 at 17:41 GMT Originally opened as https://github.com/nyu-mll/jiant/pull/1083


This PR is motivated by issue #1082 (please see that issue for important context).

This PR would change the behavior of load_tsv(): if pandas finds a "bad line" in the data passed to load_tsv(), a pandas.errors.ParserError will be thrown and the run will stop.

How was this PR validated/tested? For all tasks where I had task data on hand (10+ tasks including SNLI), on this PR branch (i.e., w/ error_bad_lines=True) I ran code to trigger a call to load_tsv(), and for all tasks except SNLI load_tsv() worked without an exception. For SNLI (where we know there's some inconsistency in the format of the data file) this (expected) exception is raised:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 11 fields in line 13, saw 15

What will this break: this change will cause SNLI task data loading to fail with exception like above until another fix (e.g., an upstream fix to rely on SNLI data without "bad lines") is applied.


pyeres included the following code: https://github.com/nyu-mll/jiant/pull/1083/commits

jeswan commented 4 years ago

Comment by pep8speaks Thursday Apr 30, 2020 at 17:41 GMT


Hello @pyeres! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 124:96: W291 trailing whitespace

You can repair most issues by installing black and running: black -l 100 ./*. If you contribute often, have a look at the 'Contributing' section of the README for instructions on doing this automatically.

Comment last updated at 2020-05-21 18:53:23 UTC
jeswan commented 4 years ago

Comment by sleepinyourhat Thursday Apr 30, 2020 at 18:27 GMT


@W4ngatang - Could you look into the SNLI issue? It sounds like that's specific to the GLUE-downloader version that jiant uses by default, and if that's easy to fix, it's probably better to fix it before we merge this PR.