successar / Eraser-Benchmark-Baseline-Models

Baseline for ERASER benchmark
https://www.eraserbenchmark.com

Bug or intended behaviour? Dataset reader ignores document content when no evidence provided #3

Closed: ravenscroftj closed this issue 3 years ago

ravenscroftj commented 3 years ago

I'm trying to run your word_emb_encoder_generator on Lei's original beer dataset, which I've been able to acquire here - Paranjape et al. even provide a script that converts the beer dataset to ERASER JSON format.

I was seeing accuracy while training this model top out at 25%, which seemed odd, so I stepped through the code with a debugger.

The problem is that the rationale_reader dataset reader in your repository does not read the document contents unless the document ID (annotation_id) also appears in the evidences array. The original Lei beer dataset does not provide rationale annotations, so there are no evidences at all.
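For reference, a beer example converted with the Paranjape et al. script ends up looking roughly like this (field names follow the ERASER format; the values are made up for illustration, but the important part is the empty evidences):

```json
{
  "annotation_id": "review_00123",
  "classification": "POS",
  "docids": ["review_00123"],
  "evidences": [],
  "query": null
}
```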

My solution was to add the annotation_id to the list of filtered documents that the reader loads to form the instance, but I'm not sure that's the best solution.

I've made a one-line change which I'd be happy to submit as a pull request, but I wanted to ask whether what I've proposed makes sense before assuming you'd accept it!
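For concreteness, the change is roughly the following (sketching from memory, so the exact variable names inside rationale_reader may differ a little):

```python
# Current behaviour (paraphrased): doc ids are collected only from the evidence
# spans, so an annotation with an empty evidences list never loads any document text.
docids = sorted(set(ev.docid for ev_group in line.evidences for ev in ev_group))

# My one-line workaround: also treat the annotation_id as a doc id. This happens to
# work for the converted beer data because each document is named after its annotation.
docids = sorted(set(ev.docid for ev_group in line.evidences for ev in ev_group) | {line.annotation_id})
```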

successar commented 3 years ago

Hi! Thanks for letting me know. The ERASER format assumes docids are different from annotation ids (since the same annotation may depend on multiple docs, or vice versa). The change you propose (treating annotation id = doc id) is specific to how the beer dataset generation by Paranjape et al. works, so I don't believe it would be useful to make this change here.

ravenscroftj commented 3 years ago

Thanks for the reply - that makes sense and I'm glad I checked in first!

My understanding of ERASER is that you are primarily concerned with evaluating generated rationales against gold-standard rationales. So I'm guessing the ERASER format isn't really intended for samples that have no evidence, only a classification (like the beer dataset)? I'm working on inducing rationales from a new dataset that is labelled only with the downstream task, which is why I'm asking and why I'm messing about with beer.

I notice that the docids array in the jsonl format seems to do what I need, but it is not used directly by the dataset reader, which gets its list of docids from the list comprehension over evidences on line 52 without ever touching line.docids.

Presumably adding line.docids into that set() (so it still de-duplicates) would be a valid thing to do and would work for the other ERASER use cases? Unless I've misunderstood, for all of the official ERASER datasets the docids set is always equal to the set of evidence docids (except when it's null).
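In code, what I have in mind is something like this (again only a sketch, not the exact line from the repo):

```python
# Keep taking doc ids from the evidence spans, but also fold in the docids field
# from the jsonl line when it is present; the set() still handles de-duplication.
docids = sorted(
    set(ev.docid for ev_group in line.evidences for ev in ev_group)
    | set(line.docids or [])
)
```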

If I implement this change, would it be useful to contribute it here, or should I assume my use case is completely out of scope?

Thanks again for the follow up!

successar commented 3 years ago

That would be the ideal case, but I don't believe we filled in the docids for the ERASER datasets that are currently up, so changing the reader to use docids would break reproducibility at this time.

ravenscroftj commented 3 years ago

Ok thanks, I'll make the changes in my fork for now :+1: thanks again