Preprocessing code for TriviaQA dataset

google-research / bigbird

Transformers for Longer Sequences

https://arxiv.org/abs/2007.14062

Apache License 2.0


Preprocessing code for TriviaQA dataset #4

Closed. sjy1203 closed this issue 3 years ago.

sjy1203 commented 3 years ago

Dear authors,

Do you use the same preprocessing code as Longformer on the TriviaQA dataset, for example truncating each document to at most 4096 tokens, the answer string-matching algorithm, and using normalized aliases as training labels?

ppham27 commented 3 years ago

Yes, our preprocessing code is very similar to Longformer's. I did make some improvements to how the string matching is done, though, and instead of truncating I split larger documents into chunks of size 4096.

Our code is implemented as a Beam pipeline with TFDS and can be found here: https://github.com/tensorflow/models/blob/master/official/nlp/projects/triviaqa/dataset.py. Apologies that the documentation is pretty sparse at the moment.
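For readers who want a rough picture of the steps discussed above (chunking long documents, normalizing aliases, and matching answer strings), here is a minimal Python sketch. It is not the code from the linked pipeline. The function names `normalize_answer`, `split_into_chunks`, and `find_answer_spans`, the whitespace tokenization, the non-overlapping chunks, and the `max_answer_tokens` limit are all assumptions made for illustration; the actual implementation may tokenize, chunk, and match differently.

```python
import re
import string
from typing import List, Tuple


def normalize_answer(text: str) -> str:
    """Lowercase, replace punctuation with spaces, drop articles, collapse whitespace.

    Assumption: a simplified normalization in the spirit of common QA
    evaluation scripts, not the exact rule set of any particular pipeline.
    """
    text = text.lower()
    text = "".join(" " if ch in string.punctuation else ch for ch in text)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def split_into_chunks(tokens: List[str], chunk_size: int = 4096) -> List[List[str]]:
    """Split a token list into consecutive, non-overlapping chunks.

    Assumption: no overlap (stride) between chunks; a real pipeline may use one.
    """
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def find_answer_spans(
    tokens: List[str], aliases: List[str], max_answer_tokens: int = 8
) -> List[Tuple[int, int]]:
    """Return (start, end) token spans, end exclusive, whose normalized text
    equals one of the normalized aliases."""
    targets = {normalize_answer(alias) for alias in aliases}
    targets.discard("")  # never match spans that normalize to nothing
    spans = []
    for start in range(len(tokens)):
        last_end = min(start + max_answer_tokens, len(tokens))
        for end in range(start + 1, last_end + 1):
            if normalize_answer(" ".join(tokens[start:end])) in targets:
                spans.append((start, end))
    return spans


# Hypothetical usage: whitespace tokens stand in for real subword tokens.
document_tokens = "Paris is the capital of France".split()
for chunk in split_into_chunks(document_tokens, chunk_size=4096):
    labels = find_answer_spans(chunk, aliases=["paris", "Paris, France"])
```

Each chunk would then become its own training example, with the matched spans as its answer labels. Chunks that contain no match could be dropped or kept as negatives, which is a design choice this sketch leaves open.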