jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.

Why are long sentences removed? #40

Open jowagner opened 3 years ago

jowagner commented 3 years ago

Issue https://github.com/jbrry/Irish-BERT/issues/39#issuecomment-734320495 discovered that the pre-processing pipeline removes lines with more than 100 tokens. Why? Is there a problem feeding very long sentences into BERT?

alanagiasi commented 3 years ago

BERT can take a maximum of 512 input tokens; anything over that is truncated automatically. I'm not familiar with Reformer, Longformer, etc., which can take longer inputs.
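
For concreteness, a minimal sketch of that truncation behaviour, assuming the Hugging Face transformers tokenizer API (the checkpoint name is just an example):

```python
from transformers import AutoTokenizer

# Example checkpoint; any BERT-style model with a 512-position limit behaves the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

long_sentence = " ".join(["focal"] * 600)  # toy 600-word "sentence"

encoded = tokenizer(
    long_sentence,
    truncation=True,   # drop everything past max_length
    max_length=512,    # BERT's positional embedding limit
)
print(len(encoded["input_ids"]))  # 512, including [CLS] and [SEP]
```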

Another approach, by Zellers et al. (2019), doubled the 512 limit to 1024 using the following technique:

Additionally, BERT was trained with a sequence length of at most 512 WordPiece tokens, but generations from Grover are much longer (1024 BPE tokens). Thus, we initialized new position embeddings for positions 513-1024, and performed domain adaptation at a length of 1024 WordPiece tokens.
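
As a rough illustration only (not the exact Grover recipe), extending the position embeddings from 512 to 1024 in PyTorch / transformers might look something like the sketch below; the initialisation of the new positions is an assumption on my part:

```python
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-multilingual-cased")  # example checkpoint

old_emb = model.embeddings.position_embeddings    # nn.Embedding(512, hidden_size)
hidden_size = old_emb.weight.size(1)

new_emb = torch.nn.Embedding(1024, hidden_size)
with torch.no_grad():
    # keep the learned embeddings for positions 0-511 ...
    new_emb.weight[:512] = old_emb.weight
    # ... and give positions 512-1023 a fresh start (initialisation is an assumption;
    # Zellers et al. then do domain adaptation at a sequence length of 1024)
    new_emb.weight[512:] = old_emb.weight + 0.02 * torch.randn_like(old_emb.weight)

model.embeddings.position_embeddings = new_emb
model.config.max_position_embeddings = 1024
# depending on the transformers version, the registered position_ids buffer
# may also need to be re-created to cover 1024 positions
```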

jbrry commented 3 years ago

Issue #39 (comment) discovered that the pre-processing pipeline removes lines with more than 100 tokens. Why?

This value is the same as in the experiment in Aulamo et al. (2020), Section 3.1, where the LengthFilter keeps sentences between 1 and 100 tokens long. But yes, good point: there is probably no need for us to be this strict. Their experiment involved corpus filtering for MT, so perhaps they didn't want very long sentences.
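
For reference, a minimal standalone sketch of what such a word-level length filter amounts to (the real pipeline uses OpusFilter's LengthFilter; the file names and helper function here are illustrative):

```python
def keep_sentence(line: str, min_len: int = 1, max_len: int = 100) -> bool:
    """Word-level length filter: keep sentences whose whitespace token
    count lies in [min_len, max_len], mirroring the 1-100 setting above."""
    n_tokens = len(line.split())
    return min_len <= n_tokens <= max_len

# Hypothetical file names, just to show where the filter sits in a pipeline.
with open("corpus.txt", encoding="utf-8") as src, \
     open("corpus.filtered.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if keep_sentence(line):
            dst.write(line)
```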

As Alan has mentioned, BERT has a maximum of 512 tokens. Perhaps we should use that as the maximum sentence length? It's good to know about the method in Zellers et al. (2019); that step is probably very important for fully utilising the output from Grover. However, given that we do not need to train on the output of another system that produces long sequences of text, and that sentences longer than 512 tokens are probably very rare in our corpora, the extra effort probably isn't justified on our end. I'm happy to use it if someone wants to try it out, though.

jowagner commented 3 years ago

512 tokens usually correspond to a lot more than 512 word pieces. I guess that 100 is a heuristic to make it likely that multiple sentences fit into the BERT input. In the case of multilingual BERT and a target language that is not well represented in the vocabulary, 100 tokens may already be too many for effective training. Furthermore, we have been training with only 1/4 of the 512 sequence length so far, suggesting that this limit should also be lowered. However, going all the way down to 25 may exclude too many sentences and produce a model that is not good at processing longer sentences.
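
One way to check how bad the token-to-word-piece blow-up actually is for Irish would be a quick ratio measurement, sketched here under the assumption of the Hugging Face tokenizer API (the checkpoint name is an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")  # example checkpoint

def pieces_per_token(sentence: str) -> float:
    """Average number of word pieces produced per whitespace token."""
    tokens = sentence.split()
    pieces = tokenizer.tokenize(sentence)
    return len(pieces) / max(len(tokens), 1)

# Averaged over a sample of the Irish corpus, a ratio of ~2 would mean that a
# 100-token sentence already needs ~200 word pieces of the 512 (or 128) budget.
```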

An interesting experiment would be to compare BERT models trained on data pre-processed with different length limits, say 25, 50, 100 and 200. However, as long sentences are not frequent, I don't expect to see much of a performance difference.
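
Before running that experiment, a quick count of how many sentences each limit would discard could be done with something like this sketch (the input file name is hypothetical):

```python
from collections import Counter

limits = [25, 50, 100, 200]
removed = Counter()
total = 0

with open("corpus.txt", encoding="utf-8") as src:  # hypothetical input file
    for line in src:
        n_tokens = len(line.split())
        total += 1
        for limit in limits:
            if n_tokens > limit:
                removed[limit] += 1

for limit in limits:
    share = 100.0 * removed[limit] / max(total, 1)
    print(f"limit {limit}: {removed[limit]} of {total} sentences removed ({share:.2f}%)")
```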

This assumes BERT concatenates as many sentences as needed to reach at least 512 word pieces and then truncates the sequence to 512 word pieces. If the first sentence is already longer than that, it should simply be truncated as needed. Maybe this understanding is wrong and only full sentences are packed into the input, in which case many sequences in a batch would be shorter than 512 word pieces.
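
For illustration, a simplified sketch of the "pack only full sentences up to the budget" interpretation; the actual BERT pre-training data creation is more involved, so this is just to make the assumption concrete:

```python
def pack_sentences(tokenized_sentences, max_pieces=512):
    """Greedily pack whole word-piece-tokenized sentences into sequences of at
    most max_pieces pieces; a single over-long sentence is truncated on its own."""
    sequences, current = [], []
    for pieces in tokenized_sentences:
        pieces = pieces[:max_pieces]                 # truncate an over-long sentence
        if current and len(current) + len(pieces) > max_pieces:
            sequences.append(current)                # flush: the next sentence would not fit whole
            current = []
        current.extend(pieces)
    if current:
        sequences.append(current)
    return sequences
```

Under this interpretation, sequences are flushed as soon as the next full sentence would not fit, so many of them end up shorter than 512 word pieces.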