mosaicml / examples

Fast and flexible reference benchmarks

--concat_tokens flag in BERT pretraining #431

Closed mscherrmann closed 1 year ago

mscherrmann commented 1 year ago

Hi,

would you recommend setting the --concat_tokens flag for BERT pretraining? Did you observe any difference in your experiments?

Furthermore, would you recommend splitting documents with more than 512 tokens into two or more documents instead of truncating them?

Thank you very much in advance!

dakinggg commented 1 year ago

We did not use concat tokens with BERT, and preferred to keep the standard BERT pretraining style of a single document per example: [CLS] doc [SEP].
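For illustration only (not code from this repo), here is a minimal sketch of that single-document format, assuming a Hugging Face tokenizer such as bert-base-uncased:

```python
from transformers import AutoTokenizer

# Assumed tokenizer; any BERT-style tokenizer produces the same special-token layout.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

doc = "A single pretraining document."
encoded = tokenizer(doc, truncation=True, max_length=512)

# Each example is one document wrapped as [CLS] ... [SEP], truncated to 512 tokens.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```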

For splitting vs. truncating, it's a bit of an empirical question and may depend on how you are going to use the model, but here is some general direction. If you aren't data constrained, you may want to randomly sample a 512-token window (ideally aligned to sentence boundaries) rather than always taking the start of a document. If you are data constrained, or if you expect shorter documents at inference time, then try to segment your documents in a reasonable way (using a custom parser built on your understanding of the structure of the documents, or a segmentation model).
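A minimal sketch of the random-window idea, under stated assumptions: a generic BERT tokenizer, a naive regex sentence splitter, and a hypothetical helper `sample_window` that is not part of this repo.

```python
import random
import re

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed tokenizer
MAX_LEN = 512  # window size in tokens, including [CLS]/[SEP]


def sample_window(doc: str, rng: random.Random) -> str:
    """Randomly pick a contiguous, sentence-aligned run that fits in MAX_LEN tokens."""
    # Naive sentence split; swap in a real segmenter for production data.
    sentences = re.split(r"(?<=[.!?])\s+", doc)
    lengths = [len(tokenizer.tokenize(s)) for s in sentences]

    start = rng.randrange(len(sentences))
    budget = MAX_LEN - 2  # reserve room for [CLS] and [SEP]
    end = start
    # Always include the starting sentence; keep adding sentences while they fit.
    while end < len(sentences) and (end == start or budget >= lengths[end]):
        budget -= lengths[end]
        end += 1
    return " ".join(sentences[start:end])


rng = random.Random(0)
window = sample_window("First sentence. Second sentence. Third sentence.", rng)
encoded = tokenizer(window, truncation=True, max_length=MAX_LEN)
```

Truncation in the final tokenizer call is kept as a safety net in case the starting sentence alone exceeds the budget.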

mscherrmann commented 1 year ago

Makes total sense, thank you very much for your help!