Closed mscherrmann closed 1 year ago
We did not use concat tokens with BERT, and preferred to keep the normal BERT pretraining style of a single doc: [CLS] doc [SEP].
For splitting vs. truncating, it's a bit of an empirical question and may depend on how you are going to use the model, but here is some general direction. If you aren't data constrained, you may want to randomly sample a 512-token window (ideally sentence-boundary aligned) instead of always taking the start of a document. If you are data constrained, or if you expect shorter docs at inference time, then try to segment your docs in a reasonable way (using a custom parser built on your understanding of the structure of the documents, or a segmenting model).
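To make the random-window idea concrete, here is a minimal sketch of sampling a sentence-boundary-aligned window of up to 512 tokens from a tokenized document. The function name and input format (a list of tokenized sentences) are illustrative assumptions, not part of any particular codebase:

```python
import random

def sample_window(sentences, max_tokens=512):
    """Sample a random contiguous run of sentences fitting in max_tokens.

    `sentences` is a list of tokenized sentences (lists of token ids).
    Illustrative sketch: start at a random sentence instead of always
    taking the beginning of the document, then extend forward while the
    window stays within the token budget.
    """
    if not sentences:
        return []
    # Pick a random starting sentence (sentence-boundary aligned).
    start = random.randrange(len(sentences))
    window, total = [], 0
    for sent in sentences[start:]:
        if total + len(sent) > max_tokens:
            break
        window.extend(sent)
        total += len(sent)
    # If the first sampled sentence alone exceeds the budget,
    # fall back to plain truncation of that sentence.
    if not window:
        window = sentences[start][:max_tokens]
    return window
```

In a real pretraining pipeline you would resample a fresh window each epoch, so long documents contribute different 512-token views over training instead of only their first 512 tokens.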
That makes total sense, thank you very much for your help!
Hi,
would you recommend setting the --concat_tokens flag for BERT pretraining? Did you observe any difference in your experiments?
Furthermore, would you recommend splitting documents with more than 512 tokens into two or more documents instead of using truncation?
Thank you very much in advance!