Hi @vateye, thanks for your question. We used the whole document whenever possible. However, some documents contain more tokens than the maximum sequence length allowed by the pre-trained LM, so in those cases we had to truncate them.
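For anyone reading later, here is a minimal sketch of the "keep the whole document if it fits, otherwise truncate" idea. It is not the actual training code: the function name `truncate_document`, the `max_len` parameter, and the choice to keep the first tokens and drop the tail are all assumptions made for illustration, and the document is assumed to be already tokenized into a flat list of token ids.

```python
from typing import List


def truncate_document(token_ids: List[int], max_len: int) -> List[int]:
    """Return the full document if it fits in max_len tokens,
    otherwise keep only the first max_len tokens.

    Hypothetical helper: keeping the head of the document is an
    assumption, the thread does not say which part is dropped.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if len(token_ids) <= max_len:
        # Short enough: the whole document is used unchanged.
        return list(token_ids)
    # Too long for the model: cut at the maximum sequence length.
    return list(token_ids[:max_len])


if __name__ == "__main__":
    short_doc = [101, 7, 8, 9]
    long_doc = list(range(10))
    print(truncate_document(short_doc, max_len=6))  # unchanged
    print(truncate_document(long_doc, max_len=6))   # first 6 tokens only
```

With interleaved image-text documents, a real pipeline would probably also need to make sure that a cut does not separate an image from its placeholder tokens, which this sketch does not handle.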
Hi, I have a question about the pre-training stage for OBELISC. Did you use the whole document for pre-training or just the image and its paired text for training?