princeton-nlp / CEPE

[ACL 2024] Long-Context Language Modeling with Parallel Encodings
https://arxiv.org/abs/2402.16617
MIT License

How to obtain Books3 dataset? #5

Closed Clement25 closed 3 months ago

Clement25 commented 3 months ago

Hi, thanks for the awesome work. However, during dataset collection, I found that Books3 in RP has already been removed due to copyright issues. Is there any other possible source for this dataset? I thought it was indispensable and significant for the training process, as you mentioned in your paper.

howard-yen commented 3 months ago

Hi, thank you for your interest in our work! Unfortunately, there is nothing we can do due to copyright issues. We did find that the Books domain is critical to the performance of long-context models, but the RP Books domain actually contains Project Gutenberg books as well (PG19). You can try using only the PG19 data, but it's hard to say how much performance would be affected.