Clement25 closed this issue 3 months ago
Hi, thank you for your interest in our work! Unfortunately, there is nothing we can do about this due to copyright issues. We did find that the Books domain is critical to the performance of long-context models, but the RP Books domain also contains Project Gutenberg books (PG19). You can try using only the PG19 data, but it's hard to say how much performance would be affected.
Hi, thanks for the awesome work. However, during dataset collection, I found that Books3 has already been removed from RP due to copyright issues. Is there any other possible source for this dataset? I thought it was indispensable and significant for the training process, as you mentioned in your paper.