Closed: lucadiliello closed this issue 1 year ago
Hi! Some users might need this data (for training, or simply to explore or index the dataset).
Feel free to implement a map function that removes these paragraphs, and process the Wikipedia dataset with it before training.
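For anyone landing here, a minimal sketch of such a map function could look like the following. It is only an illustration, not part of the library. It assumes each example has a `text` field, that section headings such as "References" or "See also" sit on their own line, and that everything after the first such heading is trailing boilerplate. The name `strip_junk_sections` and the `JUNK_HEADINGS` list are hypothetical and should be adjusted after inspecting the actual data.

```python
# Hypothetical list of trailing section headings to cut at (an assumption,
# not something defined by the datasets library).
JUNK_HEADINGS = {"references", "see also", "external links"}


def strip_junk_sections(example):
    """Cut the article at the first junk heading and drop 'Category:' lines."""
    kept = []
    for line in example["text"].split("\n"):
        stripped = line.strip()
        if stripped.lower() in JUNK_HEADINGS:
            # Assumption: everything after this heading is boilerplate.
            break
        if stripped.startswith("Category:"):
            # Skip category tags wherever they appear.
            continue
        kept.append(line)
    # Return only the updated column; Dataset.map merges it into the example.
    return {"text": "\n".join(kept).rstrip()}


# Possible usage (the config name "20200501.en" is an assumption):
# from datasets import load_dataset
# wiki = load_dataset("wikipedia", "20200501.en", split="train")
# wiki = wiki.map(strip_junk_sections)
```

Truncating at the first matching heading is a deliberate simplification: it will also drop any real content that follows a "See also" section, so it is worth checking a few articles by hand before training on the result.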
Describe the bug
The English Wikipedia dumps contain many sections such as "References", "Category:" and "See Also" that should not be used for training.
Steps to reproduce the bug
Expected results
I expect no junk in the data.
Actual results
Specify the actual results or traceback.
Environment info
datasets version: 1.10.2