Open bigbosskai opened 3 years ago
Hi,
When using the public WikiSection dataset, we noticed that some samples contain punctuation, especially '.', in the section title, which causes the data-loading code to preprocess the data incorrectly. We therefore filtered out these "problematic" datapoints before proceeding with our training procedure.
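A minimal sketch of such a filter is below. It assumes the public WikiSection JSON schema, where each article carries an "annotations" list whose entries have a "sectionHeading" field; the function names are hypothetical, and you should adjust the field names if your copy of the dataset differs.

```python
import json

def is_clean(article):
    """Return True if no section heading in the article contains '.'.

    Assumes WikiSection-style dicts with an 'annotations' list of
    {'sectionHeading': ...} entries (an assumption about the schema).
    """
    return not any(
        "." in ann.get("sectionHeading", "")
        for ann in article.get("annotations", [])
    )

def load_clean_articles(path):
    """Load one WikiSection JSON file and drop 'problematic' articles."""
    with open(path, encoding="utf-8") as f:
        return [a for a in json.load(f) if is_clean(a)]
```

With this predicate, the filtered count of dropped articles can be logged before training so the exclusions are reproducible.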
Hello,
Still concerning the WikiSection dataset, as I can't reproduce your results: I can't find the part of your code where you preprocess/filter the data and pass it to the WikipediaDataset class. Since the original WikiSection dataset consists of separate .json files containing the Wikipedia articles, how did you integrate that format into the overall pipeline? Did you first have to rewrite all the articles from WikiSection into the format accepted by the original TextSeg implementation (i.e. each article as a separate file, with ===== lines delimiting sections)?
Hi,
Yeah, exactly. I processed the WikiSection data into the same format as wiki727k so it can be used directly as input to TextSeg. I will upload a processed WikiSection article to this repo later today.
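A rough sketch of such a conversion is below. It assumes the WikiSection JSON schema ('text' holds the full article; each annotation gives 'begin', 'length', and 'sectionHeading') and the wiki727k convention of introducing each section with a '========,<level>,<title>.' line; this is not the author's actual script, and the fixed level of 2 is an assumption.

```python
def to_wiki727k(article):
    """Render one WikiSection article as wiki727k-style plain text.

    Each section is introduced by a '========,<level>,<title>.' line,
    followed by the section body sliced out of the full article text.
    """
    text = article["text"]
    lines = []
    for ann in article["annotations"]:
        heading = ann["sectionHeading"]
        begin, length = ann["begin"], ann["length"]
        lines.append(f"========,2,{heading}.")  # level 2 is an assumption
        lines.append(text[begin:begin + length].strip())
    return "\n".join(lines)
```

Each converted article would then be written to its own file, mirroring the one-file-per-article layout that the original TextSeg implementation expects.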
Linzi
Hello, just wanted to ask whether you have had a chance to look into the dataset, and if you will upload the processed WikiSection article so that we can see what format the input data should be in.
Also, regarding the samples you mentioned above with punctuation (especially '.') in the section title that broke the data-loading preprocessing: is there a way to know which files were filtered out? Thank you.
Hi,
I have uploaded an example processed file in the sample_input folder. Please take a look.
Best,
Linzi
Hello, I hope you can answer a question. How did you arrive at 21,376 as the number of documents in the WikiSection data? Is it a mixture of the en_city and en_disease data? And what is the train/dev/test split of the data?