Open bigbosskai opened 3 years ago
Hi,
When using the public WikiSection dataset, we noticed that some samples contain punctuation, especially '.', in the section title, which causes the data-loading code to preprocess the data incorrectly. We therefore filtered out these "problematic" datapoints before proceeding with our training procedure.
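A minimal sketch of such a filter is below. It assumes the public WikiSection JSON schema, where each article carries an "annotations" list whose entries have a "sectionHeading" field; the function names are hypothetical, and you should adjust the field names if your copy of the dataset differs.

```python
import json

def is_clean(article):
    """Return True if no section heading in the article contains '.'.

    Assumes WikiSection-style dicts with an 'annotations' list of
    {'sectionHeading': ...} entries (an assumption about the schema).
    """
    return not any(
        "." in ann.get("sectionHeading", "")
        for ann in article.get("annotations", [])
    )

def load_clean_articles(path):
    """Load one WikiSection JSON file and drop 'problematic' articles."""
    with open(path, encoding="utf-8") as f:
        return [a for a in json.load(f) if is_clean(a)]
```

With this predicate, the filtered count of dropped articles can be logged before training so the exclusions are reproducible.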
Hello,
Still concerning the WikiSection dataset, as I can't reproduce your results: I can't find the part of your code where you preprocess/filter the data and pass it to the WikipediaDataset class. Since the original WikiSection dataset consists of separate .json files containing the Wikipedia articles, how did you integrate that format into the overall pipeline? Did you first have to rewrite all the articles from WikiSection into the format accepted by the original TextSeg implementation (i.e. each article as a separate file, with ===== lines delimiting sections)?
Hi,
Yeah, exactly. I processed the WikiSection data into the same format as wiki727k so it can be used directly as input to TextSeg. I will upload a processed WikiSection article to this repo later today.
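A rough sketch of such a conversion is below. It assumes the WikiSection JSON schema ('text' holds the full article; each annotation gives 'begin', 'length', and 'sectionHeading') and the wiki727k convention of introducing each section with a '========,<level>,<title>.' line; this is not the author's actual script, and the fixed level of 2 is an assumption.

```python
def to_wiki727k(article):
    """Render one WikiSection article as wiki727k-style plain text.

    Each section is introduced by a '========,<level>,<title>.' line,
    followed by the section body sliced out of the full article text.
    """
    text = article["text"]
    lines = []
    for ann in article["annotations"]:
        heading = ann["sectionHeading"]
        begin, length = ann["begin"], ann["length"]
        lines.append(f"========,2,{heading}.")  # level 2 is an assumption
        lines.append(text[begin:begin + length].strip())
    return "\n".join(lines)
```

Each converted article would then be written to its own file, mirroring the one-file-per-article layout that the original TextSeg implementation expects.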
Linzi
Hello, just wanted to ask whether you have had a chance to look into the dataset, and if you will upload the processed WikiSection article so that we can see what format the input data should be in.
Also, regarding the samples you mentioned above with punctuation (especially '.') in the section title that broke the data-loading preprocessing: is there a way to know which files were filtered out? Thank you.
Hi,
I have uploaded an example processed file in the sample_input folder. Please take a look.
Best,
Linzi
Hello, I hope you can answer a question. How did you arrive at 21,376 as the number of documents in the WikiSection data? Is it a mixture of the en_city and en_disease data? And what is the train/dev/test split of the data?