mahnazkoupaee / WikiHow-Dataset

A Large Scale Text Summarization Dataset

Dataset size #23

Closed DanielRoeder1 closed 11 months ago

DanielRoeder1 commented 12 months ago

I'm asking for clarification about the number of data samples contained in this repository:

The paper states 204,004 samples after filtering; the repository states more than 200k. When using wikihowAll.csv together with the Hugging Face load_dataset (following the instructions and manually downloading & inserting the file), the resulting dataset only has ~168k samples. Where does this discrepancy originate from?
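For reference, a minimal sketch of the loading step being described, assuming the manually downloaded wikihowAll.csv sits in a local directory passed as data_dir (the path here is hypothetical):

```python
# Minimal sketch of the loading step described above; the data_dir path
# is hypothetical. The "wikihow" loader requires wikihowAll.csv to be
# downloaded manually and placed in data_dir.
from datasets import load_dataset

dataset = load_dataset("wikihow", "all", data_dir="path/to/manual/dir")
print(dataset)  # prints the DatasetDict below
```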

DatasetDict({
    train: Dataset({
        features: ['text', 'headline', 'title'],
        num_rows: 157252
    })
    validation: Dataset({
        features: ['text', 'headline', 'title'],
        num_rows: 5599
    })
    test: Dataset({
        features: ['text', 'headline', 'title'],
        num_rows: 5577
    })
})
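(For the totals above: 157,252 + 5,599 + 5,577 = 168,428, i.e. the ~168k samples mentioned.)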

Applying the process.py script found in this repo results in ~180k files, still not the 204k mentioned in the paper.

mahnazkoupaee commented 12 months ago

As for the number of pairs in the dataset: once the data is extracted from the WikiHow knowledge base and the pairs are constructed and preprocessed as stated in the paper, the number of pairs is around 204k.

However, for our experiments, we used a threshold of 3/4 to remove articles whose summaries are longer than 3/4 of the article length. Therefore, the resulting data produced by the process.py script (in the GitHub repo) contains around 180k pairs, and the titles file also contains 180k titles.
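For context, the 3/4 threshold described above amounts to something like the following sketch (column names follow wikihowAll.csv; this is an illustration, not the exact code in process.py):

```python
# Sketch of the 3/4-length filter described above; an illustration,
# not the exact code. See process.py in the repo for the real script.
import pandas as pd

df = pd.read_csv("wikihowAll.csv").astype(str)  # columns: headline, title, text

# Keep a pair only if the summary (headline) is shorter than 3/4 of the article.
mask = df.apply(lambda row: len(row["headline"]) < 0.75 * len(row["text"]), axis=1)
filtered = df[mask]
print(len(filtered))  # around 180k pairs remain
```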