mahnazkoupaee / WikiHow-Dataset

A Large Scale Text Summarization Dataset

Dataset size #23

Closed DanielRoeder1 closed 11 months ago

DanielRoeder1 commented 12 months ago

I'm asking for clarification about the number of data samples contained in this repository:

The paper states 204,004 samples after filtering; the repository states more than 200k. When using wikihowAll.csv together with the Hugging Face load_dataset (following the instructions and manually downloading & inserting the file), the resulting dataset only has ~168k samples. Where does this discrepancy originate from?
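For reference, a minimal sketch of the loading step being described, assuming the manually downloaded wikihowAll.csv sits in a local directory passed as data_dir (the path here is hypothetical):

```python
# Minimal sketch of the loading step described above; the data_dir path
# is hypothetical. The "wikihow" loader requires wikihowAll.csv to be
# downloaded manually and placed in data_dir.
from datasets import load_dataset

dataset = load_dataset("wikihow", "all", data_dir="path/to/manual/dir")
print(dataset)  # prints the DatasetDict below
```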

DatasetDict({
    train: Dataset({
        features: ['text', 'headline', 'title'],
        num_rows: 157252
    })
    validation: Dataset({
        features: ['text', 'headline', 'title'],
        num_rows: 5599
    })
    test: Dataset({
        features: ['text', 'headline', 'title'],
        num_rows: 5577
    })
})
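(For the totals above: 157,252 + 5,599 + 5,577 = 168,428, i.e. the ~168k samples mentioned.)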

Applying the process.py script found in this repo results in ~180k files, still not the 204k mentioned in the paper.

mahnazkoupaee commented 12 months ago

As for the number of pairs in the dataset: once the data is extracted from the WikiHow knowledge base and the pairs are constructed and preprocessed as stated in the paper, the number of pairs is around 204k.

However, for our experiments, we used a threshold of 3/4 to remove articles whose summaries are longer than 3/4 of the article length. Therefore, the resulting data produced by the process.py script (in the GitHub repo) contains around 180k pairs, and the titles file also contains 180k titles.
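For context, the 3/4 threshold described above amounts to something like the following sketch (column names follow wikihowAll.csv; this is an illustration, not the exact code in process.py):

```python
# Sketch of the 3/4-length filter described above; an illustration,
# not the exact code. See process.py in the repo for the real script.
import pandas as pd

df = pd.read_csv("wikihowAll.csv").astype(str)  # columns: headline, title, text

# Keep a pair only if the summary (headline) is shorter than 3/4 of the article.
mask = df.apply(lambda row: len(row["headline"]) < 0.75 * len(row["text"]), axis=1)
filtered = df[mask]
print(len(filtered))  # around 180k pairs remain
```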