openweb-ai / thematic

Apache License 2.0

[FEATURE] Downloader for standard benchmark datasets #3

Open marc-chan opened 2 years ago

marc-chan commented 2 years ago

Description

Compile a list of corpora from different domains for evaluating the implemented algorithms against other popular topic modelling techniques. Ideally, each selected corpus should have a date-like field so that temporally aware topic modelling techniques can be evaluated as well. For convenience, then implement a simple downloader that loads the various corpora and transforms them to a standardised format.

marc-chan commented 2 years ago

Benchmark corpus candidate:

PAN @ SemEval 2019 Task 4: Hyperpartisan News Detection Source Data

Format: XML

Datafields:
id: str
published-at: date
title: str
article: str
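
A rough sketch of how this corpus could be mapped to a standardised record, using only the Python standard library. It assumes each document is an `<article>` element carrying `id`, `published-at` and `title` as XML attributes, with the body text inside the element, and that dates are ISO formatted (`YYYY-MM-DD`). That layout should be checked against the actual files. `Document` and `iter_documents` are hypothetical names, not part of thematic.

```python
import io
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import date
from typing import IO, Iterator, Optional, Union


@dataclass
class Document:
    """Hypothetical standardised record; fields mirror the datafields listed above."""
    id: str
    published_at: Optional[date]
    title: str
    article: str


def iter_documents(source: Union[str, IO[bytes]]) -> Iterator[Document]:
    """Yield one Document per <article> element (assumed layout, see lead-in)."""
    # iterparse streams the file, so a large corpus need not fit in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "article":
            continue
        raw_date = elem.get("published-at")
        yield Document(
            id=elem.get("id", ""),
            # Assumes ISO dates; a missing attribute becomes None.
            published_at=date.fromisoformat(raw_date) if raw_date else None,
            title=elem.get("title", ""),
            # Concatenate all nested text and collapse runs of whitespace.
            article=" ".join("".join(elem.itertext()).split()),
        )
        elem.clear()  # release the subtree once it has been converted


# Made-up sample input, for illustration only.
sample = b"""<articles>
  <article id="0000001" published-at="2017-09-10" title="Example title">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </article>
</articles>"""

for doc in iter_documents(io.BytesIO(sample)):
    print(doc)
```

Returning a generator keeps the loader usable on large files. Other candidate corpora could expose the same `Document` shape so that evaluation code does not depend on the source format.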