Dataset of article publication dates as they appear on the web

Hi, on behalf of the Automatic Extraction team at Zyte, I'd like to thank the dateparser developers for a great library, and share a dataset of article publication dates as they appear on the web. I think it can be useful for the development of the library. It was collected between Oct 2020 and Mar 2021, has 300k rows, and contains the following columns:

suffix is the TLD (top-level domain), e.g. com or ru, extracted with tldextract.extract(domain).suffix
articleLanguage is the language of the article (article.inLanguage field from https://docs.zyte.com/automatic-extraction/article.html)
webPageLanguages is a space-separated list of web page languages (webPage.inLanguages from https://docs.zyte.com/automatic-extraction/article.html)
datePublishedRaw is the publication date of the article as it appeared on the website (article.datePublishedRaw from https://docs.zyte.com/automatic-extraction/article.html). This column is never empty.

Here are the first few rows:

suffix	articleLanguage	webPageLanguages	datePublishedRaw
eu	en	en lg	October 22, 2020
es	es	es ca	2020-10-09T17:53:37+02:00
fr	fr	fr en	02/10/2020 14:53:07
ie	en	en	2020-10-19T12:54:52Z
com	en	en th	28 Oct 2020 at 13:50
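
To give a feel for how mixed the formats are, here is a minimal sketch that loads the sample rows above and checks which raw dates already parse as ISO 8601 using only the standard library. It assumes the rows are tab-separated as shown here (the delimiter of the full CSV file may differ), and the names `SAMPLE`, `load_rows` and `is_iso8601` are just made up for this illustration.

```python
import csv
import io
from datetime import datetime

# The sample rows from this post, tab-separated as displayed above.
# Assumption: the full CSV file may use a different delimiter.
SAMPLE = (
    "suffix\tarticleLanguage\twebPageLanguages\tdatePublishedRaw\n"
    "eu\ten\ten lg\tOctober 22, 2020\n"
    "es\tes\tes ca\t2020-10-09T17:53:37+02:00\n"
    "fr\tfr\tfr en\t02/10/2020 14:53:07\n"
    "ie\ten\ten\t2020-10-19T12:54:52Z\n"
    "com\ten\ten th\t28 Oct 2020 at 13:50\n"
)


def load_rows(text):
    """Read the rows and split webPageLanguages into a list of codes."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    for row in rows:
        row["webPageLanguages"] = row["webPageLanguages"].split()
    return rows


def is_iso8601(raw):
    """Return True if the raw string parses with datetime.fromisoformat."""
    text = raw.strip()
    # Older Python versions do not accept a trailing "Z", so normalize it.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    try:
        datetime.fromisoformat(text)
    except ValueError:
        return False
    return True


rows = load_rows(SAMPLE)
iso_rows = [r["datePublishedRaw"] for r in rows if is_iso8601(r["datePublishedRaw"])]
other_rows = [r["datePublishedRaw"] for r in rows if not is_iso8601(r["datePublishedRaw"])]

print("ISO 8601:", iso_rows)
print("needs a smarter parser:", other_rows)
```

The strings that end up in `other_rows` are the free-form, locale-dependent ones, and the language columns could serve as hints when parsing them.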

And here is the dataset in full: article_date_sample_Oct_2020_Mar_2021_public.csv.zip

scrapinghub / dateparser

Dataset of article publication dates as they appear on the web #928