Hi, on behalf of the Automatic Extraction team at Zyte, I'd like to thank the dateparser developers for a great library, and share a dataset of article publication dates as they appear on the web. I think it can be useful for the library's development. It was collected between Oct 2020 and Mar 2021, has 300k rows, and contains the following columns:
suffix is the TLD (top level domain), e.g. com or ru, extracted with tldextract.extract(domain).suffix
articleLanguage is the language of the article (article.inLanguage field from https://docs.zyte.com/automatic-extraction/article.html)
webPageLanguages is a space separated list of web page languages (webPage.inLanguages from https://docs.zyte.com/automatic-extraction/article.html)
datePublishedRaw is the publication date of the article as it appeared on the web-site (article.datePublishedRaw from https://docs.zyte.com/automatic-extraction/article.html) - non-empty.

Here are the first few rows:
And here is the dataset in full: article_date_sample_Oct_2020_Mar_2021_public.csv.zip
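In case it helps anyone get started, here is a rough sketch of how the dataset could be used to find dates that a parser does not handle, grouped by article language. It is only an illustration under a few assumptions: that the unzipped CSV has a header row with the column names listed above, and that an empty articleLanguage is possible. The helper names (load_rows, failures_by_language, dateparser_parse) are made up for this example and are not part of dateparser.

```python
import csv
from collections import Counter


def load_rows(path):
    """Yield dataset rows as dicts keyed by column name.

    Assumes the CSV has a header row with the columns described above.
    """
    with open(path, newline="", encoding="utf-8") as f:
        yield from csv.DictReader(f)


def failures_by_language(rows, parse):
    """Return {language: (failed, total)} for the given parser.

    `parse` is any callable that takes the raw date string and
    returns None when it cannot parse it.
    """
    total = Counter()
    failed = Counter()
    for row in rows:
        # Assumption: rows without a detected language are grouped as "unknown".
        lang = row["articleLanguage"] or "unknown"
        total[lang] += 1
        if parse(row["datePublishedRaw"]) is None:
            failed[lang] += 1
    return {lang: (failed[lang], total[lang]) for lang in total}


def dateparser_parse(raw):
    """Parse with dateparser, letting it detect the language itself."""
    import dateparser  # third-party, imported lazily so the helpers above work without it

    return dateparser.parse(raw)
```

With the archive unzipped, something like `failures_by_language(load_rows(path_to_csv), dateparser_parse)` would give per-language counts of unparsed dates, where path_to_csv is wherever the extracted file ends up. Passing the parser as a callable keeps it easy to compare different settings or library versions on the same rows.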