scrapinghub / dateparser

python parser for human readable dates
BSD 3-Clause "New" or "Revised" License

Dataset of article publication dates as they appear on the web #928

Open lopuhin opened 3 years ago

lopuhin commented 3 years ago

Hi, on behalf of the Automatic Extraction team at Zyte, I'd like to thank the dateparser developers for a great library and share a dataset of article publication dates as they appear on the web. I think it can be useful for the library's development. It was collected between October 2020 and March 2021, has 300k rows, and contains the following columns: suffix, articleLanguage, webPageLanguages, and datePublishedRaw.

Here are the first few rows:

| suffix | articleLanguage | webPageLanguages | datePublishedRaw |
| --- | --- | --- | --- |
| eu | en | en lg | October 22, 2020 |
| es | es | es ca | 2020-10-09T17:53:37+02:00 |
| fr | fr | fr en | 02/10/2020 14:53:07 |
| ie | en | en | 2020-10-19T12:54:52Z |
| com | en | en th | 28 Oct 2020 at 13:50 |

And here is the dataset in full: article_date_sample_Oct_2020_Mar_2021_public.csv.zip
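As a rough idea of how the data could be used, here is a minimal sketch that counts, per article language, the raw date strings a parser fails to handle. It uses only the five sample rows quoted above, re-encoded as comma-separated text. The real file's delimiter and the separator inside webPageLanguages are assumptions, and `load_rows` and `unparsed_by_language` are hypothetical helper names, not part of dateparser.

```python
import csv
import io
from collections import Counter

# The five sample rows from this issue, re-encoded as CSV for illustration.
# Assumption: the real file may use a different delimiter or quoting.
SAMPLE_CSV = """suffix,articleLanguage,webPageLanguages,datePublishedRaw
eu,en,en lg,"October 22, 2020"
es,es,es ca,2020-10-09T17:53:37+02:00
fr,fr,fr en,02/10/2020 14:53:07
ie,en,en,2020-10-19T12:54:52Z
com,en,en th,28 Oct 2020 at 13:50
"""


def load_rows(text):
    """Read CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def unparsed_by_language(rows, parse):
    """Count rows per articleLanguage for which parse() returns None.

    `parse` is any callable that takes the raw date string and returns
    None on failure, which is how dateparser.parse signals a miss.
    """
    failures = Counter()
    for row in rows:
        if parse(row["datePublishedRaw"]) is None:
            failures[row["articleLanguage"]] += 1
    return failures
```

With dateparser installed, calling `unparsed_by_language(load_rows(SAMPLE_CSV), dateparser.parse)` would give a per-language count of misses. Running the same function over the full file should show which languages and formats need the most attention.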

noviluni commented 3 years ago

Hey @lopuhin, thanks for this! I'm sure we will get some good insights from it!