opensemanticsearch / open-semantic-etl

Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & Named Entity Recognition) & data enrichment (annotation) pipelines & ingestor to Solr or Elastic search index & linked data graph database
https://opensemanticsearch.org/etl
GNU General Public License v3.0

ETL Web: Parse last modification date from webserver #10

Open opensemanticsearch opened 7 years ago

opensemanticsearch commented 7 years ago

After the upgrade to Python 3 and urllib, there is a problem parsing the last modification date sent by the web server, for example:

Wed, 21 Jun 2017 11:35:20 +0000

The dateutil parser now in use does not seem to be able to parse it.

The old Python 2 library urllib2 was able to parse it and return a structured time via headers.getdate() ...

Is there a library that can handle the different web server timestamp formats? Using time.strptime() with one fixed format string would only cover that single format.
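One candidate worth checking is the standard library itself: email.utils.parsedate_to_datetime() parses RFC 2822 style dates, which covers both the "+0000" and the "GMT" variants seen in these headers. The following is only a minimal sketch; the helper name parse_http_date and the choice to treat a missing zone as UTC are assumptions made for illustration, not part of the ETL code.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_http_date(value: str) -> Optional[datetime]:
    """Parse an RFC 2822 style HTTP date header value.

    Hypothetical helper: returns None if the value cannot be parsed.
    """
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python 3 releases raise TypeError on unparsable input,
        # newer ones raise ValueError, so both are caught here.
        return None
    # A value without usable zone information yields a naive datetime;
    # treating it as UTC is an assumption of this sketch.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


print(parse_http_date("Wed, 21 Jun 2017 11:35:20 +0000"))
print(parse_http_date("Wed, 21 Jun 2017 11:35:20 GMT"))
```

Returning None instead of raising lets the caller decide whether to fall back to another parser or skip the last modification date entirely.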

Mandalka commented 7 years ago

The HTTP header "date" with the format Wed, 21 Jun 2017 11:35:20 GMT can be parsed without problems by the dateutil parser.

So this affects only the HTTP header "last-modified".

Mandalka commented 6 years ago

Will try https://github.com/scrapinghub/dateparser

clamor commented 6 years ago

How about (python 3.6.4):

from datetime import datetime
ddate = datetime.strptime('Wed, 21 Jun 2017 11:35:20 +0000', '%a, %d %b %Y %H:%M:%S %z')

Mandalka commented 4 years ago

The problem is not parsing one special format, but all different possible formats.
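To illustrate the difference, the strptime() approach could be stretched to a list of candidate formats that are tried in order, but it only ever covers the formats someone thought to list. A rough sketch follows, assuming the three date layouts that RFC 7231 allows in HTTP headers plus the "+0000" variant from this issue; the names CANDIDATE_FORMATS and parse_with_formats are hypothetical, and %a and %b depend on the process locale being English or C.

```python
from datetime import datetime, timezone
from typing import Optional, Sequence

# Candidate layouts (an assumption of this sketch, not an exhaustive list).
CANDIDATE_FORMATS = (
    "%a, %d %b %Y %H:%M:%S %z",  # Wed, 21 Jun 2017 11:35:20 +0000
    "%a, %d %b %Y %H:%M:%S %Z",  # Wed, 21 Jun 2017 11:35:20 GMT
    "%A, %d-%b-%y %H:%M:%S %Z",  # Wednesday, 21-Jun-17 11:35:20 GMT
    "%a %b %d %H:%M:%S %Y",      # Wed Jun 21 11:35:20 2017
)


def parse_with_formats(
    value: str, formats: Sequence[str] = CANDIDATE_FORMATS
) -> Optional[datetime]:
    """Try each format in turn; return None if none of them matches."""
    for fmt in formats:
        try:
            parsed = datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue  # this layout did not match, try the next one
        # %Z and zone-less layouts give a naive datetime; HTTP dates are
        # defined in GMT, so UTC is assumed here.
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed
    return None


print(parse_with_formats("Wed, 21 Jun 2017 11:35:20 +0000"))
print(parse_with_formats("Wednesday, 21-Jun-17 11:35:20 GMT"))
```

Any timestamp outside the list comes back as None, which is exactly the limitation described above and the reason a heuristic parser is still of interest.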

It seems this tool could provide good heuristic results or solutions:

https://github.com/adbar/htmldate

opensemanticsearch commented 4 years ago

Another lib to evaluate: https://github.com/akoumjian/datefinder