ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Warc-Indexer remove port :80 from url/links when normalising. #284

Open thomasegense opened 2 years ago

thomasegense commented 2 years ago

This is an example of a url_norm value in Solr that still contains port 80: url_norm:"http://train-aarhus.dk:80/visbillede.asp?fp=brandnewheavies.jpg"

In this case the URL comes from the ARC (not WARC) header:

Arc Header

http://train-aarhus.dk:80/visbillede.asp?fp=brandnewheavies.jpg 194.239.250.54 20001021042018 text/html 1699

HTTP/1.1 200 OK

Server: Microsoft-IIS/4.0


Port 80 should also be removed when parsing links (a href) on a page. If some links include port 80 and others do not, playback breaks because the URLs cannot be matched against each other.

The same goes for port 443 on https URLs.
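
To make the request concrete, here is a minimal sketch of the kind of default-port stripping I have in mind. The class and method names (DefaultPortStripper, stripDefaultPort) are hypothetical and not part of webarchive-discovery; it only uses java.net.URI from the JDK and is not a patch against the existing normalisation code.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration only; not existing warc-indexer code.
public class DefaultPortStripper {

    /**
     * Returns the URL without its port when the port is the scheme default
     * (80 for http, 443 for https). Any other input is returned unchanged.
     */
    public static String stripDefaultPort(String url) {
        final URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            // Unparseable input: leave it alone rather than guess.
            return url;
        }

        String scheme = uri.getScheme();
        int port = uri.getPort(); // -1 when no port is present
        if (scheme == null || uri.getHost() == null || port == -1) {
            return url;
        }

        boolean isDefaultPort =
                ("http".equalsIgnoreCase(scheme) && port == 80)
                || ("https".equalsIgnoreCase(scheme) && port == 443);
        if (!isDefaultPort) {
            return url;
        }

        // Rebuild from the raw (still percent-encoded) components so that
        // nothing except the port changes.
        StringBuilder sb = new StringBuilder();
        sb.append(scheme).append("://");
        if (uri.getRawUserInfo() != null) {
            sb.append(uri.getRawUserInfo()).append('@');
        }
        sb.append(uri.getHost());
        if (uri.getRawPath() != null) {
            sb.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            sb.append('#').append(uri.getRawFragment());
        }
        return sb.toString();
    }
}
```

Calling DefaultPortStripper.stripDefaultPort on the ARC header URL above would give http://train-aarhus.dk/visbillede.asp?fp=brandnewheavies.jpg, while a URL with a non-default port such as :8080 is left untouched. The same function would need to be applied both to the record URL and to extracted links so the two stay comparable.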