ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Links_hosts field normalize error #281

Closed thomasegense closed 2 years ago

thomasegense commented 2 years ago

The links_hosts field does not parse hosts correctly from URLs. One way the bug arises is when parsing URLs of the format domain.com& (that is, a domain followed by &).

But it can also go arbitrarily wrong and result in long clear-text sentences (see #2). I think this happens if the source has

Internal IDs (Danish Netarchive) that show the error: id:"zH2/EFy2zXeAJA2GEiyfBA==/20190321165218" (domain error) and id:"aorvbxwwtWV+dXXcIuMRtg==/20180304232040" (clear text)
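
To make the two failure modes concrete, here is a minimal sketch of a defensive host extraction. It is not the project's actual normalisation code: the class `HostSketch` and the method `extractHost` are hypothetical names used only for illustration, and the rules (cut at the first character that cannot be part of a host name, reject anything without a dot) are assumptions about what reasonable behaviour might be. It handles ASCII host names only and makes no attempt at IDN or IPv6 literals.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not taken from webarchive-discovery.
final class HostSketch {
    // Optional scheme, optional userinfo, then the authority up to the first '/', '?', '#' or ':'.
    private static final Pattern AUTHORITY =
            Pattern.compile("^(?:[a-zA-Z][a-zA-Z0-9+.-]*://)?(?:[^/?#@]*@)?([^/?#:]*)");

    // Conservative host shape: dot-separated labels of ASCII letters, digits and hyphens.
    private static final Pattern VALID_HOST =
            Pattern.compile("^[a-z0-9-]+(?:\\.[a-z0-9-]+)*$");

    /** Returns a lowercased host, or null if no plausible host can be extracted. */
    static String extractHost(String url) {
        if (url == null) {
            return null;
        }
        Matcher m = AUTHORITY.matcher(url.trim());
        if (!m.find()) {
            return null;
        }
        String host = m.group(1).toLowerCase(Locale.ROOT);
        // Cut at the first character that cannot occur in a host name,
        // e.g. the trailing '&' in "domain.com&" or a space in free text.
        host = host.replaceFirst("(?s)[^a-z0-9.-].*$", "");
        // Assumption: a single label without a dot is treated as "not a host",
        // so that ordinary sentences do not end up in the field.
        if (!host.contains(".") || !VALID_HOST.matcher(host).matches()) {
            return null;
        }
        return host;
    }
}
```

With these rules, `HostSketch.extractHost("http://domain.com&")` yields `domain.com`, while a plain sentence such as `"This is a long clear text sentence"` yields `null` instead of being indexed as a host. Dropping dotless hosts also discards names like `localhost`, which may or may not be acceptable for a given index.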

thomasegense commented 2 years ago

This should be assigned to @tokee

tokee commented 2 years ago

I have attempted a hardening of the host part, but without a clear test case I am not sure whether it solves the problem. Could you verify https://github.com/netarchivesuite/webarchive-discovery/tree/host_normalize ?