Closed thomasegense closed 2 years ago
This should be assigned to @tokee
I have attempted to harden the host parsing, but without a clear test case I am not sure whether it solves the problem. Could you verify? https://github.com/netarchivesuite/webarchive-discovery/tree/host_normalize
The links_hosts field does not parse hosts correctly from URLs. One way the bug arises is when parsing URLs of the form domain.com& (that is, a domain followed by &).
But it can also go arbitrarily wrong and result in long clear-text sentences (see #2). I think this happens if the source has
Internal IDs (Danish Netarchive) to see the error: id:"zH2/EFy2zXeAJA2GEiyfBA==/20190321165218" (domain error), id:"aorvbxwwtWV+dXXcIuMRtg==/20180304232040" (clear text)
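
To make the two failure modes above concrete, here is a minimal sketch of a stricter host extraction. It is written in Python for illustration only and is not the code in webarchive-discovery or in the host_normalize branch. The function name `extract_host` and the policy (cut the host at the first character that cannot occur in a hostname, reject anything containing whitespace) are assumptions, not the project's behaviour.

```python
import re
from urllib.parse import urlsplit

# One DNS label: letters, digits and inner hyphens, at most 63 characters.
# Assumption: underscores and IPv6 literals are deliberately not accepted here.
_LABEL = r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?"
_HOST_RE = re.compile(rf"{_LABEL}(?:\.{_LABEL})*")


def extract_host(url):
    """Return a lowercased host for `url`, or None if no plausible host is found.

    Hypothetical helper: illustrates the hardening idea, not the project's code.
    """
    try:
        # .hostname lowercases the host and drops userinfo and port.
        host = urlsplit(url.strip()).hostname
    except ValueError:
        return None
    if not host:
        return None
    # Clear-text sentences end up here as a "host" containing spaces: reject them.
    if any(ch.isspace() for ch in host):
        return None
    # Cut at the first character that cannot be part of a hostname,
    # so "domain.com&" becomes "domain.com".
    host = re.split(r"[^a-z0-9.\-]", host, maxsplit=1)[0].strip(".")
    if len(host) <= 253 and _HOST_RE.fullmatch(host):
        return host
    return None


print(extract_host("http://domain.com&"))
print(extract_host("http://this is clear text/"))
```

Returning None instead of a best-effort string means a document with a malformed link simply gets no entry in links_hosts for that link, which seems preferable to indexing a sentence as a host. Whether truncating at & or dropping the link entirely is the right choice would need to be checked against the two records listed above.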