ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Host links validation #283

Closed tokee closed 2 years ago

tokee commented 2 years ago

The host links extractor was too lenient, accepting things like example.com& and very long entries (see #281). This pull request introduces better validation.
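As a rough illustration of the kind of check described above (not the code in this pull request), a stricter host validator could look like the sketch below. The class name `HostValidator`, the method `isValidHost` and the requirement of at least two labels are assumptions made for the example. The 253 and 63 character limits are the usual DNS name and label limits.

```java
import java.net.IDN;
import java.util.Locale;

/** Hypothetical helper for illustration; not the validation added by this pull request. */
final class HostValidator {
    // DNS limits: 253 characters for the full name, 63 per label.
    private static final int MAX_HOST_LENGTH = 253;
    private static final int MAX_LABEL_LENGTH = 63;

    static boolean isValidHost(String host) {
        if (host == null || host.isEmpty()) {
            return false;
        }
        String ascii;
        try {
            // Internationalised names are converted to their ASCII (punycode) form.
            ascii = IDN.toASCII(host.toLowerCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            // Thrown for input that cannot be converted, e.g. over-long labels.
            return false;
        }
        if (ascii.length() > MAX_HOST_LENGTH) {
            return false;
        }
        // Keep trailing empty strings so "example.com." yields an empty label.
        String[] labels = ascii.split("\\.", -1);
        if (labels.length < 2) {
            return false; // assumption: a host needs at least one dot
        }
        for (String label : labels) {
            if (label.isEmpty() || label.length() > MAX_LABEL_LENGTH) {
                return false;
            }
            if (label.startsWith("-") || label.endsWith("-")) {
                return false;
            }
            for (char c : label.toCharArray()) {
                boolean allowed = (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') || c == '-';
                if (!allowed) {
                    return false; // rejects characters such as '&' or spaces
                }
            }
        }
        return true;
    }
}
```

Under these assumptions `example.com&` is rejected because `&` is not a permitted label character, while an internationalised host such as æblegrød.dk is accepted after conversion to punycode.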

anjackson commented 2 years ago

Well, now all of a sudden the CI is complaining. Perhaps updating the other libraries (JSoup?) caused a behaviour change because now I get:

Error:  Failures: 
Error:    HTMLAnalyserTest.testIllegalDomainHandling:118 The number of links should be correct. Got links: links=[http://example.org/, http://valid.example.com, uuid:123456-1234-1234-12345678, http://example.org/not a link at all, http://example.com&arguments, http://æblegrød.dk] expected:<5> but was:<6>
Error:    HTMLAnalyserTest.testIllegalHostHandling:99 The number of links should be correct. Got links: links=[http://example.org/, http://valid.example.com, uuid:123456-1234-1234-12345678, http://example.org/not a link at all, http://example.com&arguments, http://æblegrød.dk] expected:<5> but was:<6>

anjackson commented 2 years ago

Yes, reverting JSoup resolves the problem. To check which links the older version used to find, I set the expected number of links to 6 instead of 5 and ran the tests against JSoup 1.13.1:

[ERROR] Failures:
[ERROR]   HTMLAnalyserTest.testIllegalDomainHandling:118 The number of links should be correct. Got links: links=[http://example.org/, http://valid.example.com, http://example.org/not a link at all, http://example.com&arguments, http://æblegrød.dk] expected:<6> but was:<5>
[ERROR]   HTMLAnalyserTest.testIllegalHostHandling:99 The number of links should be correct. Got links: links=[http://example.org/, http://valid.example.com, http://example.org/not a link at all, http://example.com&arguments, http://æblegrød.dk] expected:<6> but was:<5>

I.e. the older JSoup was ignoring the UUID link (uuid:123456-1234-1234-12345678), so I think the new behaviour is correct and the tests should be updated to expect 6 links.