Closed: tokee closed this 2 years ago
Well, now all of a sudden the CI is complaining. Perhaps updating the other libraries (JSoup?) caused a behaviour change because now I get:
Error: Failures:
Error: HTMLAnalyserTest.testIllegalDomainHandling:118 The number of links should be correct. Got links: links=[http://example.org/, http://valid.example.com,/ uuid:123456-1234-1234-12345678, http://example.org/not a link at all, http://example.com&arguments,/ http://æblegrød.dk] expected:<5> but was:<6>
Error: HTMLAnalyserTest.testIllegalHostHandling:99 The number of links should be correct. Got links: links=[http://example.org/, http://valid.example.com,/ uuid:123456-1234-1234-12345678, http://example.org/not a link at all, http://example.com&arguments,/ http://æblegrød.dk] expected:<5> but was:<6>
Yes, reverting JSoup resolves the problem. To check which links it used to find, I set the expected number to 6 links instead of 5 while running JSoup 1.13.1:
[ERROR] Failures:
[ERROR] HTMLAnalyserTest.testIllegalDomainHandling:118 The number of links should be correct. Got links: links=[http://example.org/, http://valid.example.com, http://example.org/not a link at all, http://example.com&arguments, http://æblegrød.dk] expected:<6> but was:<5>
[ERROR] HTMLAnalyserTest.testIllegalHostHandling:99 The number of links should be correct. Got links: links=[http://example.org/, http://valid.example.com, http://example.org/not a link at all, http://example.com&arguments, http://æblegrød.dk] expected:<6> but was:<5>
I.e. the older JSoup was ignoring the UUID link, so I think the new behaviour is correct and the tests should be updated to match.
The host links extractor was too lenient, accepting things like `example.com&` and very long entries (see #281). This pull request introduces better validation.
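
For illustration, here is a rough sketch of the kind of check that would reject entries like `example.com&` and over-long hosts. It is not the validator from this pull request: the class `HostCheck`, the method `isPlausibleHost`, and the length limits (63 characters per label, 253 in total, following the usual DNS limits) are all assumptions made for this example. It uses only `java.net.IDN` and `java.util.regex` from the standard library.

```java
import java.net.IDN;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not the code introduced by this pull request.
final class HostCheck {
    // One DNS label: letters, digits and hyphens, 1-63 characters,
    // with no hyphen at the start or the end.
    private static final Pattern LABEL =
            Pattern.compile("[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?");

    // Assumed upper bound for the whole host name (usual DNS limit).
    private static final int MAX_HOST_LENGTH = 253;

    static boolean isPlausibleHost(String host) {
        if (host == null || host.isEmpty()) {
            return false;
        }
        String ascii;
        try {
            // Convert internationalised host names to their ASCII (punycode) form.
            ascii = IDN.toASCII(host).toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            // Thrown for input that cannot be converted, e.g. an over-long label.
            return false;
        }
        if (ascii.length() > MAX_HOST_LENGTH) {
            return false;
        }
        // Limit -1 keeps empty labels so they are rejected by the pattern.
        for (String label : ascii.split("\\.", -1)) {
            if (!LABEL.matcher(label).matches()) {
                return false;
            }
        }
        return true;
    }
}
```

With this sketch, `HostCheck.isPlausibleHost("example.com")` is accepted while `HostCheck.isPlausibleHost("example.com&")` is rejected, because `&` is not allowed in a label. Note that it is stricter than the extractor may need to be; for example, it also rejects a trailing dot.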