ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Timsort comparison error for specific robots.txt URL #86

Open anjackson opened 1 year ago

anjackson commented 1 year ago

From the domain crawl (DC):

Nov 28, 2022 9:48:29 AM org.archive.modules.CrawlURI getPolitenessDelay
WARNING: politessDelay unset, returning default 5000 for https://www.english.op.org/robots.txt (in thread 'ToeThread #47: https://www.english.op.org/robots.txt')
Nov 28, 2022 9:48:35 AM org.archive.crawler.framework.ToeThread recoverableProblem
SEVERE: Problem java.lang.IllegalArgumentException: Comparison method violates its general contract! occurred when trying to process 'https://www.english.op.org/robots.txt' at step ABOUT_TO_BEGIN_PROCESSOR in 
 (in thread 'ToeThread #498: https://www.english.op.org/robots.txt')
java.lang.IllegalArgumentException: Comparison method violates its general contract!
    at java.util.TimSort.mergeHi(TimSort.java:899)
    at java.util.TimSort.mergeAt(TimSort.java:516)
    at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
    at java.util.TimSort.sort(TimSort.java:254)
    at java.util.Arrays.sort(Arrays.java:1512)
    at java.util.ArrayList.sort(ArrayList.java:1464)
    at java.util.Collections.sort(Collections.java:177)
    at org.apache.http.impl.cookie.RFC6265CookieSpec.formatCookies(RFC6265CookieSpec.java:217)
    at org.apache.http.client.protocol.RequestAddCookies.process(RequestAddCookies.java:187)
    at org.apache.http.protocol.ImmutableHttpProcessor.process(ImmutableHttpProcessor.java:133)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:72)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
    at org.archive.modules.fetcher.FetchHTTPRequest.execute(FetchHTTPRequest.java:823)
    at org.archive.modules.fetcher.FetchHTTP.innerProcess(FetchHTTP.java:679)
    at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
    at org.archive.modules.Processor.process(Processor.java:142)
    at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:147)

...the robots.txt content (as seen in my web browser) appears to be:

# START YOAST BLOCK
# ---------------------------
User-agent: *
Disallow:

Sitemap: https://www.english.op.org/sitemap_index.xml
# ---------------------------
# END YOAST BLOCK

anjackson commented 1 year ago

Also hit this during DC2023 and it made the crawl very unhappy until I blocked the host.

https://www.ramirezmoto.es/robots.txt
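
For anyone triaging this: the trace above shows TimSort rejecting the comparator that `Collections.sort` is given inside `RFC6265CookieSpec.formatCookies`. What that comparator actually does is not shown in this issue, so the following is only a general sketch of one common way a comparator can break the contract (narrowing a `long` difference to `int`), plus a small check for it. The class name, the `BROKEN` and `SAFE` comparators, and the `isAntisymmetric` helper are all hypothetical and are not taken from HttpClient or Heritrix.

```java
import java.util.Comparator;

// Illustrative only: not the HttpClient comparator, just a generic example
// of a comparator that violates the Comparator contract.
class ComparatorContractSketch {

    // Hypothetical broken comparator: the long difference is narrowed to int,
    // so large gaps between values overflow and flip or lose the sign.
    static final Comparator<Long> BROKEN = (a, b) -> (int) (a - b);

    // Safe alternative: Long.compare never overflows.
    static final Comparator<Long> SAFE = Long::compare;

    // The contract requires sgn(compare(a, b)) == -sgn(compare(b, a)).
    static boolean isAntisymmetric(Comparator<Long> c, long a, long b) {
        return Integer.signum(c.compare(a, b)) == -Integer.signum(c.compare(b, a));
    }

    public static void main(String[] args) {
        long a = 0L;
        long b = 1L << 31; // difference does not fit in an int

        System.out.println("BROKEN antisymmetric: " + isAntisymmetric(BROKEN, a, b));
        System.out.println("SAFE antisymmetric:   " + isAntisymmetric(SAFE, a, b));
    }
}
```

With these two values the broken comparator reports each element as smaller than the other, so it fails the check, while the `Long.compare` version passes. Whether anything like this is what happens with the cookies for these hosts would need checking against the HttpClient version bundled in the crawler.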