ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Error indexing WARCs #295

Closed by VictorHarbo 1 year ago

VictorHarbo commented 1 year ago

When indexing WARC files, I encountered this error: Error indexing test_warcs/warcfilename-00000.warc.gz (return code != 0)

The log gave the following output:

Parsing Archive File [1/1]:warcfile.warc.gz
WARN  HashedInputStream - Hashes are not equal for 'https://www.instagram.com/robots.txt'. WARC-header: sha1:ETOSJAUJR7RNMPCNQWBYO3CNCLGBMOOJ, content: sha1:ADLJUKBPVD5C4LURSVVMU2HC4FFRDTK6
Exception in thread "timelimiter_1662110408211" java.lang.UnsatisfiedLinkError: Can't load library: /usr/lib/jvm/java-11-openjdk-amd64/lib/libawt_xawt.so
    at java.base/java.lang.ClassLoader.loadLibrary(ClassLoader.java:2633)
    at java.base/java.lang.Runtime.load0(Runtime.java:768)
    at java.base/java.lang.System.load(System.java:1837)
    at java.base/java.lang.ClassLoader$NativeLibrary.load0(Native Method)
    at java.base/java.lang.ClassLoader$NativeLibrary.load(ClassLoader.java:2445)
    at java.base/java.lang.ClassLoader$NativeLibrary.loadLibrary(ClassLoader.java:2501)
    at java.base/java.lang.ClassLoader.loadLibrary0(ClassLoader.java:2700)
    at java.base/java.lang.ClassLoader.loadLibrary(ClassLoader.java:2651)
    at java.base/java.lang.Runtime.loadLibrary0(Runtime.java:830)
    at java.base/java.lang.System.loadLibrary(System.java:1873)
    at java.desktop/java.awt.Toolkit$3.run(Toolkit.java:1399)
    at java.desktop/java.awt.Toolkit$3.run(Toolkit.java:1397)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.desktop/java.awt.Toolkit.loadLibraries(Toolkit.java:1396)
    at java.desktop/java.awt.Toolkit.<clinit>(Toolkit.java:1429)
    at java.desktop/sun.awt.AppContext$2.run(AppContext.java:282)
    at java.desktop/sun.awt.AppContext$2.run(AppContext.java:271)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.desktop/sun.awt.AppContext.initMainAppContext(AppContext.java:271)
    at java.desktop/sun.awt.AppContext$3.run(AppContext.java:326)
    at java.desktop/sun.awt.AppContext$3.run(AppContext.java:309)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.desktop/sun.awt.AppContext.getAppContext(AppContext.java:308)
    at java.desktop/javax.imageio.spi.IIORegistry.getDefaultInstance(IIORegistry.java:129)
    at java.desktop/javax.imageio.ImageIO.<clinit>(ImageIO.java:66)
    at org.apache.tika.parser.image.ImageParser.parse(ImageParser.java:177)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at uk.bl.wa.analyser.payload.TikaPayloadAnalyser$ParseRunner.run(TikaPayloadAnalyser.java:545)
    at java.base/java.lang.Thread.run(Thread.java:829)

I used Oracle's Java 11.0.16. After changing to OpenJDK, the indexing worked.

anjackson commented 1 year ago

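The issue appears to be that the headless version of the JDK is not enough. In the stack trace above, Tika's ImageParser initialises ImageIO, which in turn tries to load the AWT native library libawt_xawt.so, and that library could not be loaded from the installed JDK. I updated the Quick Start accordingly.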

EDIT: Thanks for letting us know! 😄
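
For anyone hitting the same UnsatisfiedLinkError, a quick preflight check can show whether the JVM in use actually ships the library named in the log. The following is only a minimal sketch and not part of webarchive-discovery: the class name AwtPreflight and its helper methods are hypothetical. It assumes a Linux JDK 9 or later, where native libraries sit under <java.home>/lib, as in the path shown in the stack trace.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of webarchive-discovery or Tika.
final class AwtPreflight {

    // Builds <javaHome>/lib/<platform library file name>.
    // Assumption: JDK 9+ layout on Linux, matching the path in the log above.
    static Path nativeLibPath(String javaHome, String libName) {
        return Paths.get(javaHome, "lib", System.mapLibraryName(libName));
    }

    // True if the native library file exists in the given Java home.
    static boolean hasNativeLib(String javaHome, String libName) {
        return Files.isRegularFile(nativeLibPath(javaHome, libName));
    }

    public static void main(String[] args) {
        // java.home points at the runtime that is executing this check.
        String javaHome = System.getProperty("java.home");
        Path lib = nativeLibPath(javaHome, "awt_xawt");
        if (hasNativeLib(javaHome, "awt_xawt")) {
            System.out.println("Found " + lib);
        } else {
            System.out.println("Missing " + lib
                    + " (this may be a headless-only JDK install)");
        }
    }
}
```

On Java 11 or later this can be run directly with `java AwtPreflight.java`, using the same java binary that runs the indexer. If it reports the library as missing, installing the full (non-headless) JDK or JRE package is the fix suggested in this thread.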