sul-dlss / was_robot_suite

Robots for Web Archiving Service accessioning and dissemination

cdx-generator : Error in extracting CDX with command: jar/openwayback/bin/cdx-indexer #214

Open andrewjbtw opened 4 years ago

andrewjbtw commented 4 years ago

79 web archive crawl objects have errors in the wasCrawlDisseminationWF of the form:

cdx-generator : Error in extracting CDX with command: jar/openwayback/bin/cdx-indexer /web-archiving-stacks/data/collections/kh149kf8484/bj/330/fg/0526/CDL-20100320000124-00037-oriole.ucop.edu-00131663.arc.gz /web-archiving-stacks/data/indices/cdx_working//druid:bj330fg0526/CDL-20100320000124-00037-oriole.ucop.edu-00131663.cdx 2>> log/cdx_indexer.log pid 959 exit 1

Some of these errors date back at least 4 years. I tried resetting the step for some items but they return to an error state.

Link to all current items with errors: https://argo.stanford.edu/catalog?f%5Bwf_wps_ssim%5D%5B%5D=wasCrawlDisseminationWF%3Acdx-generator%3Aerror&per_page=100

Link to the one item with this error that was accessioned within the past year: https://argo.stanford.edu/view/druid:cv292vs5727
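For triage across all 79 items, the source file path can be pulled out of each error message so the file can be tested directly. This is a minimal sketch that assumes every message follows the form quoted above; `parse_cdx_error` and its pattern are hypothetical helpers, not part of was_robot_suite.

```python
import re

# Pattern inferred from the one error message quoted in this issue.
# Assumption: other failing items report the command in the same shape.
ERROR_RE = re.compile(
    r"cdx-indexer\s+(?P<source>\S+)\s+(?P<cdx>\S+)\s+2>>\s+(?P<log>\S+)"
    r"\s+pid\s+(?P<pid>\d+)\s+exit\s+(?P<exit>\d+)"
)


def parse_cdx_error(message):
    """Return source path, CDX path and exit code, or None if no match."""
    match = ERROR_RE.search(message)
    if match is None:
        return None
    return {
        "source": match.group("source"),
        "cdx": match.group("cdx"),
        "exit": int(match.group("exit")),
    }


example_message = (
    "cdx-generator : Error in extracting CDX with command: "
    "jar/openwayback/bin/cdx-indexer "
    "/web-archiving-stacks/data/collections/kh149kf8484/bj/330/fg/0526/"
    "CDL-20100320000124-00037-oriole.ucop.edu-00131663.arc.gz "
    "/web-archiving-stacks/data/indices/cdx_working//druid:bj330fg0526/"
    "CDL-20100320000124-00037-oriole.ucop.edu-00131663.cdx "
    "2>> log/cdx_indexer.log pid 959 exit 1"
)

if __name__ == "__main__":
    print(parse_cdx_error(example_message))
```

The extracted `source` path is the file to check for corruption; the exit code only says that the indexer failed, not why.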

jcoyne commented 4 years ago

In the log I see lots of similar Java stacktraces:

java.io.IOException: Failed parse of http status line.
        at org.archive.io.RecoverableIOException.<init>(RecoverableIOException.java:36)
        at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptWARCHTTPResponse(WARCRecordToSearchResultAdapter.java:294)
        at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptInner(WARCRecordToSearchResultAdapter.java:114)
        at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:79)
        at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:53)
        at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:57)
        at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
        at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:216)
java.io.IOException: Failed parse of http status line.
        at org.archive.io.RecoverableIOException.<init>(RecoverableIOException.java:36)
        at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptWARCHTTPResponse(WARCRecordToSearchResultAdapter.java:294)
        at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptInner(WARCRecordToSearchResultAdapter.java:114)
        at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:79)

Does anyone know what that means?
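To see which exceptions dominate the log, the trace headers can be tallied. This is a rough sketch, assuming the log looks like the excerpts in this thread; `tally_exceptions` is a hypothetical helper and the regex only covers header lines of the shapes shown here.

```python
import re
from collections import Counter

# Matches the first line of a Java stack trace, optionally prefixed by
# "Caused by: " or 'Exception in thread "main" '. Indented "at ..." frames
# do not match because the pattern is anchored at the start of the line.
EXCEPTION_RE = re.compile(
    r'^(?:Caused by: |Exception in thread "[^"]*" )?'
    r"((?:[a-z_]\w*\.)+[A-Z]\w*(?:Exception|Error)): ?(.*)$"
)


def tally_exceptions(lines):
    """Count (exception class, message) pairs found in log lines."""
    counts = Counter()
    for line in lines:
        match = EXCEPTION_RE.match(line.rstrip("\n"))
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


sample_lines = [
    "java.io.IOException: Failed parse of http status line.",
    "        at org.archive.io.RecoverableIOException.<init>(RecoverableIOException.java:36)",
    "java.io.IOException: Failed parse of http status line.",
    "java.util.zip.ZipException: invalid block type",
    'Exception in thread "main" java.lang.RuntimeException: After retry (Offset 471946099)',
    "Caused by: java.util.zip.ZipException: invalid block type",
]

if __name__ == "__main__":
    for (name, text), count in tally_exceptions(sample_lines).most_common():
        print(count, name, text)
```

To run it against the real log, pass an open file handle for log/cdx_indexer.log instead of `sample_lines`.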

aaron-collier commented 4 years ago

Looks like a known (and old) issue: https://github.com/iipc/openwayback/issues/14

aaron-collier commented 4 years ago

Might be related (maybe not, just capturing it here). When I run the command manually, at the end I get:

WARNING: Trying skip of failed record cleanup of {WARC-Type=response, reader-identifier=/web-archiving-stacks/data/collections/xs048zp7815/cv/292/vs/5727/ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211132256193-00051.warc.gz, WARC-Date=2016-12-11T13:29:18Z, absolute-offset=471946099, Content-Length=19248744, WARC-Record-ID=<urn:uuid:f5405d28-65d0-4754-a1e5-b2175ede1bbc>, WARC-Payload-Digest=sha1:FXV6QEH674CTRXLMLVR5YQLEQCXS6D4X, WARC-IP-Address=54.230.141.103, WARC-Target-URI=http://dr6lcqo3bxtwa.cloudfront.net/binary/2016/12/9/23/1437582013143-p5k4ma/20161208b_CSPAN_BoxerFilibuster-1481328291115.mp4, Content-Type=application/http; msgtype=response}: invalid block type
java.util.zip.ZipException: invalid block type
    at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
    at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
    at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
    at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
    at org.archive.io.ArchiveRecord.skip(ArchiveRecord.java:262)
    at org.archive.io.ArchiveRecord.skip(ArchiveRecord.java:248)
    at org.archive.io.ArchiveRecord.close(ArchiveRecord.java:172)
    at org.archive.io.ArchiveReader.cleanupCurrentRecord(ArchiveReader.java:175)
    at org.archive.io.ArchiveReader$ArchiveRecordIterator.hasNext(ArchiveReader.java:449)
    at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.hasNext(ArchiveReaderCloseableIterator.java:37)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
    at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:216)

May 04, 2020 4:06:44 PM org.archive.io.ArchiveReader$ArchiveRecordIterator hasNext
WARNING: Trying skip of failed record cleanup of {WARC-Type=response, reader-identifier=/web-archiving-stacks/data/collections/xs048zp7815/cv/292/vs/5727/ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211132256193-00051.warc.gz, WARC-Date=2016-12-11T13:29:18Z, absolute-offset=471946099, Content-Length=19248744, WARC-Record-ID=<urn:uuid:f5405d28-65d0-4754-a1e5-b2175ede1bbc>, WARC-Payload-Digest=sha1:FXV6QEH674CTRXLMLVR5YQLEQCXS6D4X, WARC-IP-Address=54.230.141.103, WARC-Target-URI=http://dr6lcqo3bxtwa.cloudfront.net/binary/2016/12/9/23/1437582013143-p5k4ma/20161208b_CSPAN_BoxerFilibuster-1481328291115.mp4, Content-Type=application/http; msgtype=response}: invalid block type
java.util.zip.ZipException: invalid block type
    at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
    at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
    at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
    at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
    at org.archive.io.ArchiveRecord.skip(ArchiveRecord.java:262)
    at org.archive.io.ArchiveRecord.skip(ArchiveRecord.java:248)
    at org.archive.io.ArchiveRecord.close(ArchiveRecord.java:172)
    at org.archive.io.ArchiveReader.cleanupCurrentRecord(ArchiveReader.java:175)
    at org.archive.io.ArchiveReader$ArchiveRecordIterator.hasNext(ArchiveReader.java:449)
    at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:501)
    at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:436)
    at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.next(ArchiveReaderCloseableIterator.java:40)
    at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.next(ArchiveReaderCloseableIterator.java:29)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:56)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
    at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:216)

May 04, 2020 4:06:44 PM org.archive.io.ArchiveReader$ArchiveRecordIterator next
WARNING: Bad Record. Trying skip (Record start 471946099): invalid block type
Exception in thread "main" java.lang.RuntimeException: After retry (Offset 471946099)
    at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:512)
    at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:436)
    at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.next(ArchiveReaderCloseableIterator.java:40)
    at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.next(ArchiveReaderCloseableIterator.java:29)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:56)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
    at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:216)
Caused by: java.util.zip.ZipException: invalid block type
    at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
    at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
    at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
    at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:126)
    at org.archive.util.LaxHttpParser.readRawLine(LaxHttpParser.java:84)
    at org.archive.util.LaxHttpParser.readLine(LaxHttpParser.java:112)
    at org.archive.io.warc.WARCRecord.parseHeaders(WARCRecord.java:113)
    at org.archive.io.warc.WARCRecord.<init>(WARCRecord.java:90)
    at org.archive.io.warc.WARCReader.createArchiveRecord(WARCReader.java:94)
    at org.archive.io.warc.WARCReaderFactory$CompressedWARCReader$1.innerNext(WARCReaderFactory.java:290)
    at org.archive.io.ArchiveReader$ArchiveRecordIterator.exceptionNext(ArchiveReader.java:537)
    at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:505)
    ... 7 more
aaron-collier commented 4 years ago

Perhaps for dev discussion (or a specific planning meeting on WAS), but I think the items with the above issues are the result of corrupt source files.

If I try to manually unzip the file from above, it fails:

[was@was-robots1-prod testing]$ gunzip ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211132256193-00051.warc.gz

gzip: ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211132256193-00051.warc.gz: invalid compressed data--format violated

But if I unzip a different file:

[was@was-robots1-prod testing]$ gunzip webrecorder-cidr-20160512212029480-00000-6-5d875e632943.warc.gz
[was@was-robots1-prod testing]$ ls
ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211132256193-00051.warc.gz  webrecorder-cidr-20160512212029480-00000-6-5d875e632943.warc

That worked fine. So I suspect we'll need a way to re-download the source file or restart from it. Just a hunch at this point.
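The same check can be scripted without modifying the files, which would help when testing the sources for all of the failing items. This is a minimal sketch using Python's standard gzip module; `gzip_is_readable` is a hypothetical helper. It only shows that the gzip layer decompresses cleanly, not that the WARC records inside are valid.

```python
import gzip
import zlib


def gzip_is_readable(path, chunk_size=1024 * 1024):
    """Decompress the whole file, discarding output.

    Returns (True, "") if every gzip member decompresses, otherwise
    (False, description). A missing file is also reported as a failure,
    because FileNotFoundError is a subclass of OSError.
    """
    try:
        with gzip.open(path, "rb") as stream:
            # Read in chunks so large WARC files are never held in memory.
            while stream.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error) as err:
        # gzip.BadGzipFile (bad header or CRC) is an OSError; damaged
        # deflate data raises zlib.error; truncation raises EOFError.
        return False, f"{type(err).__name__}: {err}"
    return True, ""


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        ok, detail = gzip_is_readable(name)
        print("OK  " if ok else "BAD ", name, detail)
```

Unlike `gunzip`, this leaves the original `.warc.gz` in place, so it should be safe to point at the files under /web-archiving-stacks.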

andrewjbtw commented 4 years ago

Adding a note that this error affects the Stanford University Websites collection (specifically druid cv292vs5727), which is a collection that we're hoping to actively accession in the near future.

aaron-collier commented 4 years ago

Just some continued investigation:

The gzip files adjacent to the one I believe is corrupt decompress fine:

[was@was-robots1-prod testing]$ gunzip ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211132256193-00051.warc.gz

gzip: ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211132256193-00051.warc.gz: invalid compressed data--format violated
[was@was-robots1-prod testing]$ gunzip ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211133805177-00052.warc.gz
[was@was-robots1-prod testing]$ gunzip ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211135951009-00053.warc.gz

Also of note, the file that fails (00051) is much smaller than the other QUARTERLY files next to it:

-rw-r--r-- 1 was was 1089840585 Oct 30  2019 ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211135951009-00053.warc.gz
-rw-r--r-- 1 was was 1336199543 Oct 30  2019 ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211133805177-00052.warc.gz
-rw-r--r-- 1 was was  483459072 Oct 30  2019 ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211132256193-00051.warc.gz
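The size comparison can be turned into a quick screen for other suspect files. This is only a sketch: `flag_small_files`, the 0.5 ratio, and `bytes_after_offset` are assumptions chosen for illustration, and a small file is not necessarily corrupt.

```python
from statistics import median


def flag_small_files(sizes, ratio=0.5):
    """Return names whose size is below ratio * median size of the group."""
    if not sizes:
        return []
    cutoff = ratio * median(sizes.values())
    return sorted(name for name, size in sizes.items() if size < cutoff)


def bytes_after_offset(file_size, record_offset):
    """Bytes left in the file from the start of the failing record."""
    return file_size - record_offset


# Sizes copied from the directory listing above.
sizes = {
    "ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211135951009-00053.warc.gz": 1089840585,
    "ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211133805177-00052.warc.gz": 1336199543,
    "ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211132256193-00051.warc.gz": 483459072,
}

if __name__ == "__main__":
    print(flag_small_files(sizes))
    # absolute-offset of the failing record, from the cdx-indexer warning.
    print(bytes_after_offset(483459072, 471946099))
```

With these three sizes only the 00051 file falls below half the median. The failing record starts 11512973 bytes before the end of the file, while its header declares a Content-Length of 19248744 (uncompressed). Whether that means the file was cut short is not established here; it is just another data point for the corrupt-source hunch.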