Open nlevitt opened 10 years ago
Damn it, I can't attach a file to this issue. Ugh. OK, here it is, hopefully: https://ia902302.us.archive.org/7/items/problem_201409/problem.warc (16 MB).
To reproduce, run org.archive.wayback.resourcestore.indexer.IndexWorker with one argument, the path to that warc.
I created a test case (not committed as I'm not sure of licensing of problem.warc and maybe it's a bit on the big side for a test file):
IndexWorker iw = new IndexWorker();
iw.setInterval(0);
iw.init();
CloseableIterator<CaptureSearchResult> itr = iw.indexFile("src/test/resources/problem.warc");
CDXFormat cdxFormat = new CDXFormat(CDXFormatIndex.CDX_HEADER_MAGIC);
Iterator<String> lines =
    SearchResultToCDXFormatAdapter.adapt(itr, cdxFormat);
System.out.println(CDXFormatIndex.CDX_HEADER_MAGIC);
while (lines.hasNext()) {
    System.out.println(lines.next());
}
Then I used jstack to see what the stuck thread was up to:
"main" prio=5 tid=7fa5cb800800 nid=0x10d4e5000 runnable [10d4e3000]
java.lang.Thread.State: RUNNABLE
at org.htmlparser.lexer.Lexer.parseJsp(Lexer.java:1368)
at org.htmlparser.lexer.Lexer.nextNode(Lexer.java:359)
at org.archive.wayback.util.htmllex.ContextAwareLexer.nextNode(ContextAwareLexer.java:65)
at org.archive.wayback.resourcestore.indexer.HTTPRecordAnnotater.annotateHTMLContent(HTTPRecordAnnotater.java:156)
at org.archive.wayback.resourcestore.indexer.HTTPRecordAnnotater.annotateHTTPContent(HTTPRecordAnnotater.java:141)
at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptWARCHTTPResponse(WARCRecordToSearchResultAdapter.java:303)
at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptInner(WARCRecordToSearchResultAdapter.java:114)
at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:79)
at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:1)
at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:57)
at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
at org.archive.wayback.resourcestore.indexer.IndexWorkerTest.testIndexFile(IndexWorkerTest.java:46)
It seems the HTML parser has managed to get itself into an infinite loop. TBH, I'm surprised to find that the CDX indexer is parsing the HTML at all (at least by default).
Looking deeper inside, we find:
// Now the sticky part: If it looks like an HTML document, look for
// robot meta tags:
if(isHTML(mimeType)) {
    String fileContext = result.getFile() + ":" + result.getOffset();
    annotateHTMLContent(is, encoding, fileContext, result);
}
So, it seems an FLV is being parsed as HTML because the Content-Type was wrong, and the HTML parser is confounded.
FWIW, when parsing HTML etc. we have found that we have to wrap any such parser in a Thread that we can kill after a time-out. So that's one option.
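For illustration, here is a minimal sketch of the time-out idea using java.util.concurrent. The names TimeLimitedParse and callWithTimeout are hypothetical and are not part of OpenWayback. One caveat: Future.cancel(true) only interrupts the worker thread, so a parser spinning in a loop that never checks for interruption may keep burning CPU in the background. The time-out just lets the caller give up and move on to the next record.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not an OpenWayback class): run a task with a time limit. */
final class TimeLimitedParse {

    // Daemon threads, so a worker that ignores interruption cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "time-limited-parse");
        t.setDaemon(true);
        return t;
    });

    /**
     * Runs the task and waits at most timeoutMillis for its result.
     * Returns fallback if the task does not finish in time.
     */
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        Future<T> future = POOL.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort only: this interrupts the worker but cannot force it to stop.
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            // The caller itself was interrupted: restore the flag and give up.
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            // The task threw; surface the original cause to the caller.
            throw new IllegalStateException(e.getCause());
        }
    }
}
```

In the indexer, the call to annotateHTMLContent could in principle be wrapped in such a helper, with a fallback that simply skips the robot meta tag annotation for that record.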
@kngenie is working on the IA fork to improve content-type detection for link rewriting during playback. Maybe that logic can be used for this, too, when it's ready.
The IA content type detection that @kngenie was working on will be part of OpenWayback as of 2.1.0. The functionality introduced in "wayback-core/src/main/java/org/archive/wayback/replay/mimetype" might be used to solve this issue.
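Independently of that package (whose API is not shown in this thread), the general idea of not trusting the declared Content-Type can be sketched with a simple magic-number check. FLV files begin with the ASCII bytes "FLV", so a record declared as text/html whose payload starts with that signature could be skipped before it reaches the HTML lexer. The class FlvSniffer below is a hypothetical illustration, not existing OpenWayback code, and it assumes the payload stream supports mark/reset (for example, by wrapping it in a BufferedInputStream).

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical sniffing helper (not an OpenWayback class). */
final class FlvSniffer {

    // FLV files start with the ASCII signature "FLV".
    private static final byte[] FLV_MAGIC = {'F', 'L', 'V'};

    /** True if the given leading bytes start with the FLV signature. */
    static boolean startsWithFlvMagic(byte[] head) {
        if (head == null || head.length < FLV_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < FLV_MAGIC.length; i++) {
            if (head[i] != FLV_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Peeks at the start of the stream without consuming it.
     * Assumes the stream supports mark/reset.
     */
    static boolean looksLikeFlv(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] head = new byte[FLV_MAGIC.length];
        int n = 0;
        in.mark(head.length);
        try {
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break;
                }
                n += r;
            }
        } finally {
            // Rewind so a later parser still sees the full payload.
            in.reset();
        }
        return n == head.length && startsWithFlvMagic(head);
    }
}
```

A check like this would only cover FLV; a more general detector would be needed for other mislabeled binary formats.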
IndexWorker spins forever on (one particular) flv record with incorrect content-type text/html.