IndexWorker spins forever on flv record with incorrect content-type text/html

nlevitt commented 10 years ago

IndexWorker spins forever on (one particular) flv record with incorrect content-type text/html.

nlevitt commented 10 years ago

Damn it, I can't attach a file to this issue. Ugh. OK, here it is, hopefully: https://ia902302.us.archive.org/7/items/problem_201409/problem.warc (16 MB).

To reproduce, run org.archive.wayback.resourcestore.indexer.IndexWorker with one argument, the path to that warc.

anjackson commented 10 years ago

I created a test case (not committed as I'm not sure of licensing of problem.warc and maybe it's a bit on the big side for a test file):

        IndexWorker iw = new IndexWorker();
        iw.setInterval(0);
        iw.init();

        CloseableIterator<CaptureSearchResult> itr = iw.indexFile("src/test/resources/problem.warc");
        CDXFormat cdxFormat = new CDXFormat(CDXFormatIndex.CDX_HEADER_MAGIC);
        Iterator<String> lines = 
            SearchResultToCDXFormatAdapter.adapt(itr, cdxFormat);
        System.out.println(CDXFormatIndex.CDX_HEADER_MAGIC);
        while(lines.hasNext()) {
            System.out.println(lines.next());
        }

Then used jstack to see what the stuck thread was up to:

"main" prio=5 tid=7fa5cb800800 nid=0x10d4e5000 runnable [10d4e3000]
   java.lang.Thread.State: RUNNABLE
    at org.htmlparser.lexer.Lexer.parseJsp(Lexer.java:1368)
    at org.htmlparser.lexer.Lexer.nextNode(Lexer.java:359)
    at org.archive.wayback.util.htmllex.ContextAwareLexer.nextNode(ContextAwareLexer.java:65)
    at org.archive.wayback.resourcestore.indexer.HTTPRecordAnnotater.annotateHTMLContent(HTTPRecordAnnotater.java:156)
    at org.archive.wayback.resourcestore.indexer.HTTPRecordAnnotater.annotateHTTPContent(HTTPRecordAnnotater.java:141)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptWARCHTTPResponse(WARCRecordToSearchResultAdapter.java:303)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptInner(WARCRecordToSearchResultAdapter.java:114)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:79)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:1)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:57)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
    at org.archive.wayback.resourcestore.indexer.IndexWorkerTest.testIndexFile(IndexWorkerTest.java:46)

It seems the HTML parser has managed to get itself into an infinite loop in Lexer.parseJsp. TBH, I'm surprised to find that the CDX indexer is parsing the HTML at all (at least by default).

anjackson commented 10 years ago

Looking deeper inside, we find:

        // Now the sticky part: If it looks like an HTML document, look for
        // robot meta tags:
        if(isHTML(mimeType)) {
            String fileContext = result.getFile() + ":" + result.getOffset();
            annotateHTMLContent(is, encoding, fileContext, result);
        }

So it seems an FLV is being parsed as HTML because the recorded Content-Type was wrong (text/html), and the HTML parser is confounded by the binary data.

FWIW, when parsing HTML etc. we have found that we have to wrap any such parser in a Thread that we can kill after a time-out. So that's one option.
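A minimal sketch of that time-out wrapper, using only java.util.concurrent. The class and method names (TimeLimitedTask, callWithTimeout) are hypothetical and not part of OpenWayback. One caveat: Future.cancel(true) only interrupts the worker, so a parser loop that never checks the interrupt flag may keep spinning; the worker is made a daemon thread here so that it at least cannot keep the JVM alive.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: name and API are illustrative only.
final class TimeLimitedTask {

    /**
     * Runs the task on a daemon worker thread and waits at most timeoutMillis.
     * Returns fallback if the task times out, throws, or the caller is interrupted.
     */
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "time-limited-task");
            t.setDaemon(true); // a runaway task must not block JVM shutdown
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // best effort: sets the worker's interrupt flag
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the task itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return fallback;
        } finally {
            executor.shutdownNow(); // no further tasks; interrupts the worker if still running
        }
    }
}
```

In the indexer, the call to annotateHTMLContent(is, encoding, fileContext, result) could be passed in as the task, with the record indexed without robot meta annotations when the fallback is returned. Creating an executor per call keeps the sketch short; a real indexer would probably want to reuse one.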

nlevitt commented 10 years ago

@kngenie is working on the IA fork to improve content-type detection for link rewriting during playback. Maybe that logic can be used for this, too, when it's ready.

RogerMathisen commented 9 years ago

The IA content-type detection that @kngenie was working on will be part of OpenWayback as of 2.1.0. The functionality introduced in "wayback-core/src/main/java/org/archive/wayback/replay/mimetype" might be usable to solve this issue.
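To illustrate the general idea of checking the content rather than trusting the header, here is a rough sketch that peeks at the first bytes of the payload for the FLV signature (the ASCII bytes "FLV") before handing it to the HTML parser. This is not the API of the mimetype package mentioned above; ContentSniffer and looksLikeFlv are hypothetical names, and it assumes the stream supports mark/reset (for example a BufferedInputStream).

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper: not part of OpenWayback's replay/mimetype package.
final class ContentSniffer {

    // FLV files start with the ASCII signature "FLV".
    private static final byte[] FLV_MAGIC = {'F', 'L', 'V'};

    /**
     * Peeks at the start of the stream and reports whether it begins with the
     * FLV signature. The stream is reset afterwards, so no bytes are consumed.
     */
    static boolean looksLikeFlv(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(FLV_MAGIC.length);
        try {
            byte[] head = new byte[FLV_MAGIC.length];
            int n = 0;
            // read() may return fewer bytes than requested, so loop until full or EOF
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break;
                }
                n += r;
            }
            return n == head.length && Arrays.equals(head, FLV_MAGIC);
        } finally {
            in.reset(); // rewind so the caller sees the payload from the start
        }
    }
}
```

With something like this, the isHTML(mimeType) branch could skip annotateHTMLContent when the payload is clearly not HTML. A single signature check only covers this one case; proper detection would need to handle many more formats.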

iipc / openwayback

IndexWorker spins forever on flv record with incorrect content-type text/html #162