When we ingest javadoc and jsdoc for docs.marklogic.com, we attempt to store each page in both its original HTML form and as cleaned-up XHTML. The idea, I think, is that the output of these third-party tools isn't always clean XHTML, so it may not be suitable for search.
The problem is that the code that does this is convoluted, and we end up running xdmp:tidy twice, serially. This creates an XHTML version of the original content that has had all of its structure ripped out.
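For reference, here is a minimal sketch of what a single-pass flow could look like in XQuery. The file path, URIs, and naming scheme below are hypothetical, not the actual ingest code; the one grounded detail is that xdmp:tidy returns two items, a status element followed by the cleaned document, so we take the second item and call tidy exactly once:

    xquery version "1.0-ml";

    (: Hypothetical source path and URIs, for illustration only :)
    let $path  := "/tmp/javadoc/SomeClass.html"
    let $html  := xdmp:document-get($path,
                    <options xmlns="xdmp:document-get">
                      <format>text</format>
                    </options>)
    (: xdmp:tidy returns (status, cleaned document); keep the document :)
    let $xhtml := xdmp:tidy(fn:string($html))[2]
    return (
      (: store the original HTML, which is what we display :)
      xdmp:document-insert("/apidoc/javadoc/SomeClass.html", $html),
      (: store the cleaned XHTML, tidied exactly once, for search :)
      xdmp:document-insert("/apidoc/javadoc/SomeClass_html.xhtml", $xhtml)
    )

The key point is that the original HTML string goes through tidy only once; the second, serial tidy pass is where the structure gets lost.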
It's not really apparent at the user level because what we display is the original HTML; the screwed-up XHTML is supposed to be used only for search. I think it's why we get search snippets like this for javadoc/jsdoc hits, though:
Notice the "blahblah_html.xhtml" names and the weird snippet text. The *_html.xhtml files are the screwed-up XHTML documents, and the text is bad because each of those documents contains one big blob of text in its body.
It turns out that these messed-up files are causing ingestion problems downstream for the global search project, so this needs to be fixed. As a side benefit, fixing it might also improve the current search experience.