marklogic-community / RunDMC

MarkLogic application for running a developer site
http://developer.marklogic.com/code/rundmc

javadoc/jsdoc searchable docs are malformed #801

Open kcoleman-marklogic opened 5 years ago

kcoleman-marklogic commented 5 years ago

When we ingest javadoc and jsdoc for docs.marklogic.com, an attempt is made to store it in both the original HTML form and as cleaned-up XHTML. The idea, I think, is that what these third-party apps produce isn't always clean XHTML, so it may not be suitable for search.

The problem is that the code that does this is kinda convoluted, and we end up running xdmp:tidy twice, serially. This creates an XHTML version of the original content that has had all its structure ripped out.

It's not really apparent at the user level because what we display is the original HTML. The screwed-up XHTML is supposed to be used only for searches. I think it is why we get search snippets like this for javadoc/jsdoc hits, though:

[image: search result snippet for a javadoc/jsdoc hit]

Notice the "blahblah_html.xhtml" and the weird snippet text. The *_html.xhtml documents are the screwed-up XHTML files, and the text is bad because those documents contain one big blob of text in the body.
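
For anyone who wants to spot-check exported copies of these documents, here is a minimal sketch of a "blob body" detector. It is not part of RunDMC. The function name `body_is_text_blob` and the two sample documents are hypothetical. It also assumes the XHTML is well-formed XML without undeclared entities such as `&nbsp;`, which Python's standard parser would reject.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def body_is_text_blob(xhtml: str) -> bool:
    """Return True if <body> holds text but no child elements."""
    root = ET.fromstring(xhtml)
    # Accept both namespaced and un-namespaced <body> elements.
    body = root.find(f"{XHTML_NS}body")
    if body is None:
        body = root.find("body")
    if body is None:
        return False  # no body at all; a different kind of problem
    has_children = len(body) > 0
    has_text = bool((body.text or "").strip())
    return has_text and not has_children


# Hypothetical samples: one with structure intact, one flattened to text.
good = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head>'
    "<body><h1>Class Foo</h1><p>Does a thing.</p></body></html>"
)
blob = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head>'
    "<body>Class Foo Does a thing.</body></html>"
)

for name, doc in (("good", good), ("blob", blob)):
    print(name, body_is_text_blob(doc))
```

A check like this only flags the symptom described above (all text, no structure). It says nothing about why the structure was lost.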

It turns out that these messed-up files are causing ingestion problems downstream for the global search project, so this needs to be fixed. As a side benefit, fixing it might also improve the current search experience.

kcoleman-marklogic commented 5 years ago

I believe this is fixed. Going to run it on pubs for a while to make sure I didn't invent exciting new problems.