Cannot import md with blockquote

gabestein commented 3 years ago

What went wrong, step-by-step?

Tried to import the attached file.
The import initially succeeded, but trying to complete it gives the error: Error applying transaction: RangeError: Invalid content for node type blockquote
This happens even after stripping all tags out of the blockquote.
Converting the blockquote to Markdown (e.g. > Text) works, but only if you also strip any other HTML inside the blockquote.

What did you expect to happen?

It should import correctly.

2020-07-22-student-employment-academic-libraries.md

idreyn commented 3 years ago

Looked into this a little bit today. So consider this bit of HTML, interpreted as an .md file:

<blockquote><p><em>Hello</em></p></blockquote>

If you convert this to the Pandoc AST using pandoc -t json -f markdown input.md you get:

{"pandoc-api-version":[1,22],"meta":{},"blocks":[{"t":"RawBlock","c":["html","<blockquote>"]},{"t":"RawBlock","c":["html","<p>"]},{"t":"Plain","c":[{"t":"RawInline","c":["html","<em>"]},{"t":"Str","c":"Hello"},{"t":"RawInline","c":["html","</em>"]}]},{"t":"RawBlock","c":["html","</p>"]},{"t":"RawBlock","c":["html","</blockquote>"]}]}

Each opening (and closing!) HTML tag is interpreted as its own RawBlock/RawInline element. This is apparently expected behavior per the Pandoc manual:

...pandoc can process “bare” raw HTML and TeX, [but] the result is often interspersed raw elements and normal textual elements...

I don't know why anyone would want this! However, if you invoke Pandoc with -f markdown_strict instead, you get:

{"blocks":[{"t":"RawBlock","c":["html","<blockquote><p><em>Hello</em></p></blockquote>"]}],"pandoc-api-version":[1,20],"meta":{}}

which is what our importer is designed to handle — when it sees a RawBlock with type html it passes the contents wholesale into a Pandoc subprocess to be parsed and transformed. We originally imported Markdown as markdown_strict and at some point switched to markdown to gain flexibility elsewhere, and this is an unintended side effect of that change.
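To make the difference concrete, here is a minimal Python sketch that lists the HTML RawBlock contents in each of the two ASTs quoted above. It is not PubPub's importer code; `html_raw_blocks` is a hypothetical helper used only for illustration.

```python
import json

# The two ASTs quoted above, produced with -f markdown and -f markdown_strict.
MARKDOWN_AST = json.loads('''{"pandoc-api-version":[1,22],"meta":{},"blocks":[{"t":"RawBlock","c":["html","<blockquote>"]},{"t":"RawBlock","c":["html","<p>"]},{"t":"Plain","c":[{"t":"RawInline","c":["html","<em>"]},{"t":"Str","c":"Hello"},{"t":"RawInline","c":["html","</em>"]}]},{"t":"RawBlock","c":["html","</p>"]},{"t":"RawBlock","c":["html","</blockquote>"]}]}''')
STRICT_AST = json.loads('''{"blocks":[{"t":"RawBlock","c":["html","<blockquote><p><em>Hello</em></p></blockquote>"]}],"pandoc-api-version":[1,20],"meta":{}}''')


def html_raw_blocks(ast):
    """Return the contents of every top-level RawBlock whose format is "html"."""
    return [
        block["c"][1]
        for block in ast["blocks"]
        if block["t"] == "RawBlock" and block["c"][0] == "html"
    ]


# With -f markdown, each tag arrives as its own fragment, none of which is
# a complete element; with -f markdown_strict, the whole element arrives at once.
print(html_raw_blocks(MARKDOWN_AST))
print(html_raw_blocks(STRICT_AST))
```

An importer that hands each RawBlock to an HTML parser on its own would receive unbalanced fragments such as a lone opening blockquote tag in the first case, which is consistent with the error reported above.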

So the remedy is one of:

1. Update Pandoc and see if this is fixed in a newer version. I'm afraid to do this locally lest I permanently lose my copy of the older version PubPub is pegged to, so I may try to spin up a VM. I'm not hopeful that this will solve the problem, since I get the same results on Try Pandoc which presumably is up to date.
2. Teach the importer to parse the interspersed raw blocks that the markdown format gives us.
3. Add a client-side UI control that lets us parse an .md file as either markdown or markdown_strict.
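As a rough idea of what parsing the interspersed raw blocks might involve, the Python sketch below merges a run of raw HTML fragments back into a single RawBlock, which is the shape the importer already handles. It is an illustration under simplifying assumptions, not PubPub's code: `coalesce_raw_html` and `inlines_to_html` are hypothetical names, only a few inline types are supported, and void elements written without a trailing slash (such as a bare br tag) would be miscounted as open tags.

```python
import html
import json
import re

# Top-level blocks from the -f markdown AST quoted earlier in this thread.
INTERSPERSED_BLOCKS = json.loads('''[{"t":"RawBlock","c":["html","<blockquote>"]},{"t":"RawBlock","c":["html","<p>"]},{"t":"Plain","c":[{"t":"RawInline","c":["html","<em>"]},{"t":"Str","c":"Hello"},{"t":"RawInline","c":["html","</em>"]}]},{"t":"RawBlock","c":["html","</p>"]},{"t":"RawBlock","c":["html","</blockquote>"]}]''')

OPEN_TAG = re.compile(r"<[A-Za-z][^<>]*(?<!/)>")  # a lone opening tag, not self-closing
CLOSE_TAG = re.compile(r"</[A-Za-z][^<>]*>")      # a lone closing tag


def inlines_to_html(inlines):
    """Serialize a small subset of Pandoc inline elements back to HTML."""
    parts = []
    for inline in inlines:
        kind = inline["t"]
        if kind == "Str":
            parts.append(html.escape(inline["c"], quote=False))
        elif kind in ("Space", "SoftBreak"):
            parts.append(" ")
        elif kind == "RawInline" and inline["c"][0] == "html":
            parts.append(inline["c"][1])
        else:
            raise ValueError(f"unsupported inline type: {kind}")
    return "".join(parts)


def coalesce_raw_html(blocks):
    """Merge each run of raw HTML fragments into one html RawBlock."""
    merged = []
    buffer = []  # HTML collected since the outermost opening tag
    depth = 0    # number of raw HTML tags currently open
    for block in blocks:
        if block["t"] == "RawBlock" and block["c"][0] == "html":
            fragment = block["c"][1].strip()
            if CLOSE_TAG.fullmatch(fragment):
                depth -= 1
            elif OPEN_TAG.fullmatch(fragment):
                depth += 1
            buffer.append(fragment)
            if depth <= 0:  # the outermost element is closed: emit it
                merged.append({"t": "RawBlock", "c": ["html", "".join(buffer)]})
                buffer, depth = [], 0
        elif depth > 0:
            # Text that Pandoc parsed as Markdown between the raw tags.
            if block["t"] == "Plain":
                buffer.append(inlines_to_html(block["c"]))
            elif block["t"] == "Para":
                buffer.append("<p>" + inlines_to_html(block["c"]) + "</p>")
            else:
                raise ValueError(f"unsupported block type: {block['t']}")
        else:
            merged.append(block)  # ordinary block outside any raw HTML
    if buffer:  # unclosed tags: emit what was collected rather than drop it
        merged.append({"t": "RawBlock", "c": ["html", "".join(buffer)]})
    return merged


print(coalesce_raw_html(INTERSPERSED_BLOCKS))
```

Running this on the example yields a single RawBlock with the same contents as the markdown_strict output. A real implementation would need to cover far more block and inline types, which is part of why this option is the more expensive one.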

We need to do (1) anyway, but that's a much larger project that could consume a cycle or more. I think (2) is not really worth our time, but (3) would be easy and potentially more broadly useful.
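On the server side, a format toggle could amount to little more than passing the chosen format through to Pandoc's -f flag. The Python sketch below assumes that; `pandoc_args` and `parse_markdown` are hypothetical names, and the invocation mirrors the pandoc -t json -f markdown input.md command shown earlier rather than PubPub's actual importer.

```python
import json
import subprocess

# The two input formats discussed in this thread.
SUPPORTED_FORMATS = ("markdown", "markdown_strict")


def pandoc_args(path, input_format="markdown"):
    """Build the Pandoc command line for the format the user selected."""
    if input_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported input format: {input_format}")
    return ["pandoc", "-f", input_format, "-t", "json", path]


def parse_markdown(path, input_format="markdown"):
    """Run Pandoc (which must be installed) and return the parsed JSON AST."""
    result = subprocess.run(
        pandoc_args(path, input_format),
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

For example, `parse_markdown("input.md", "markdown_strict")` would request the AST in which the blockquote arrives as a single RawBlock.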

gabestein commented 3 years ago

Discussed 5/11. Likely implementing no. 3 (or no. 3 + extensions) at some point, but not urgent. For now, the workaround is to replace block-level HTML in blockquotes with the Markdown equivalent.

pubpub / pubpub

Cannot import md with blockquote #1369
