pubpub / pubpub

Open Community Publishing
https://www.pubpub.org
GNU General Public License v2.0

Cannot import md with blockquote #1369

Open gabestein opened 3 years ago

gabestein commented 3 years ago

What went wrong, step-by-step?

  1. Tried to import the attached file.
  2. The import initially succeeded, but trying to complete it gives the error: Error applying transaction: RangeError: Invalid content for node type blockquote
  3. This happens even after stripping all tags out of the blockquote.
  4. Converting the blockquote to Markdown (e.g. > Text) works, but only after stripping any other HTML inside the blockquote.

What did you expect to happen?

It should import correctly.

2020-07-22-student-employment-academic-libraries.md

idreyn commented 3 years ago

Looked into this a little bit today. So consider this bit of HTML, interpreted as an .md file:

<blockquote><p><em>Hello</em></p></blockquote>

If you convert this to the Pandoc AST using pandoc -t json -f markdown input.md you get:

{"pandoc-api-version":[1,22],"meta":{},"blocks":[{"t":"RawBlock","c":["html","<blockquote>"]},{"t":"RawBlock","c":["html","<p>"]},{"t":"Plain","c":[{"t":"RawInline","c":["html","<em>"]},{"t":"Str","c":"Hello"},{"t":"RawInline","c":["html","</em>"]}]},{"t":"RawBlock","c":["html","</p>"]},{"t":"RawBlock","c":["html","</blockquote>"]}]}

Each opening (and closing!) HTML tag is interpreted as its own RawBlock/RawInline element. This is apparently expected behavior per the Pandoc manual:
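To make the fragmentation easier to see, here is a minimal Python sketch that walks the AST above and lists every raw HTML element it contains. The helper name iter_raw_html is made up for illustration and is not part of PubPub or Pandoc; the JSON is copied from the output above.

```python
import json

# AST copied from the `pandoc -t json -f markdown input.md` output above.
AST_JSON = r'''{"pandoc-api-version":[1,22],"meta":{},"blocks":[{"t":"RawBlock","c":["html","<blockquote>"]},{"t":"RawBlock","c":["html","<p>"]},{"t":"Plain","c":[{"t":"RawInline","c":["html","<em>"]},{"t":"Str","c":"Hello"},{"t":"RawInline","c":["html","</em>"]}]},{"t":"RawBlock","c":["html","</p>"]},{"t":"RawBlock","c":["html","</blockquote>"]}]}'''


def iter_raw_html(node):
    """Yield (element type, raw text) for every raw HTML element in a Pandoc AST.

    Hypothetical helper: recursively visits dicts and lists, ignoring scalars.
    """
    if isinstance(node, dict):
        if node.get("t") in ("RawBlock", "RawInline"):
            fmt, text = node["c"]
            if fmt == "html":
                yield node["t"], text
            return
        for value in node.values():
            yield from iter_raw_html(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_raw_html(item)


raw_elements = list(iter_raw_html(json.loads(AST_JSON)))
for kind, text in raw_elements:
    print(kind, text)
```

For this input the walk finds six separate raw elements (four RawBlock, two RawInline), none of which is a complete piece of HTML on its own.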

...pandoc can process “bare” raw HTML and TeX, [but] the result is often interspersed raw elements and normal textual elements...

I don't know why anyone would want this! However, if you invoke Pandoc with -f markdown_strict instead, you get:

{"blocks":[{"t":"RawBlock","c":["html","<blockquote><p><em>Hello</em></p></blockquote>"]}],"pandoc-api-version":[1,20],"meta":{}}

which is what our importer is designed to handle: when it sees a RawBlock with type html, it passes the contents wholesale into a Pandoc subprocess to be parsed and transformed. We originally imported Markdown as markdown_strict and at some point switched to markdown to gain flexibility elsewhere; this is an unintended side effect of that change.
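As a rough illustration of that design (not the actual PubPub importer, which is not shown in this thread), the idea can be sketched in Python as follows. The names expand_raw_html, parse_html_with_pandoc, and fake_parse_html are hypothetical; the only Pandoc flags used are -f html and -t json.

```python
import json
import subprocess
from typing import Any, Callable, Dict, List

Block = Dict[str, Any]


def parse_html_with_pandoc(html_text: str) -> List[Block]:
    """Parse an HTML string into Pandoc AST blocks via a subprocess.

    Assumes a `pandoc` binary is available on PATH.
    """
    completed = subprocess.run(
        ["pandoc", "-f", "html", "-t", "json"],
        input=html_text, capture_output=True, text=True, check=True,
    )
    return json.loads(completed.stdout)["blocks"]


def expand_raw_html(blocks: List[Block],
                    parse_html: Callable[[str], List[Block]]) -> List[Block]:
    """Replace each html RawBlock with whatever parse_html returns for it."""
    result: List[Block] = []
    for block in blocks:
        if block.get("t") == "RawBlock" and block["c"][0] == "html":
            result.extend(parse_html(block["c"][1]))
        else:
            result.append(block)
    return result


def fake_parse_html(html_text: str) -> List[Block]:
    """Stand-in parser so the sketch runs without Pandoc installed."""
    return [{"t": "BlockQuote", "c": [
        {"t": "Para", "c": [{"t": "Emph", "c": [{"t": "Str", "c": "Hello"}]}]},
    ]}]


# The single RawBlock that markdown_strict produces for the example above.
strict_blocks = [
    {"t": "RawBlock", "c": ["html", "<blockquote><p><em>Hello</em></p></blockquote>"]},
]
expanded = expand_raw_html(strict_blocks, fake_parse_html)
```

This approach only works when each RawBlock holds a complete HTML snippet. With the markdown format, the parser would instead receive fragments such as a lone <blockquote> opening tag, which cannot be parsed into the intended structure.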

So the remedy is one of:

  1. Update Pandoc and see if this is fixed in a newer version. I'm afraid to do this locally lest I permanently lose my copy of the older version PubPub is pinned to, so I may try to spin up a VM. I'm not hopeful that this will solve the problem, since I get the same results on Try Pandoc, which is presumably up to date.
  2. Teach the importer to parse the interspersed raw blocks that the markdown format gives us.
  3. Add a client-side UI control that lets us parse an .md file as either markdown or markdown_strict.

We need to do (1) anyway, but that's a much larger project that could consume a cycle or more. I think (2) is not really worth our time, but (3) would be easy and potentially more broadly useful.
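For a sense of what option (2) would involve, here is a minimal Python sketch that stitches a run of interspersed raw blocks back into a single html RawBlock, which is the shape the importer already handles. Everything here is an assumption for illustration: the function names are invented, only Str, Space, SoftBreak, and RawInline inlines are supported, and void tags such as br or hr are not handled.

```python
import html
import re
from typing import Any, Dict, List

Block = Dict[str, Any]

OPEN_TAG = re.compile(r"^<([A-Za-z][A-Za-z0-9]*)\b[^>]*>$")
CLOSE_TAG = re.compile(r"^</([A-Za-z][A-Za-z0-9]*)\s*>$")


def is_raw_html(node: Block, kind: str) -> bool:
    return node.get("t") == kind and node["c"][0] == "html"


def inlines_to_html(inlines: List[Block]) -> str:
    """Serialize the small subset of inlines this sketch understands."""
    parts = []
    for inline in inlines:
        if inline["t"] == "Str":
            parts.append(html.escape(inline["c"], quote=False))
        elif inline["t"] == "Space":
            parts.append(" ")
        elif inline["t"] == "SoftBreak":
            parts.append("\n")
        elif is_raw_html(inline, "RawInline"):
            parts.append(inline["c"][1])
        else:
            raise ValueError(f"unsupported inline: {inline['t']}")
    return "".join(parts)


def coalesce_raw_html(blocks: List[Block]) -> List[Block]:
    """Merge each open-tag ... close-tag run of blocks into one html RawBlock."""
    result: List[Block] = []
    buffer: List[str] = []  # HTML fragments of the element being rebuilt
    stack: List[str] = []   # tag names opened but not yet closed
    for block in blocks:
        if is_raw_html(block, "RawBlock"):
            text = block["c"][1].strip()
            opened, closed = OPEN_TAG.match(text), CLOSE_TAG.match(text)
            if opened and not text.endswith("/>"):
                stack.append(opened.group(1).lower())
            elif closed and stack and stack[-1] == closed.group(1).lower():
                stack.pop()
            elif not stack:
                result.append(block)  # already a complete snippet; keep as-is
                continue
            buffer.append(text)
            if not stack:
                result.append({"t": "RawBlock", "c": ["html", "".join(buffer)]})
                buffer = []
        elif stack:
            if block["t"] == "Plain":
                buffer.append(inlines_to_html(block["c"]))
            elif block["t"] == "Para":
                buffer.append("<p>" + inlines_to_html(block["c"]) + "</p>")
            else:
                raise ValueError(f"unsupported block inside HTML: {block['t']}")
        else:
            result.append(block)
    if stack:
        raise ValueError(f"unclosed tags: {stack}")
    return result


# The interspersed blocks from the `-f markdown` output earlier in this thread.
interspersed = [
    {"t": "RawBlock", "c": ["html", "<blockquote>"]},
    {"t": "RawBlock", "c": ["html", "<p>"]},
    {"t": "Plain", "c": [
        {"t": "RawInline", "c": ["html", "<em>"]},
        {"t": "Str", "c": "Hello"},
        {"t": "RawInline", "c": ["html", "</em>"]},
    ]},
    {"t": "RawBlock", "c": ["html", "</p>"]},
    {"t": "RawBlock", "c": ["html", "</blockquote>"]},
]
merged = coalesce_raw_html(interspersed)
```

Even for this one-line example the sketch needs tag matching and an inline serializer, and it would still have to grow to cover every block and inline type that can appear between raw tags. That is roughly why (2) looks like poor value compared with (3).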

gabestein commented 3 years ago

Discussed 5/11. We will likely implement option 3 (or option 3 plus extensions) at some point, but it is not urgent. For now, the workaround is to replace block-level HTML in blockquotes with its Markdown equivalent.
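A minimal sketch of what option 3 plus extensions could look like on the Pandoc side, written in Python for illustration only. The function names and the choice of footnotes and pipe_tables are assumptions; the +extension suffix on the format name and the -f / -t flags are standard Pandoc syntax.

```python
from typing import Iterable, List


def pandoc_format(strict: bool, extensions: Iterable[str] = ()) -> str:
    """Build the value for Pandoc's -f flag, e.g. markdown_strict+footnotes."""
    base = "markdown_strict" if strict else "markdown"
    return base + "".join(f"+{name}" for name in extensions)


def pandoc_args(path: str, strict: bool, extensions: Iterable[str] = ()) -> List[str]:
    """Build the command line for converting a Markdown file to the JSON AST."""
    return ["pandoc", "-f", pandoc_format(strict, extensions), "-t", "json", path]


# A hypothetical import that selects strict parsing but keeps two extensions.
args = pandoc_args("input.md", strict=True, extensions=["footnotes", "pipe_tables"])
print(" ".join(args))
```

The UI control would then only need to pass a boolean (and possibly a list of extension names) along with the file.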