jgm / pandoc

Universal markup converter
https://pandoc.org

Add raw_xml extension for JATS reader for reading additional elements not recognized by the built-in parser #8424

Open jjallaire opened 1 year ago

jjallaire commented 1 year ago

The JATS spec is huge, and while the Pandoc reader supports a broad subset, there are many elements it still might not recognize. It would be great if a raw_xml extension could be added to the reader that would preserve unrecognized elements as RawBlock / RawInline of type jats. This would in turn allow for handling these elements downstream.

jgm commented 1 year ago

Could you give some examples of currently unrecognized things?

tarleb commented 1 year ago

Index and crossref-related elements like <index-term-range-end> come to mind: those are difficult to represent in the AST.

jgm commented 1 year ago

Would it be bad if we just always included these as raw JATS (without an extension)? They will only show up in JATS output, so this should be fairly harmless, right? Unless there are elements that we want to parse one way if raw JATS is allowed, and another way otherwise.

jjallaire commented 1 year ago

Yes, it seems like always including these would be fine (and almost anyone interested in reading JATS might have use for it).

jgm commented 1 year ago

I guess one consequence is that the raw JATS bits would get converted to markdown when the raw_attribute extension is enabled. Is that bad? If so we could add an extension.

jgm commented 1 year ago

Another issue is this: the JATS reader first uses an XML parser, then parses the structure this produces. So we don't actually have access to the "raw" bits; the best we could do would be to re-render the element as text. This should work, though.
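A rough illustration of the "re-render as text" point, written in Python with the standard library rather than in pandoc's Haskell. It is only a sketch of the general effect, not the JATS reader's code: once the input has gone through an XML parser, the original bytes are gone, so the serialized element can differ from the source in whitespace and quoting while remaining equivalent XML.

```python
import xml.etree.ElementTree as ET

# Source text with unusual spacing and single-quoted attributes.
source = "<p>See <index-term-range-end   rid='r1'/> here.</p>"

root = ET.fromstring(source)
unknown = root.find("index-term-range-end")

# ElementTree stores the text that follows an element in `.tail`;
# clear it so only the element itself is serialized.
unknown.tail = None

# The parser kept the structure, not the original characters, so this
# is a re-rendering of the element rather than the "raw" source.
rerendered = ET.tostring(unknown, encoding="unicode")
print(rerendered)
```

The printed element carries the same name and attribute as the input, but not its exact spelling, which is the sense in which the raw bits are unavailable.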

jjallaire commented 1 year ago

I think adding an extension would be good so that no unexpected side effects occur. The scenario is a custom reader/converter, so turning on an extension is not a burden at all.

castedo commented 1 year ago

jjallaire wrote:

The JATS spec is huge and while the Pandoc reader supports a broad subset, there are many elements it still might not recognize.

jgm wrote:

Could you give some examples of currently unrecognized things?

I assume @jjallaire you are only talking about "pure text" JATS, like what's inside the body and abstract elements in JATS?

If we're talking about also JATS references, I'd say the big unrecognized thing is mixed-citations (see #6287).

If we're talking about everything in JATS, I have many points to share. FWIW, I only have pandoc process pieces of JATS, namely the "pure text" and references. For the rest I'm currently using https://github.com/elifesciences/elife-tools. I've coded up a higher-level API to coordinate this Frankenstein of JATS parsing in https://gitlab.com/perm.pub/epijats/ (I might move away from elife-tools and just start calling the lxml Python XML parser directly).

I've been exploring the archived JATS XML files in PMC Open Access Subset and archived eLife JATS XML files. Parsing general JATS XML will be very challenging. I'm happy to discuss details if folks want my sense of the challenge.

jjallaire commented 1 year ago

For further context, we are exploring whether JATS might be a path to reconciling the ASTs of various scientific/technical publishing systems (e.g. Pandoc, JupyterBook, Curvenote, Quarto, NextJournal). All of these systems can produce advanced technical documents with citations, figures, cross-references, author/affiliation metadata, etc. but it's challenging to make them interoperate.

In the meantime parts of the scientific publishing industry are working towards JATS as a standard format for submission/review/archiving (e.g. see https://www.ncbi.nlm.nih.gov/books/NBK579698/). So JATS has the potential to serve a critical unifying role in the way that all of this software plays together. The questions here are not to say that Pandoc should parse everything but rather leave some technical door open for custom readers to fill in things not currently parsed.

castedo commented 1 year ago

Thanks for sharing the context. @jjallaire, I'll email you my JATS observations and JATS plans that are not directly related to this GitHub issue.

Regarding this specific GitHub issue: I have a strong hunch that productive JATS processing will involve reading an entire JATS XML file with an XML parser (or some library specialized for JATS). I am skeptical that running the entire JATS XML file through pandoc and then working only from the pandoc AST will be a sensible approach. I suspect it will just end up being a frustrating attempt to re-purpose pandoc as an XML parser. I'm happy to go into more detail if anybody is curious about my reasoning.

I don't want to sound only negative here. Pandoc is awesome at so many things. I just want to warn folks about a new area where I suspect pandoc will not shine with its usual brilliance. XML is a beast. JATS is a monster. :fearful: Be scared. :laughing:

So here's an idea to throw out regarding what pandoc does with XML elements it does not recognize: just return an XPath to the XML element and don't have pandoc try to do anything more with it. If client software wants to do something smart with those XML elements, it should use an XML parser. The XPath returned in the pandoc AST would enable coordinating pandoc with a separate, true XML parse.
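To make the XPath idea concrete, here is a minimal Python sketch using only the standard library. The `KNOWN` set and the `unknown_paths` helper are hypothetical names invented for illustration; nothing like this exists in pandoc. The walker records a positional path for each unrecognized element and does not descend into it.

```python
import xml.etree.ElementTree as ET

# Hypothetical set of element names a reader "knows".
KNOWN = {"article", "body", "sec", "p", "title"}

def unknown_paths(root):
    """Return positional XPath-style paths to unrecognized elements."""
    found = []

    def walk(elem, path):
        if elem.tag not in KNOWN:
            found.append(path)
            return  # leave the element to a separate XML parse
        counts = {}
        for child in elem:
            # XPath positions are 1-based and counted per tag name.
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    walk(root, f"/{root.tag}")
    return found

doc = ET.fromstring(
    "<article><body><sec><title>T</title>"
    "<p>a</p><mystery><p>b</p></mystery><p>c</p>"
    "</sec></body></article>"
)
paths = unknown_paths(doc)
print(paths)
```

A client could then hand each recorded path to a full XML library to retrieve the element from the original file.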

jgm commented 1 year ago

Here's an issue that arises. Suppose we have unknown_element in JATS, but it can contain children that are interpretable by pandoc. What should pandoc do?

Currently in this kind of case pandoc will just ignore the unknown element and process the children.

Suppose we don't want to ignore the unknown element. If we store a pretty-printed version of the entire element in a RawBlock (Format "jats"), that means the children don't get parsed by pandoc. This will yield worse results in many cases.

Another option would be to have a RawBlock with just the rendered opening tag, then the parsed contents, then another RawBlock with just the rendered closing tag.
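The two options can be compared side by side in a small Python sketch. This is a toy model, not the JATS reader: the `KNOWN` set, the tuple "AST", and the helper names are all assumptions made for illustration, and attribute values are not escaped.

```python
import copy
import xml.etree.ElementTree as ET

KNOWN = {"p"}  # hypothetical: the only element this toy reader handles

def render(elem):
    """Serialize an element without the text that follows it."""
    clone = copy.copy(elem)  # shallow copy so the original tail survives
    clone.tail = None
    return ET.tostring(clone, encoding="unicode")

def open_tag(elem):
    attrs = "".join(f' {k}="{v}"' for k, v in elem.attrib.items())
    return f"<{elem.tag}{attrs}>"

def read_whole(elem):
    """Option 1: an unknown element becomes one raw node, children unparsed."""
    if elem.tag in KNOWN:
        return [("Para", "".join(elem.itertext()))]
    return [("RawBlock", render(elem))]

def read_wrapped(elem):
    """Option 2: raw open tag, parsed children, raw close tag."""
    if elem.tag in KNOWN:
        return [("Para", "".join(elem.itertext()))]
    out = [("RawBlock", open_tag(elem))]
    for child in elem:
        out.extend(read_wrapped(child))
    out.append(("RawBlock", f"</{elem.tag}>"))
    return out

elem = ET.fromstring(
    '<unknown-wrapper kind="tip"><p>Hello</p><p>World</p></unknown-wrapper>'
)
print(read_whole(elem))
print(read_wrapped(elem))
```

With the first option the two paragraphs are locked inside a single raw string; with the second they remain ordinary parsed nodes between two raw markers.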

jjallaire commented 1 year ago

@jgm The last option you mention (distinct raw elements with the open/close tags) was exactly what I had in mind. This wouldn't interfere with the normal Pandoc parsing but would preserve the tags at exactly their spot in the tree (@castedo I think the objection to the XPath idea is that you actually want Pandoc to process the children into its AST rather than ignoring the element entirely). Note that I think it would be good to do both RawBlock (Format "jats") and RawInline (Format "jats")

A subsequent reader of the AST could then look for the begin/end raw jats elements and add additional behavior as appropriate. Including the raw jats elements in the tree when there isn't a reader that is planning on consuming them is kind of strange, which is why I do think this should be an opt-in behavior. Also note that having "unbalanced" raw tags would not be a novel construction (the openxml writer currently supports unbalanced tags, as do the HTML and LaTeX writers).
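As a sketch of what such a downstream consumer might look like, the Python below walks a flat list of blocks in the shape of pandoc's JSON AST (`{"t": ..., "c": ...}`) and pairs up raw JATS open and close tags. The proposed extension does not exist, so the input here is hand-written, and `pair_raw_jats` is a hypothetical helper with deliberately naive tag-name parsing.

```python
def pair_raw_jats(blocks):
    """Return (open_index, close_index, tag_name) for matched raw JATS tags."""
    stack, pairs = [], []
    for i, block in enumerate(blocks):
        if block.get("t") != "RawBlock":
            continue
        fmt, text = block["c"]
        if fmt != "jats" or text.endswith("/>"):
            continue  # other formats and self-closing tags need no pairing
        if text.startswith("</"):
            name = text[2:-1].strip()
            if stack and stack[-1][1] == name:
                start, _ = stack.pop()
                pairs.append((start, i, name))
        elif text.startswith("<"):
            # Tag name is everything up to the first whitespace.
            name = text[1:-1].split()[0]
            stack.append((i, name))
    return pairs

blocks = [
    {"t": "Para", "c": [{"t": "Str", "c": "Intro"}]},
    {"t": "RawBlock", "c": ["jats", '<unknown-wrapper kind="tip">']},
    {"t": "Para", "c": [{"t": "Str", "c": "Hello"}]},
    {"t": "RawBlock", "c": ["jats", "</unknown-wrapper>"]},
]
pairs = pair_raw_jats(blocks)
print(pairs)
```

Everything between a matched pair of indices is content that pandoc parsed normally, which the consumer can then treat specially.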

castedo commented 1 year ago

you actually want Pandoc to process the children into its AST rather than ignoring the element entirely

My experience is that I wish pandoc would not process XML elements it does not know. It does not do a good job guessing what it should do: for example #8438.

Perhaps a good bit of the issue here is what actual elements are we talking about and what is pandoc actually going to do. In the case of #8438 pandoc should not act like it knows what to do. If we're talking hypothetically about any possible JATS XML element and whatever pandoc guesses it should do with it, then I'd say, no, you probably don't actually want pandoc processing it. If there is some situation where there is a reasonable guess, then maybe that works, but then that's not really an XML element that is unknown to pandoc.

I think a good deal of this gets into: 1) which JATS XML dialect are we talking about (JATS XML produced by pandoc? JATS XML in PubMed Central? at eLife? ...) 2) how graceful should graceful degradation be, or what is or is not graceful 3) is the application something like text mining or actually producing PDF/HTML files with content that is faithful to what the authors intended to communicate? 4) how much manual human intervention is required to tweak input JATS or output HTML/PDF to deal with incorrect processing

In my case, I probably will be unaffected by whether this feature is implemented or not. I probably won't be parsing JATS XML that pandoc does not understand. And if I were, I would probably remove those XML elements from a parsed XML tree to regenerate a simpler XML file to pass to pandoc.
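A minimal sketch of that pre-processing step, using Python's standard library. The tag name `mystery` and the helper `drop_elements` are made up for illustration. The one subtlety is that ElementTree stores the text following an element on that element (`.tail`), so it has to be moved before the element is removed.

```python
import xml.etree.ElementTree as ET

def drop_elements(root, tags):
    """Remove elements whose tag is in `tags`, keeping the text after them."""
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            if child.tag in tags:
                if child.tail:
                    # Re-attach the trailing text to whatever precedes it.
                    if previous is None:
                        parent.text = (parent.text or "") + child.tail
                    else:
                        previous.tail = (previous.tail or "") + child.tail
                parent.remove(child)
            else:
                previous = child
    return root

tree = ET.fromstring(
    "<p>Before <mystery>a</mystery>after <bold>x</bold>"
    "<mystery>b</mystery> end</p>"
)
simplified = ET.tostring(drop_elements(tree, {"mystery"}), encoding="unicode")
print(simplified)
```

The simplified string could then be written to a file and passed to pandoc in place of the original.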

jjallaire commented 1 year ago

@castedo The proposal here isn't that Pandoc tries to "process" elements it doesn't know but rather leave them in place so that another processor can make more knowledgeable use of them. They would simply be left in the tree as RawInline / RawBlock and basically disappear when converting to another non-JATS format. The feature would furthermore be opt-in so no existing clients would see any change in behavior.

kamoe commented 1 year ago

@castedo Pandoc does not guess what to do with an element it doesn't know. It always skips it, and goes on to parse its children. That is what @jgm suggested earlier. Here's where that happens:

https://github.com/jgm/pandoc/blob/16f28ef5e945f3be14e05afb7d91f8adca18e49a/src/Text/Pandoc/Readers/JATS.hs#L201

There is no guesswork involved.

In the case of https://github.com/jgm/pandoc/issues/8438, as per the JATS specs, children of <restricted-by> are always only text. The way the JATS reader parses text is to simply present it as a trimmed inline plain text. Here is where that happens, after skipping the unknown parent:

https://github.com/jgm/pandoc/blob/16f28ef5e945f3be14e05afb7d91f8adca18e49a/src/Text/Pandoc/Readers/JATS.hs#L160-L162

There is a fundamental difference between skipping an unknown element with only text content, as in https://github.com/jgm/pandoc/issues/8438, and skipping an unknown element with known element children. In the former, you risk outputting unwanted content or removing wanted content; in the latter, skipping the unknown level and then getting back on track to parse the known elements is the closest thing to an ideal output, since this is not Pandoc pretending it knows what to do. It actually knows what to do, since these are known elements. (On top of this, I like the suggestion of keeping a RawInline / RawBlock to mark a skipped unknown element, as @jjallaire suggests.)

I suggested an approach for a solution for text-only children in https://github.com/jgm/pandoc/issues/8438#issuecomment-1556027479

castedo commented 1 year ago

My word choice of "guess" is a poor choice ... I guess. :sweat_smile:

This example of XHTML should be some good grounding to make sure we're talking about the same concepts:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <digestible>BARF</digestible>
    <head>
        <title>XHTML Example</title>
        <combustible>BOOM!</combustible>
    </head>
    <body>
        <ignore>Yep,</ignore>
        I am text.
    </body>
</html>

Instead of the wording "guess"/"pretend"/"know", etc., I propose two terms, "marked-up text" and "XML data":

For "marked-up-text" pandoc does processing as @kamoe describes above.

For unknown XML data elements like <html><head><combustible> pandoc appears to completely skip the entire element regardless of whether there is text inside and/or subelements it "knows" (outside of its context in the XML tree).

Folks can disagree on whether something like <html><digestible> should be handled as marked-up-text vs XML data.

Later I'll write a challenging question about XML parsing, JATS and pandoc.

kamoe commented 1 year ago

@castedo Keep in mind Pandoc operates with different readers for different formats.

I have not analysed the HTML reader in depth, but at a quick glance, the operation is not dissimilar to the JATS one (the same general principles of case handling and recursion seem to apply). So why am I bringing up the difference between readers? Simply because whether an element is "known" is defined by each reader independently.

For JATS, it happens here: https://github.com/jgm/pandoc/blob/16f28ef5e945f3be14e05afb7d91f8adca18e49a/src/Text/Pandoc/Readers/JATS.hs#L165-L201

For HTML, it happens here: https://github.com/jgm/pandoc/blob/16f28ef5e945f3be14e05afb7d91f8adca18e49a/src/Text/Pandoc/Readers/HTML.hs#L178-L230

Now, there is a difference between skipping an element and ignoring an element.

Skipping an element means that there is no case defined for that element (what I would call an unknown element), and the parser proceeds to parse its children, e.g. the default case: https://github.com/jgm/pandoc/blob/16f28ef5e945f3be14e05afb7d91f8adca18e49a/src/Text/Pandoc/Readers/JATS.hs#L201

Ignoring an element is quite the opposite: it specifically acknowledges the existence of that element with a case (therefore, a known element), but tells the parser not to do anything with it, effectively stopping the recursion to children and erasing it from the AST, and thus from the output. Needless to say, this should be done only when you are absolutely sure of what you are doing (e.g. ignoring title and label in the JATS reader created a significant issue; see https://github.com/jgm/pandoc/issues/8718):

https://github.com/jgm/pandoc/blob/16f28ef5e945f3be14e05afb7d91f8adca18e49a/src/Text/Pandoc/Readers/JATS.hs#L185-L186
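The skip/ignore distinction can be modelled in a few lines of Python. This is a toy reader written for illustration only, not a translation of the linked Haskell: the `HANDLED` and `IGNORED` tables and the tuple output are assumptions.

```python
import xml.etree.ElementTree as ET

HANDLED = {"p": "Para"}  # hypothetical: elements with their own case
IGNORED = {"label"}      # hypothetical: known, but dropped with their children

def read(elem):
    if elem.tag in IGNORED:
        return []  # ignore: recursion stops, nothing reaches the output
    if elem.tag in HANDLED:
        return [(HANDLED[elem.tag], "".join(elem.itertext()).strip())]
    # Skip: no case for this element, so fall through to its contents.
    out = []
    if elem.text and elem.text.strip():
        out.append(("Text", elem.text.strip()))
    for child in elem:
        out.extend(read(child))
        if child.tail and child.tail.strip():
            out.append(("Text", child.tail.strip()))
    return out

doc = ET.fromstring(
    "<wrapper><label>1</label><p>Known child</p>"
    "<mystery>stray text</mystery></wrapper>"
)
result = read(doc)
print(result)
```

In this model the ignored `label` vanishes entirely, the known `p` inside the skipped `wrapper` is parsed as usual, and the text-only `mystery` element leaks its text into the output, which is the kind of case discussed in #8438.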

I believe your definitions of marked-up text vs. XML data are examples of how the HTML reader itself skips and ignores elements, which does not map onto how the JATS reader does it.

castedo commented 1 year ago

Ignoring an element ... parser not to do anything with it, effectively stopping the recursion to children, and erasing it from the AST, and thus from the output. Needless to say this should be done only when you are absolutely sure of what you are doing.

I agree with this under JATS <article><body>, similar to <html><body> which is marked-up text.

I do not agree with this under JATS <article><front> which, in my opinion, should be handled more like <html><head>. IMHO when pandoc is under <article><front> it should ignore by default and only skip (and parse children) when "you are absolutely sure you know what you are doing".

At the moment pandoc seems to assume it should parse <article><front> like marked-up text. It skips (and parses children) rather than ignore unrecognized subelements and then outputs a bunch of stuff into the AST that is not helpful. In this case I wish pandoc would just omit attempting to extract text from these unrecognized subelements. I can file a separate issue with more details on this if folks want.

The main reason I've been chiming in here is to caution anybody who might be thinking that parsing arbitrary XML data under <article><front> and having it show up in the pandoc AST is a good idea. I think good advice is to use a full XML parser that exposes an XML object model for <article><front>, rather than pandoc with the pandoc AST, for arbitrary XML data.

That said, I do see logic in this raw_xml extension suggestion when parsing <article><body> which is marked-up text.

kamoe commented 1 year ago

Just to clarify, in JATS <processing-meta> does not occur inside <front>. So any specific issue concerning the content of <processing-meta>, should be treated separately from <front>.

Now, <front> has three types of children: <article-meta>, <journal-meta>, and combinations of <def-list>, <list>, <ack>, <bio>, <fn-group>, <glossary>, and <notes>. As of today, the two metadata children are dealt with directly with their own case, and treated as what they are, metadata, that means, they are sent to the AST, but as metadata, not as output text:

https://github.com/jgm/pandoc/blob/16f28ef5e945f3be14e05afb7d91f8adca18e49a/src/Text/Pandoc/Readers/JATS.hs#L182-L183

Regarding the other seven possible children, they used to be "unknown" children until the solution for https://github.com/jgm/pandoc/issues/8718, which incorporates specific cases for them. So in the near future, no child of <front> will be an unknown element, and skipping it to parse its children will be a fully controlled operation.

The question is whether these seven children should be excluded from the output (treated as metadata?) when they occur inside <front>. I am looking at elements like <glossary> and <def-list>, which seem to be important to display. The answer should guide whether we need to further customize the handling of these seven children, or whether the standard section-like printable behavior defined in https://github.com/jgm/pandoc/issues/8718 would suffice. And that's material for a separate post.

What I'm getting at with this is, given that we have specified the behaviour of all children of <front>, i.e. none of them are "unknown", @castedo, do you feel comfortable with the original suggestion of using the rawBlock / rawInline tag for all "unknown" JATS elements?

castedo commented 1 year ago

Just to clarify, in JATS <processing-meta> does not occur inside <front>. So any specific issue concerning the content of <processing-meta>, should be treated separately from <front>.

Yep, I agree.

The question is, if it is the case that these seven children should not be part of the output (should be treated as metadata?) if they occur inside <front>. I am looking at elements like <glossary> and <def-list>, which seem to be important to display?

I do not think it is important that pandoc do any processing or handling of JATS-specific metadata. I think it is fine for pandoc to completely ignore and not output, even as metadata, anything that is JATS-specific. I think the only metadata that is important for pandoc to handle is highly generic, universal, simple document metadata that one would find in a wide assortment of other non-JATS formats. I raised the topic of pandoc metadata representing JATS metadata in #8359, which has been closed to the satisfaction of jgm and me.

When I call pandoc from https://gitlab.com/perm.pub/epijats the only pandoc variables I use are the ones for marked-up text, namely $abstract$, $title$ and $body$, and I do not use pandoc for anything else. For all other JATS metadata I use an XML parser. My advice to others is to use pandoc only for marked-up text and not to extract JATS-specific metadata. I advise using an XML parser for JATS-specific metadata that is not marked-up text.

What I'm getting at with this is, given that we have specified the behaviour of all children of <front>, i.e. none of them are "unknown", @castedo, do you feel comfortable with the original suggestion of using the rawBlock / rawInline tag for all "unknown" JATS elements?

I think my answer is more or less no, if I roughly understand the gist of the question. I would not use, nor would I advise others to use, pandoc to extract unknown JATS elements at all, with or without rawBlock / rawInline tags.

As described in #8359 there is kind-of generic document data like multiple document dates which pandoc almost handles properly but not well enough to work for JATS. Even for JATS XML that pandoc "knows" I do not recommend using pandoc for extracting JATS specific date information.

Similarly I do not recommend using the pandoc $abstract$ due to issue #8015. My advice is to use an XML parser to extract the contents of <article ...><front><article-meta><abstract> and then pass it to pandoc as just marked-up body text with options to lower the heading levels (when converting to HTML).
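One way this workaround might look in Python with the standard library is sketched below. The helper `abstract_as_body` is hypothetical, the input is assumed to have no XML namespace on `<article>`, and the pandoc command is only assembled, not run; `--shift-heading-level-by` is pandoc's option for moving heading levels, and the value 1 is an arbitrary choice here.

```python
import xml.etree.ElementTree as ET

def abstract_as_body(jats_xml):
    """Re-wrap the children of <abstract> as the <body> of a bare article."""
    root = ET.fromstring(jats_xml)
    abstract = root.find("./front/article-meta/abstract")
    if abstract is None:
        return None
    article = ET.Element("article")
    body = ET.SubElement(article, "body")
    body.extend(list(abstract))  # move the abstract's sections/paragraphs
    return ET.tostring(article, encoding="unicode")

sample = (
    "<article><front><article-meta><abstract>"
    "<sec><title>Background</title><p>Text.</p></sec>"
    "</abstract></article-meta></front>"
    "<body><p>Main.</p></body></article>"
)
body_only = abstract_as_body(sample)
print(body_only)

# Assumed invocation: feed `body_only` to pandoc on stdin and lower headings.
pandoc_cmd = ["pandoc", "--from=jats", "--to=html", "--shift-heading-level-by=1"]
```

Whether pandoc accepts a front-less `<article>` exactly as built here is an assumption worth checking against real input before relying on it.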

If anybody would like me to file more bugs on more specific ways that pandoc fails to parse "known" JATS specific metadata let me know. I haven't bothered because I consider using pandoc to extract JATS specific metadata a bad idea.

kamoe commented 1 year ago

I think my answer is more or less no, if I roughly understand the gist of the question. I would not use, nor would I advise others to use, pandoc to extract unknown JATS elements at all, with or without rawBlock / rawInline tags.

I see we stand on opposite sides of the fence here. I completely understand that, for now, some use cases are better addressed with other tools, but I also believe in gradually improving Pandoc.

Similarly I do not recommend using the pandoc $abstract$ due to issue #8015. My advice is to use an XML parser to extract the contents of <article ...><front><article-meta><abstract> and then pass it to pandoc as just marked-up body text with options to lower the heading levels (when converting to HTML).

I have suggested a solution in https://github.com/jgm/pandoc/issues/8015#issuecomment-1557406130, in case that is useful.

If anybody would like me to file more bugs on more specific ways that pandoc fails to parse "known" JATS specific metadata let me know. I haven't bothered because I consider using pandoc to extract JATS specific metadata a bad idea.

I would be EXTREMELY interested. Please go ahead, I'll be very happy to contribute my thoughts and suggestions for solutions, if any. I am aware Pandoc is not perfect and more than a few use cases are out of scope, but I also believe there is easy room for improvement. Understanding those issues would be a very good first step.

castedo commented 1 year ago

I would not use nor would I advise others to extract unknown JATS elements via pandoc at all

I see we stand on opposite sides of the fence here.

Depends which JATS fence we're talking about. :sweat_smile:

I am dreaming of a future "Baseprint JATS" (https://github.com/singlesourcepub/baseprints) format. We are both on the same side of the Baseprint JATS fence. We would both love to see pandoc be able to handle this without the need of a separate XML parser.

At the other extreme are any number of "proprietary JATS XML" formats. For illustration, take whatever JATS XML Scholastica's proprietary tools generate for the Spartan Medical Research Journal. I doubt pandoc will ever catch up to all the weird hacks that many journals or their proprietary vendors will slap into JATS XML with zero interest in sharing outside their proprietary business.

But the really interesting "JATS fences" are the different flavors of the millions of JATS XML files in the PMC Open Access Subset. There's a whole ecosystem of different species of JATS there. We probably stand on opposite sides of a few JATS fences there.

I would be EXTREMELY interested. Please go ahead, I'll be very happy to contribute my thoughts and suggestions for solutions, if any.

OK, I will file issues on Baseprint JATS metadata that https://gitlab.com/perm.pub/epijats handles via an XML parser but pandoc does not. I would love it if pandoc could make an XML parser unnecessary in epijats. This seems realistic.

One issue that comes to mind I recently filed: #8847. It is an unrecognized XML attribute in marked-up text in the body. And it's "pandoc JATS". In other words, pandoc doesn't recognize an XML attribute it itself generates, for marked-up text. It's not even JATS specific metadata.

castedo commented 1 year ago

To reduce noise on this issue, I'll add the issues pandoc has with parsing JATS metadata to #8359. There I also include a (hopefully) handy list of names for various JATS dialects. I suspect there is a lot of ambiguity and confusion when we throw around the word JATS without really knowing which dialect we are talking about.

It's worth noting that, I believe, this raw_xml extension would not have helped with the issue I just created (#8865). The problem with #8865 is not the existence of an XML element that pandoc does not know. The problem is that pandoc doesn't handle a well-known JATS XML element (pub-date) correctly at all for PMC JATS.