jgm / pandoc

Universal markup converter
https://pandoc.org

BITS reader #7740

Open kamoe opened 2 years ago

kamoe commented 2 years ago

New BITS reader

Support for BITS XML, the book extension of JATS XML.

As part of an academic project, I am exploring ways to develop a tool to transform BITS XML into DOCX. This is relevant for academic book publishing, where XML archives of previous editions need to be transformed into DOCX so that authors can work on the new edition. This is a recurrent scenario, and academic publishers today spend considerable time and money on third-party conversions that could easily and efficiently be handled in house.

Since this is a scheduled project with time and deadlines assigned to it (full or partial completion by September 2022 at the latest), I will develop a version of a full or partial tool.

As per recent discussion (https://groups.google.com/g/pandoc-discuss/c/E5J9-qevSEk) this seems to be a relevant and welcome addition to Pandoc.

Alternatives

I have explored OxGarage and transpect as well, and also the option of a completely standalone Java tool developed from scratch. A pandoc BITS reader (and later a pandoc BITS writer) seems to be the easiest and most straightforward solution as of now.

jgm commented 2 years ago

If BITS is an extension of JATS, then it might be good to explore developing this capacity as a modification of the current JATS writer, rather than a new module. (That avoids lots of duplicated code.) Note that the JATS writer already exports several functions for different JATS variants; the same strategy could be used, perhaps, for BITS?

(Just to be clear, I wouldn't want to merge a separate BITS module if BITS is too similar to JATS; that just makes maintenance difficult going forward.)

kamoe commented 2 years ago

Absolutely, the question of reusing JATS code as much as possible is very relevant. I've been looking into this today. The thing is, I realise we cannot say that any BITS document is also a JATS document, and in that sense I think we need to still have two separate readers, if that makes sense? BITS content models do borrow from JATS models, and also expand in other ways. I am getting familiar with the code to make sense of the best ways to model this.

jgm commented 2 years ago

If BITS is strictly an extension of JATS, then they could be handled in the same reader. The reader could have something in State, for example, that tells it whether to allow BITS extensions. It could export a separate function readBits that enables this.

kamoe commented 2 years ago

If by extension we mean that a JATS XML document should pass a validation against a BITS DTD or Schema, then no, BITS is not strictly an extension of JATS. JATS elements are not a subset of BITS elements. The valid root elements are different, to start with. BITS just borrows from some JATS content models, that's all.

The NCBI describes BITS as a "JATS extension", but it's more of an intersection of content models, really.

jgm commented 2 years ago

Even if BITS isn't strictly a superset of JATS, it still might make sense to implement it as a variant in the JATS module -- it depends on the extent of the divergences, I guess.

Another alternative would be to extract some of the common code into an internal module.

kamoe commented 1 year ago

From what I now understand of the JATS reader (15 months after my first comment!), it seems to me that the easiest thing to do would be to just enhance the existing JATS reader (to also support BITS). Just by adding the cases:

    "collection-meta" -> parseMetadata e
    "book-meta" -> parseMetadata e

to the parseBlock function, the reader would already start supporting BITS metadata elements without much further effort. Of course, that is just the beginning, and it will be necessary to add more than a few other cases, and further functions to fully support all essential elements, but as a quick start, that is what I would do... then I would see if it makes sense to split into two readers/common modules later on?
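To illustrate the idea, here is a minimal, self-contained sketch of name-based dispatch with the two extra cases added. It does not use pandoc's actual types: `Element`, `Parsed`, `parseDocument`, and the bodies of `parseMetadata` and `parseBlock` are hypothetical stand-ins, and the `"article-meta"` case is only an assumed example of an already-handled element.

```haskell
import Data.Maybe (mapMaybe)

-- Hypothetical stand-in for an XML element: a tag name plus its text content.
data Element = Element { elName :: String, elText :: String }

-- Hypothetical result type; the real reader builds pandoc blocks and metadata.
data Parsed
  = Metadata String   -- content routed to the metadata handler
  | Paragraph String  -- content treated as body text
  deriving (Eq, Show)

-- Stand-in for the reader's metadata handler.
parseMetadata :: Element -> Parsed
parseMetadata = Metadata . elText

-- Dispatch on the element name. "article-meta" stands for an assumed
-- pre-existing case; the next two alternatives are the BITS additions.
parseBlock :: Element -> Maybe Parsed
parseBlock e =
  case elName e of
    "article-meta"    -> Just (parseMetadata e)
    "collection-meta" -> Just (parseMetadata e)
    "book-meta"       -> Just (parseMetadata e)
    "p"               -> Just (Paragraph (elText e))
    _                 -> Nothing  -- unknown elements are skipped in this sketch

-- Parse a flat list of elements, dropping the ones that are not recognised.
parseDocument :: [Element] -> [Parsed]
parseDocument = mapMaybe parseBlock
```

In this toy version, `parseDocument [Element "book-meta" "Title", Element "p" "Hello"]` routes the first element to the metadata handler and the second to body text, which is the behaviour the two new cases are meant to add.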

jgm commented 1 year ago

If BITS is basically just JATS plus a few extra elements, then I think that's definitely the way to go.

One way to handle this is to have a parameter in the JATSState that controls the "variant" -- settings could be BITS and JATS.

The reader could check this variant in places where the behavior would diverge.

The module could then export two functions, readJATS and readBITS, which set this state parameter differently but otherwise do the same thing.
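As a rough sketch of that shape (not pandoc's actual API), the following toy module keeps the variant in a state record and exposes two entry points that differ only in how they set it. `ReaderState`, `readWith`, and the element lists are hypothetical, and the toy `readJATS`/`readBITS` work on plain lists of element names rather than on real documents.

```haskell
-- Which flavour of input the reader should accept.
data Variant = JATS | BITS
  deriving (Eq, Show)

-- Hypothetical, cut-down reader state; the real JATSState has more fields.
newtype ReaderState = ReaderState { stVariant :: Variant }

-- Element names used for illustration only.
commonElements, bitsOnlyElements :: [String]
commonElements   = ["sec", "p", "title"]
bitsOnlyElements = ["book-meta", "collection-meta", "book-part"]

-- Shared implementation: keeps the element names the current variant accepts.
-- The variant is consulted only where the behaviour diverges.
readWith :: ReaderState -> [String] -> [String]
readWith st = filter accepted
  where
    accepted name
      | name `elem` commonElements   = True
      | name `elem` bitsOnlyElements = stVariant st == BITS
      | otherwise                    = False

-- Two exported entry points that set the variant and otherwise share code.
readJATS, readBITS :: [String] -> [String]
readJATS = readWith (ReaderState JATS)
readBITS = readWith (ReaderState BITS)
```

Here `readJATS ["sec", "book-meta", "p"]` drops `"book-meta"`, while `readBITS` on the same input keeps it; everything else goes through the same code path.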

kamoe commented 1 year ago

Makes sense. Just to summarize and double check the proposed approach:

1) Modify the existing JATSState to add a "variant" parameter with value BITS or JATS
2) Modify the existing readJATS function to set the new variant to JATS
3) Write a new readBITS function, derived from readJATS, that sets the new variant to BITS
4) Modify the parseBlock function to consider additional cases to accommodate completely new BITS elements (that never occur in JATS)
5) Modify the parseBlock function to check the variant value in those cases where behavior diverges for BITS, and provide alternative/additional behavior for those cases

Am I getting this right?

kamoe commented 1 year ago

Actually, I just realized there is already a boolean "variant" parameter in the JATSState: jatsBook:

https://github.com/jgm/pandoc/blob/714be9365bee36d47a8d8456023b5e58bb547be1/src/Text/Pandoc/Readers/JATS.hs#L53-L60

Given that the JATS reader was written based on the DocBook reader, and that the DocBook spec supports both articles and books, it makes sense that a boolean variant existed there (called dbBook).

In the DocBook reader, when the reader encounters book-only content, this variant is set to true:

https://github.com/jgm/pandoc/blob/509cb9b8feae6798cb77bc35637297e9301d682e/src/Text/Pandoc/Readers/DocBook.hs#L894-L895

https://github.com/jgm/pandoc/blob/509cb9b8feae6798cb77bc35637297e9301d682e/src/Text/Pandoc/Readers/DocBook.hs#L960-L961

And when dealing with article content, it is set to false:

https://github.com/jgm/pandoc/blob/509cb9b8feae6798cb77bc35637297e9301d682e/src/Text/Pandoc/Readers/DocBook.hs#L958-L959

It seems that dbBook was copied as jatsBook for JATS, but it is never used and never updated to true. This makes sense, since JATS only supports articles. The first two lines of the sect function thus have no purpose; n' is always n:

https://github.com/jgm/pandoc/blob/714be9365bee36d47a8d8456023b5e58bb547be1/src/Text/Pandoc/Readers/JATS.hs#L324-L325

Until now. Seems like we could use this to start to model BITS for jatsBook = true...
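A minimal sketch of how such a flag could drive section levels, under loud assumptions: the `JATSState` below is a cut-down stand-in with only the one field, `headerLevel` uses a simplified rule rather than the linked code, and `startState` (switching to book mode on a `book` root element) is a hypothetical way of setting the flag.

```haskell
-- Hypothetical, cut-down state holding only the flag discussed above.
newtype JATSState = JATSState { jatsBook :: Bool }

-- Simplified level rule (an assumption, not the linked code): in book mode
-- every section is nested one level deeper than in an article.
headerLevel :: JATSState -> Int -> Int
headerLevel st n = if jatsBook st then n + 1 else n

-- Hypothetical rule for setting the flag: a "book" root element switches the
-- reader to book mode; any other root (e.g. "article") leaves it off.
startState :: String -> JATSState
startState rootName = JATSState { jatsBook = rootName == "book" }
```

With this rule, `headerLevel (startState "article") 1` stays at 1, while `headerLevel (startState "book") 1` becomes 2, so the flag only has an effect once something actually sets it to true.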

kamoe commented 1 year ago

@jgm After a thorough look, I believe it is possible to have a minimal BITS-enabled reader purely by adding a few lines to the JATS reader. I think this is the simplest way to do it.

The main point is to make use of the existing jatsBook variant in the JATSState, as I explain here.

I created a first draft for you to have an idea of what I mean here: https://github.com/jgm/pandoc/pull/9016

This should already produce a decent AST from a BITS document, but it is by no means definitive. I would add a few more lines to account for a few additional BITS-only elements, via alternative treatment relying on the jatsBook boolean value. If we find that this starts to diverge significantly down the road, then we could envisage the creation of a separate BITS.hs file, but as it is now, I think it makes sense to have both formats incorporated in the one JATS.hs reader.

What do you think?

jgm commented 1 year ago

Agreed, this plan makes sense.

kamoe commented 11 months ago

Update: I have written a new clean-slate PR here. This incorporates the minimal required BITS behaviours for an equivalent BITS reader (equivalent coverage to JATS, same limitations, etc.). This should still be consistent with older JATS behaviours, but I cannot guarantee that until I have finished the unit tests I'd like to add (hence still marking it as a draft). I will try to complete those this week, and then I think this should be in good shape for a first review.

kamoe commented 11 months ago

@jgm All unit tests are finished and passing. See my latest comment on the PR.