metanorma / metanorma-standoc

Metanorma for Standoc documents
BSD 2-Clause "Simplified" License

Adopted documents: support "stitch" command for "alongside wrap" of two languages #742

Open ronaldtse opened 1 year ago

ronaldtse commented 1 year ago

This issue is to support "alongside wrap" detailed at https://github.com/metanorma/metanorma-bsi/issues/2#issuecomment-852269964

Alongside wrap. A wraps B. B is unmodified. A provides additional content C where C corresponds to a transformation of B (e.g. a translation of B). A provides additional content outside of B and C.

The aligning/stitching command is called stitch below. This is all speculative so the actual commands can differ in real usage.

Alongside wrap

en-iso-44001-english.adoc

= EN ISO 44001
:lang: english

.EN Foreword
...

en-iso-44001-estonian.adoc

= EN ISO 44001
:lang: estonian

.EN Foreword in Estonian
...

evs-en-iso-44001.adoc

= EVS EN ISO 44001

.EVS Foreword
...

stitch::[en-iso-44001-english.adoc,en-iso-44001-estonian.adoc]

Images (13 screenshots, captions only):

Cover
National foreword with side-by-side translation
European cover
TOC side by side
European foreword side by side
Content side by side
Annex with Estonian table first
Table continued in Estonian
English table
Bibliography with heading in two languages, content in English only
Index in Estonian
Index in English
Back cover

Originally posted by @ronaldtse in https://github.com/metanorma/metanorma-bsi/issues/2#issuecomment-852269964

opoudjis commented 1 year ago

So, stitch::[] is a minimally thought-out request to embed two documents simultaneously, where the second document is a translation of the first, and the alignment between the two is to be realised by magic.

It will not be realised by magic. Of the features specified in https://github.com/metanorma/metanorma-standoc/issues/420, the multilingual-rendering attributes will still need to be inserted into the two stitched documents, as will :align-cross-elements:. These documents will be marked up for bilingual alignment. At most, the stitching will assume that the clause structure of the two documents is identical, and where it is, it will insert tags in the two corresponding clauses to line them up explicitly.

ronaldtse commented 1 year ago

"Magic" is defined here as creating an element correspondence.

The element correspondence between two languages is either manually encoded (e.g. anchors "id1@en" matches "id2@jp") or automatically matched according to sequence.
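As a rough illustration of the two modes above, here is a minimal Ruby sketch that honours manually encoded matches first and pairs everything else by document order. The helper name `pair_elements` and the plain-hash form of the manual mapping are assumptions made for this example; the thread does not define how the "id1@en" / "id2@jp" anchors would actually be stored.

```ruby
# Hypothetical helper: pair element ids across two language versions.
# `manual` maps an id in the first document to an id in the second
# (e.g. { "id1" => "id2" }); everything else is paired in sequence.
def pair_elements(first_ids, second_ids, manual = {})
  claimed = manual.values
  # ids in the second document that are still free for sequence matching
  remaining = second_ids.reject { |id| claimed.include?(id) }
  pairs = first_ids.map do |id|
    if manual.key?(id)
      [id, manual[id]]
    else
      [id, remaining.shift] # nil once the second document runs out
    end
  end
  # leftover elements of the second document have no counterpart
  remaining.each { |id| pairs << [nil, id] }
  pairs
end

en = %w[scope terms id1 annex-a]
jp = %w[scope terms id2 annex-a]
p pair_elements(en, jp, { "id1" => "id2" })
```

Pure sequence matching breaks as soon as one language has an extra element, which is where a smarter alignment is needed.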

There are cases of exceptions such as:

opoudjis commented 1 year ago

I will get some bioinformatics algorithm or other to do a best-case match between the two sequences.
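The thread does not say which algorithm was meant. Needleman-Wunsch global alignment is one standard choice from bioinformatics, so the sketch below uses it purely as an example. It aligns two sequences of element signatures (here just element names, which is an assumption) and marks gaps with nil; the scoring values are arbitrary defaults.

```ruby
# Needleman-Wunsch global alignment of two sequences.
# Returns pairs [a_item, b_item]; nil marks a gap on that side.
def align(a, b, match: 1, mismatch: -1, gap: -1)
  rows = a.size + 1
  cols = b.size + 1
  score = Array.new(rows) { Array.new(cols, 0) }
  (1...rows).each { |i| score[i][0] = i * gap }
  (1...cols).each { |j| score[0][j] = j * gap }

  # fill the score matrix
  (1...rows).each do |i|
    (1...cols).each do |j|
      diag = score[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch)
      score[i][j] = [diag, score[i - 1][j] + gap, score[i][j - 1] + gap].max
    end
  end

  # trace back from the bottom-right corner
  pairs = []
  i = a.size
  j = b.size
  while i > 0 || j > 0
    step = i > 0 && j > 0 ? (a[i - 1] == b[j - 1] ? match : mismatch) : nil
    if step && score[i][j] == score[i - 1][j - 1] + step
      pairs.unshift([a[i - 1], b[j - 1]])
      i -= 1
      j -= 1
    elsif i > 0 && score[i][j] == score[i - 1][j] + gap
      pairs.unshift([a[i - 1], nil])
      i -= 1
    else
      pairs.unshift([nil, b[j - 1]])
      j -= 1
    end
  end
  pairs
end

english  = %w[clause p p table p]
estonian = %w[clause p table p]
align(english, estonian).each { |pair| p pair }
```

In this example the English side has one more paragraph than the Estonian side, so one English `p` ends up paired with nil while the clause and table still line up.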

opoudjis commented 1 year ago

This is going to have to be processed as a collection:

What we actually want here is a preprocessing mode in collection processing, which

opoudjis commented 1 year ago

It appears that, to date, multilingual-rendering has been implemented in rendering only for JCGM.

opoudjis commented 1 year ago

I am not very enthusiastic about this, but I'm going to implement this in the metanorma gem as a postprocessing of Presentation XML. There is code from @Intelligent2013 in the JCGM XSLT to handle these tags, but I need to generalise this to HTML and DOC anyway. I will be following his code to arrange elements, including his use of the cross-align element. In fact, I'm going to try to use his XSLT in preprocessing.

opoudjis commented 1 year ago

In collections processing, we really need to do without generating PDF of the individual documents; they will not be reused, and are just dead time for document compilation.

opoudjis commented 1 year ago

The JCGM XSLT has a model of iterating through the first document in the collection, as a master, and all other documents in the collection, as (slaves) ahem, dependents. We cannot do that, because the first document is likeliest to be a preface: we will need markup in the manifest on the status of each document with bilingual alignment.

Elements to be aligned are rendered inside <cross-align/>, which is populated as an XSL-FO table in JCGM. We will retain that, and process cross-align in HTML (and DOC?).

ronaldtse commented 1 year ago

@opoudjis the only true JCGM bilingual document is JCGM 200:

JCGM 100 is in both English and French but they are published separately.

There is ISO 2533 ADD 2, which is trilingual and presented in three columns:

Screenshot 2023-04-12 at 4 03 33 PM

But it is not yet encoded:

opoudjis commented 1 year ago

I am making up a bilingual out of JCGM 100 at the moment, to see how far I can get in reusing Alex's XSLT, and I will be tinkering with that document.

The JCGM 200 document alternating between one and two columns is irritating, but sadly realistic.

opoudjis commented 1 year ago

The excerpted XSLT works (though it is generating an XSL-FO table, and not the <cross-align><align-cell></align-cell><align-cell></align-cell></cross-align> I want to end up with). Its performance is abysmal, because libxslt's node-set() is so much slower than xalan:nodeset(). But this is not a concern for me, as I will simply be running this in Ruby with Nokogiri, and each nodeset is in fact a single document in Nokogiri, which I will keep around as a variable.
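To make the target structure concrete, here is a small Ruby sketch that wraps already-paired fragments in cross-align/align-cell. It is simplified on purpose: the cells hold plain text rather than Presentation XML elements, it builds strings instead of using Nokogiri, and the names `cross_align` and `escape` are invented for this example rather than taken from the metanorma code.

```ruby
# Hypothetical helper: wrap paired text fragments in
# <cross-align><align-cell/><align-cell/></cross-align>.
ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }.freeze

# Minimal XML text escaping; nil becomes an empty string.
def escape(text)
  text.to_s.gsub(/[&<>]/, ESCAPES)
end

def cross_align(pairs)
  pairs.map do |first, second|
    cells = [first, second].map { |t| "<align-cell>#{escape(t)}</align-cell>" }
    "<cross-align>#{cells.join}</cross-align>"
  end.join("\n")
end

puts cross_align([["Foreword", "Eessõna"], ["Scope & field", nil]])
```

A fragment with no counterpart in the other language gets an empty align-cell, so every cross-align keeps two cells and can later be mapped onto parallel columns.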

opoudjis commented 1 year ago

Performance is better but not great: 47 sec for JCGM 100 (12 MB) to move text in place in parallel columns. Parking code in PR, not yet rendered.

Both I and @Intelligent2013 will need to render Presentation XML cross-align/align-cell into parallel table cells. I will need to do so in a single HTML file (since that is what parallel columns end up requiring).

opoudjis commented 1 year ago

For <cross-align> rendering to work in HTML, we are finally going to have to bite the bullet and parse whatever is in Presentation XML in sequence, rather than by query. cross-align takes priority over clauses: it contains them.

opoudjis commented 1 year ago

The proof-of-concept collection is JCGM 100 EN + FR. Am getting bi-column output, but it needs a lot of care, and parsing now needs to be a lot more dumb in just spitting out what it receives in Presentation XML, rather than being opinionated.

Archive.zip