Support workflow to facilitate translation of Metanorma documents

ronaldtse commented 2 years ago

OGC wishes to produce a Japanese translation of the CityGML 2.0 document encoded in Metanorma (metanorma/ogc-citygml2#1).

I thought about it, and the following workflow makes the most sense. The challenge is to translate only the "content", not the syntax.

1. Metanorma parses the English source into a document tree.
2. Use the Google Translate API (Ruby client) to translate while respecting structure.
   - i.e. individually translate sections / sentences along structural boundaries (without breaking links, etc.).
3. Create the Metanorma source files for the translated document, perhaps in an interleaved or structurally identical manner, to produce a side-by-side document.

This is a preliminary workflow that nonetheless requires some thinking to realize.

opoudjis commented 2 years ago

Professional translators do start with automated tools, it is true; but (a) they don't finish with them, and (b) they want a workflow where they can see both. They're going to want to use their workflow tools; you can't just dump gibberish Japanese in an XML document and have them sort it out later in ASCII.

We can, per what you suggest, insert Japanese for cleanup as duplications of context clauses. But this is a very big ask, and you should be talking to the professional translator who's actually going to do this, to work out reasonable tool support.

opoudjis commented 2 years ago

And someone else is going to have to work on this.

The text interleaving would be done by duplicating tags, using the @lang attribute and (for things like titles, which should only appear once) the lang:[] macro. (https://www.metanorma.org/author/topics/languages/)
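To make the interleaving idea concrete, here is a minimal Python sketch that works on a generic XML tree rather than on real Metanorma input. The element names, the `interleave` and `fake_translate` helpers, and the choice to put `lang` directly on paragraph elements are all assumptions for illustration; the actual attribute placement is whatever the Metanorma documentation linked above specifies. Only the leading text of each paragraph is handled, so inline markup is out of scope here.

```python
import copy
import xml.etree.ElementTree as ET


def fake_translate(text, target):
    # Hypothetical stand-in for a real machine translation call.
    return f"[{target}] {text}"


def interleave(root, source="en", target="ja", tags=("p",)):
    """Follow each matching element with a copy tagged with the target language."""
    for parent in list(root.iter()):
        # Walk children back to front so insertions do not shift pending indices.
        for index, child in reversed(list(enumerate(parent))):
            if child.tag not in tags or child.get("lang"):
                continue
            clone = copy.deepcopy(child)
            child.set("lang", source)
            clone.set("lang", target)
            if clone.text and clone.text.strip():
                clone.text = fake_translate(clone.text, target)
            parent.insert(index + 1, clone)
    return root


doc = ET.fromstring("<clause><title>Scope</title><p>Hello</p><p>World</p></clause>")
interleave(doc)
print(ET.tostring(doc, encoding="unicode"))
```

The title is left alone in this sketch because, as noted above, elements that should appear only once need a different mechanism.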

ronaldtse commented 2 years ago

> They're going to want to use their workflow tools

"Professional" or not, this is something that is needed by someone who is translating the language. The label "professional" is a distraction.

In this case here, we are talking about standards authoring. The workflow tool for authoring standards documents by "professional standards authors" is Metanorma.

The Japanese author should be able to use Metanorma to:

1. Start with machine-translation without losing existing structure
2. Refine the translation
3. Publish the Japanese translation

opoudjis commented 2 years ago

Ronald, you are not understanding what I am saying.

Professional translation is emphatically NOT a distraction, if the workflow is of a translator using machine translation as a starting point to do bulk translation. Such translators will use a translation workbench tool, such as (to pick the first instance I've googled) https://www.memsource.com/translation-software/. Such a tool will include memorised custom equivalents that the translator has keyed in, templates, technical dictionaries, and whatever else the translator has put in place to make their life easier.

A professional translator's environment is going to be that workbench. That is the environment they are going to work in. Metanorma is NOT a translation environment, it is an authoring tool, and their translation environment is going to have to integrate with Metanorma, in some way you will need to work out.

What you are proposing is to drop machine translation into Metanorma XML outside of the translator's workbench tool, and make them do all their refinements manually. I am telling you, professional translators will not find that adequate: you will be taking them away from their shortcuts and their technical dictionaries, which are normally integrated into their editor.

So you will need to investigate further how translators go about translating marked-up documents while preserving markup in their tools. I think it is quite likely that this is a solved problem for their workbench tools; and if it is a solved problem, that is all the more reason for us to use the existing tools' way of solving the problem, rather than imposing our own solution on them. I think us doing our own solution is going to duplicate existing effort, and do a bad job of it, producing something that such translators cannot use.

And that is why I make a point of saying "professional" translators, translators that routinely use translation workbench tools. A non-professional translator, a subject matter expert for example, will quite happily follow the workflow you propose, of refining a machine translation manually, since they don't have existing workbench tools; they'll be quite happy, for that matter, to eyeball original and machine translated target in two separate windows of an editor, rather than a more integrated environment, where they could do things like mouseover words to get dictionary lookup. And for all I know, OGC may be translating their documents in such an ad hoc way.

But if an SDO employs a professional translator, using translation tools, to do translations, then Metanorma will need to integrate with their workflow. And:

"Start with machine-translation without losing existing structure" --- their tools likely already can do that.
"Refine the translation" --- they will want to keep doing that within their existing environment.

opoudjis commented 2 years ago

OK, given that the workflow envisioned is not one of a professional translator using a workbench:

- Automated translation of XML source may preserve inline tags and attributes in formatting; that's not guaranteed, and if it does not, we may need to postprocess the XML. It would be much simpler to translate the Asciidoctor source.
- Automated translation of Asciidoctor source will still potentially distort inline markup.
- We should do automated translations one block at a time. (Blocks are delimited in Asciidoctor by blank lines.) By default, we cannot translate text in inline markup separately from its context: words marked up in boldface, for example, are still part of sentences.
- We should output the original in comments next to the translated block, so that the translator can see the original in place and fix things (both markup and text).
- Source code ([source]) should not be translated. This should include inline text in monospace.
- Anchors and cross-references should not be translated; we hope they will be untranslatable, but we can't assume that. So any altered <<x, and [[x]] need to be restored after automatic translation. (In <<x,y>>, the y should be translated: it is rendered text. In <<x,clause=n,y>>, the clause=n should not be translated: it is formatting.)
- Table cells need to be processed one cell at a time.
- Bibliographies should not be translated.
- Document headers should not be translated. (The title should be, but it's not worth trying to parse the document header.)
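As a rough illustration of several of the points above (block-at-a-time translation, the original kept in comments, source listings skipped, anchors and monospace restored), here is a minimal Python sketch. The `translate` callable is a hypothetical stand-in for whatever translation service is used, and the `mask`, `unmask` and `translate_asciidoc` helpers and the `{{n}}` placeholder format are my own assumptions. It does not handle table cells, bibliographies, document headers, or listings that contain blank lines.

```python
import re

# Spans that must survive translation untouched.
PROTECTED = re.compile(
    r"\[\[[^\]]+\]\]"               # [[anchor]]
    r"|<<[^,>]+(?:,\w+=[^,>]+)*"    # <<x and <<x,clause=n (not the rendered text y)
    r"|`[^`]+`"                     # inline monospace
)


def mask(text):
    """Swap protected spans for numbered placeholders; return text and spans."""
    kept = []

    def stash(match):
        kept.append(match.group(0))
        return "{{%d}}" % (len(kept) - 1)

    return PROTECTED.sub(stash, text), kept


def unmask(text, kept):
    """Put the protected spans back in place of their placeholders."""
    return re.sub(r"\{\{(\d+)\}\}", lambda m: kept[int(m.group(1))], text)


def translate_asciidoc(source, translate):
    out = []
    for block in re.split(r"\n{2,}", source.strip()):
        lines = block.splitlines()
        # Leave source listings untouched.
        if lines[0].startswith("[source") or lines[0].startswith("----"):
            out.append(block)
            continue
        masked, kept = mask(block)
        translated = unmask(translate(masked), kept)
        # Keep the original as AsciiDoc line comments above the translation.
        original = "\n".join("// " + line for line in lines)
        out.append(original + "\n" + translated)
    return "\n\n".join(out)


sample = "See <<scope,the scope clause>> and `foo()`.\n\n[source,ruby]\n----\nputs 1\n----"
# str.upper stands in for a translator so the effect is visible.
print(translate_asciidoc(sample, str.upper))
```

A real translation service may well alter the placeholders themselves, which is the failure mode the anchors point above warns about, so a check that every placeholder came back would be a sensible addition.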

ronaldtse commented 2 years ago

> Automated translation of XML source may preserve inline tags and attributes in formatting; that's not guaranteed, and if it does not, we may need to postprocess the XML. It would be much simpler to do the translation of asciidoctor source.

I respectfully disagree:

- We do not have a proper AsciiDoc parser that provides a parse tree suitable for translation. Any regex hack would just make the flow more fragile than it needs to be.
- The XML source is the only source of truth for the Metanorma document. Remember that the model-based standards code only unrolls content in the XML source, not the AsciiDoc source.

I.e., we should use the Metanorma Semantic XML for translation purposes.

ronaldtse commented 2 years ago

The talk about "professional translators" is irrelevant to our task at hand right now.

Here are the facts:

- The Japanese translation will be published in Metanorma.
- The English source document is available in Metanorma AsciiDoc and XML.
- The Japanese translator wishes for some machine translation assistance to start with.

We just have to do whatever is possible with these.

opoudjis commented 2 years ago

Google will skip HTML but not non-HTML XML markup (behaviour varies between languages).

Serialising the Asciidoctor parse tree into pseudo-HTML is itself a major venture, requiring a new parser, and the Asciidoctor parse tree cannot be relied on as stable.

The alternative is likely going to be quite lossy: source Asciidoctor > source Metanorma XML > source Metanorma Pseudo-HTML (substituting arbitrary HTML tags for Metanorma tags) > translated Metanorma Pseudo-HTML > translated Metanorma XML > translated Asciidoctor

Indeed, it'll be lossy enough that any translator is going to need to have two text windows side by side, source Asciidoctor, and output Asciidoctor --- and they're going to have to do a lot of repair of the latter copying from the former. If the document is clean (not much markup), this might be good enough. It's not a given that it will be.
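The tag substitution step in that chain (Metanorma XML to pseudo-HTML and back) could look something like the Python sketch below. It renames every element to `span` and parks the original tag name in a `data-mn` attribute, then reverses this after translation. The `data-mn` attribute and both helper names are hypothetical, the sample input has no XML namespace (real Metanorma XML does), and whether a given translation service leaves `span` elements and their attributes intact is exactly the part that is not guaranteed.

```python
import xml.etree.ElementTree as ET


def to_pseudo_html(root):
    """Rename every element to <span>, remembering the real tag in data-mn."""
    for el in root.iter():
        el.set("data-mn", el.tag)
        el.tag = "span"
    return ET.tostring(root, encoding="unicode")


def from_pseudo_html(markup):
    """Restore the original tag names from data-mn after translation."""
    root = ET.fromstring(markup)
    for el in root.iter():
        el.tag = el.attrib.pop("data-mn", el.tag)
    return root


source = '<p id="a">See <xref target="b">clause</xref> now.</p>'
pseudo = to_pseudo_html(ET.fromstring(source))
print(pseudo)  # this string is what would be sent to the translator
restored = from_pseudo_html(pseudo)
print(ET.tostring(restored, encoding="unicode"))
```

This only shows that the renaming itself is reversible when the markup comes back untouched; it says nothing about the losses in the other steps of the chain.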

In XML, the provisos above become:

- Content of <sourcecode> should not be translated.
- Content of <code> should not be translated.
- In the case of <xref>, content can be translated: the anchors and anchor cross-references will be segregated as markup.
- <td> and <th> content needs to be translated in isolation.
- <references> need not be translated.
- <bibdata> need not be translated (with the possible exception of <title> and <abstract>).
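A minimal Python sketch of how those XML provisos could drive the selection of what gets sent for translation. The `UNITS` set of block-level tags and the `units` helper are assumptions of mine, the sample document is invented and has no namespace, and inline `<code>` inside a selected unit would still need protecting at translation time.

```python
import xml.etree.ElementTree as ET

SKIP = {"sourcecode", "code", "references"}   # never descend into these
BIBDATA_KEEP = {"title", "abstract"}          # only these are candidates in <bibdata>
UNITS = {"p", "title", "td", "th"}            # hypothetical translation units elsewhere


def units(el, in_bibdata=False):
    """Yield elements to translate as a whole, each in isolation."""
    if el.tag in SKIP:
        return
    in_bibdata = in_bibdata or el.tag == "bibdata"
    wanted = BIBDATA_KEEP if in_bibdata else UNITS
    if el.tag in wanted:
        # Stop here so a table cell or paragraph is handled as one unit.
        yield el
        return
    for child in el:
        yield from units(child, in_bibdata)


doc = ET.fromstring(
    "<doc><bibdata><title>T</title><docidentifier>X 1</docidentifier></bibdata>"
    "<sections><clause><title>Scope</title><p>Text</p>"
    "<sourcecode>x = 1</sourcecode>"
    "<table><tr><th>H</th><td><p>C</p></td></tr></table></clause></sections>"
    "<references><bibitem><title>Ref</title></bibitem></references></doc>"
)
for unit in units(doc):
    print(unit.tag, "".join(unit.itertext()))
```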

opoudjis commented 2 years ago

Unassigning myself: I won't have time to do this, and I've outlined what needs doing.

ronaldtse commented 2 years ago

I found that LibreTranslate is a pretty good model that can be run locally.

opoudjis commented 11 months ago

DeepL also

ghobona commented 11 months ago

Discussed with OGC Staff on 2023-11-06.

More research needed before identifying a path forward.

opoudjis commented 7 months ago

Processing the input text in Asciidoctor format using coradoc is a more effective way forward.

ghobona commented 6 months ago

@opoudjis to look into providing an example from some prior work.

opoudjis commented 6 months ago

The work is the samples of metanorma-jis that we have done, just to show that we support i18n for Japanese. The documents are jis-z-5999 and jis-z-8301-2019. Gobe would like to show these to his Japanese colleagues as proof of concept, but only if they are public documents. @ronaldtse please clarify status of documents.

opoudjis commented 6 months ago

Compiled an OGC standard using the JIS flavour of Metanorma with Japanese language for metalanguage, and sent to @ghobona as proof of concept.

metanorma / metanorma-ogc

Support workflow to facilitate translation of Metanorma documents #349