Command to output the content in selected language

kwkwan commented 2 months ago

Assume I have content in both Japanese and English that looks like the following:

[format=asciidoc]

[lang=jp]
== Content

居住その他の目的をもって構築された建築物。

普通建物、堅ろう建物、普通無壁舎及び堅ろう無壁舎に区分する。

普通建物とは、3階未満の建物及び3階以上の木造等で建築された建物をいう。

堅ろう建物とは、鉄筋コンクリート等で建築された建物で、地上3階以上又は3階相当以上の高さのものやスタンドを備えた競技場をいう。

普通無壁舎とは、側壁のない建物、温室及び工場内の建物類似の構築物で、3階未満のものをいう。

堅ろう無壁舎とは、鉄筋コンクリート等で建築された側壁のない建物及び建物類似の構築物で、地上3階以上又は3階相当以上の高さのものをいう。

[.source]
<<gsi_ops,annex=7,付録7 公共測量標準図式>>

.bldg:Buildingの例
image::images/041.webp.png[]

LOD0からLOD3 までは、建築物の屋外の形状を表現する。
LOD4では、建築物の屋外の形状に加え、屋内の形状を表現する。

[lang=en]
== Content

Buildings constructed for residential and other purposes.

They are classified into ordinary buildings, reinforced buildings, ordinary open buildings, and reinforced open buildings.

An ordinary building refers to a building with less than 3 floors or a building constructed with wood or similar materials with 3 or more floors.

A reinforced building refers to a building constructed with reinforced concrete or similar materials, with a height of 3 floors or more or a stadium with stands.

An ordinary open building refers to a building without side walls, such as a greenhouse or a building inside a factory, with less than 3 floors.

A reinforced open building refers to a building without side walls, such as a building constructed with reinforced concrete or similar materials, with a height of 3 floors or more or a building similar to a stadium.

[.source]
<<gsi_ops, annex=7, Appendix 7 Public Survey Standard Diagram>>

.bldg: Example of a Building
image::images/041.webp.png[]

LOD0 to LOD3 represent the outdoor shape of the building.
LOD4 represents both the outdoor shape of the building and the indoor shape.

I would like to have a command to selectively output the content like:

content.output(lang: :jp)

ReesePlews commented 2 months ago

Hello @kwkwan, this looks very similar to the template I have been using to paste content into the "Notes property". What is == Content? Is it an AsciiDoc heading, or a special indicator that "content" is to follow?

Do we need to support the case where only one language is present? In that case, would the [lang=xx] indicator still be required? Just wondering.

ronaldtse commented 2 months ago

I discussed this syntax with @opoudjis a few days ago.

There are basically 2 ways we can embed AsciiDoc content in an external plain text field:

1. As an individual AsciiDoc document.
2. As an AsciiDoc "snippet" (i.e. without attributes and title).

In addition, the [format=asciidoc] header line is not supported in AsciiDoc. We currently don't have a plain text syntax that differentiates between AsciiDoc and other rich text formats. We need to come up with something.

Individual AsciiDoc document

Pros: self-contained, well-defined, therefore validate-able.
Cons: need to provide unnecessary content, such as title.

Sample:

= Content (title is unnecessary)

[lang=ja]
== Content (clause heading is unnecessary)

(Japanese content)

[lang=en]
== Content (clause heading is unnecessary)

(English content)

AsciiDoc snippet

Pros: No need to provide unnecessary content.
Cons: need a mechanism to define the supported grammar precisely, in order to support validation.

Sample:

[lang=ja]
--
(Japanese content)
--

[lang=en]
--
(English content)
--
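As a rough illustration of how the snippet form could be consumed, here is a minimal sketch in plain Ruby. It does not use Coradoc; `extract_lang_blocks` is a hypothetical helper, and it assumes each `[lang=xx]` line is immediately followed by an open block delimited by `--` lines, with no nested open blocks inside.

```ruby
# Hypothetical helper, not part of Coradoc: collects the body of each
# `[lang=xx]` open block (delimited by `--` lines), keyed by language code.
def extract_lang_blocks(text)
  blocks = Hash.new { |hash, key| hash[key] = [] }
  pending_lang = nil # language named by the most recent [lang=xx] line
  current_lang = nil # language of the open block we are inside, if any

  text.each_line(chomp: true) do |line|
    if current_lang
      if line == "--"
        current_lang = nil # closing delimiter ends the block
      else
        blocks[current_lang] << line
      end
    elsif (match = line.match(/\A\[lang=([A-Za-z_-]+)\]\z/))
      pending_lang = match[1]
    elsif line == "--" && pending_lang
      current_lang = pending_lang # opening delimiter starts the block
      pending_lang = nil
    end
  end

  blocks.transform_values { |lines| lines.join("\n") }
end

SNIPPET = <<~ADOC
  [lang=ja]
  --
  (Japanese content)
  --

  [lang=en]
  --
  (English content)
  --
ADOC

puts extract_lang_blocks(SNIPPET).fetch("ja")
```

Because the language attribute and the block delimiters are the only grammar this relies on, a validator for the snippet form would mostly need to check those two things.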

Identifier for encoding of rich text

We need to come up with a "magic format recognizing phrase" that differentiates AsciiDoc from other rich text formats.

If we use approach 1, then the initial = ... distinguishes it from # ... (Markdown).

Another approach in use is the YAML front matter format used by Jekyll, i.e.:

---
format: asciidoc
---
= blah
...
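For the front matter approach, detection could look roughly like the sketch below. `split_front_matter` is a hypothetical helper, and the `format` key is taken from the sample above; nothing here is an existing Coradoc or Jekyll API. It uses only Ruby's standard `yaml` library.

```ruby
require "yaml"

# Matches a leading block of the form:
#   ---
#   key: value
#   ---
FRONT_MATTER = /\A---[ \t]*\n(.*?)\n---[ \t]*\n/m

# Hypothetical helper: returns [format, body]. When the text has no front
# matter, or the front matter has no "format" key, `default` is returned
# as the format.
def split_front_matter(text, default: nil)
  match = FRONT_MATTER.match(text)
  return [default, text] unless match

  meta = YAML.safe_load(match[1])
  format = meta.is_a?(Hash) ? meta.fetch("format", default) : default
  [format, match.post_match]
end

SAMPLE = <<~TEXT
  ---
  format: asciidoc
  ---
  = blah
  ...
TEXT

format, body = split_front_matter(SAMPLE)
puts "format=#{format.inspect}"
puts body
```

Text without front matter falls through with the default format, so callers can decide separately how to treat untagged content.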

We could add a filter system to Coradoc that transforms a Coradoc tree before the result is output. We could then use it on the command line like so: --filter=language:pt_BR, which would filter out all nodes whose language is not pt_BR. Similarly, we could provide an API, for instance by extending Coradoc::Converter: https://github.com/metanorma/coradoc/blob/main/lib/coradoc/converter.rb . This should be straightforward.
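To make the filter idea concrete, here is a minimal sketch over a stand-in tree. The `Node` class, `filter_language`, and `collect_text` are all hypothetical and only illustrate the shape of such a filter; Coradoc's real element classes and converter API are not used and may look quite different. The sketch assumes that nodes without a language apply to every language and are kept.

```ruby
# Stand-in node type for illustration only; not a Coradoc class.
class Node
  attr_reader :type, :lang, :text, :children

  def initialize(type, lang: nil, text: nil, children: [])
    @type = type
    @lang = lang
    @text = text
    @children = children
  end
end

# Returns a copy of the tree without nodes tagged with a different
# language. Untagged nodes are kept (an assumption of this sketch).
def filter_language(node, lang)
  return nil if node.lang && node.lang != lang

  kept = node.children.map { |child| filter_language(child, lang) }.compact
  Node.new(node.type, lang: node.lang, text: node.text, children: kept)
end

# Flattens the text of a tree in document order, for inspection.
def collect_text(node)
  [node.text, *node.children.flat_map { |child| collect_text(child) }].compact
end

DOC = Node.new(:document, children: [
  Node.new(:section, lang: "ja", children: [
    Node.new(:paragraph, text: "(Japanese content)")
  ]),
  Node.new(:section, lang: "en", children: [
    Node.new(:paragraph, text: "(English content)")
  ]),
  Node.new(:image, text: "shared figure")
])

puts collect_text(filter_language(DOC, "ja"))
```

Returning a new tree rather than mutating the input keeps the filter composable with other transformations that might run before output.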

We currently don't have a plain text syntax that differentiates between AsciiDoc and other rich text formats

We have file extensions and/or MIME types. Anything more, I'm afraid, will not be used by the majority of documents. This kind of feels like a UTF-8 BOM to me. If we want to deduce the type of some embedded text, I'd rather suggest extending the embedding format to contain a MIME type.

opoudjis commented 2 months ago

We currently don't have a plain text syntax that differentiates between AsciiDoc and other rich text formats

We have file extensions and/or MIME types. Anything more, I'm afraid, will not be used by the majority of documents. This kind of feels like a UTF-8 BOM to me. If we want to deduce the type of some embedded text, I'd rather suggest extending the embedding format to contain a MIME type.

We've recently done away with it as unused, but the Relaton grammar had MIME types for the text it would find (and it never used them). Metanorma XML and AsciiDoc, of course, are custom formats.

{ ( "text/plain" | "text/html" | "application/docbook+xml" |
"application/tei+xml" | "text/x-asciidoc" | "text/markdown" | "application/x-metanorma+xml" | text ) }
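If the embedding format carried a MIME type as suggested, the consuming side could be as small as the sketch below. `EmbeddedText` and `KNOWN_TEXT_FORMATS` are hypothetical names; the list of types is copied from the grammar fragment above, and its final `text` alternative (any other string) is deliberately not modelled.

```ruby
# MIME types copied from the grammar fragment quoted in the thread.
KNOWN_TEXT_FORMATS = %w[
  text/plain
  text/html
  application/docbook+xml
  application/tei+xml
  text/x-asciidoc
  text/markdown
  application/x-metanorma+xml
].freeze

# Hypothetical container pairing embedded text with its declared MIME type.
EmbeddedText = Struct.new(:mime_type, :body) do
  # True when the declared type is one of the enumerated values.
  def known_format?
    KNOWN_TEXT_FORMATS.include?(mime_type)
  end

  # True when the body should be handed to an AsciiDoc parser.
  def asciidoc?
    mime_type == "text/x-asciidoc"
  end
end

note = EmbeddedText.new("text/x-asciidoc", "= blah")
puts note.asciidoc?
```

With the type carried alongside the text, no "magic" leading phrase inside the content itself would be needed.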
