metanorma / coradoc

Coradoc is the Core AsciiDoc Parser used by Metanorma
MIT License

Command to output the content in selected language #127

Open kwkwan opened 2 months ago

kwkwan commented 2 months ago

Assume I have content in both Japanese and English that looks like the following:

[format=asciidoc]

[lang=jp]
== Content

居住その他の目的をもって構築された建築物。

普通建物、堅ろう建物、普通無壁舎及び堅ろう無壁舎に区分する。

普通建物とは、3階未満の建物及び3階以上の木造等で建築された建物をいう。

堅ろう建物とは、鉄筋コンクリート等で建築された建物で、地上3階以上又は3階相当以上の高さのものやスタンドを備えた競技場をいう。

普通無壁舎とは、側壁のない建物、温室及び工場内の建物類似の構築物で、3階未満のものをいう。

堅ろう無壁舎とは、鉄筋コンクリート等で建築された側壁のない建物及び建物類似の構築物で、地上3階以上又は3階相当以上の高さのものをいう。

[.source]
<<gsi_ops,annex=7,付録7 公共測量標準図式>>

.bldg:Buildingの例
image::images/041.webp.png[]

LOD0からLOD3 までは、建築物の屋外の形状を表現する。
LOD4では、建築物の屋外の形状に加え、屋内の形状を表現する。

[lang=en]
== Content

Buildings constructed for residential and other purposes.

They are classified into ordinary buildings, reinforced buildings, ordinary open buildings, and reinforced open buildings.

An ordinary building refers to a building with less than 3 floors or a building constructed with wood or similar materials with 3 or more floors.

A reinforced building refers to a building constructed with reinforced concrete or similar materials, with a height of 3 floors or more or a stadium with stands.

An ordinary open building refers to a building without side walls, such as a greenhouse or a building inside a factory, with less than 3 floors.

A reinforced open building refers to a building without side walls, such as a building constructed with reinforced concrete or similar materials, with a height of 3 floors or more or a building similar to a stadium.

[.source]
<<gsi_ops, annex=7, Appendix 7 Public Survey Standard Diagram>>

.bldg: Example of a Building
image::images/041.webp.png[]

LOD0 to LOD3 represent the outdoor shape of the building.
LOD4 represents both the outdoor shape of the building and the indoor shape.

I would like to have a command to selectively output the content like:

content.output(lang: :jp)
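
A minimal sketch of what such a selection could look like, written in plain Ruby rather than against Coradoc's API. The helper name select_language and its return shape are assumptions made for illustration; it only treats a "[lang=xx]" line as the start of a language section and ignores everything before the first one.

```ruby
# Hypothetical helper, not part of Coradoc: splits text into sections that
# start with a "[lang=xx]" attribute line and returns the requested one.
def select_language(text, lang)
  sections = Hash.new { |hash, key| hash[key] = [] }
  current = nil
  text.each_line do |line|
    if (match = line.match(/\A\[lang=([A-Za-z_-]+)\]\s*\z/))
      current = match[1].to_sym          # a new language section starts here
    elsif current
      sections[current] << line          # collect lines of the current section
    end
  end
  sections.key?(lang) ? sections[lang].join.strip : nil
end

sample = <<~ADOC
  [format=asciidoc]

  [lang=jp]
  == Content

  居住その他の目的をもって構築された建築物。

  [lang=en]
  == Content

  Buildings constructed for residential and other purposes.
ADOC

puts select_language(sample, :jp)
```

The sketch returns nil when the requested language is absent, which leaves the single-language case open for the caller to decide.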

ReesePlews commented 2 months ago

hello @kwkwan, this looks very similar to the template i have been using to paste content into the "Notes property". what is == Content? is this an asciidoc heading, or a special indicator that "content" is to follow?

do we need to support the case that only one language is present, and in that case would the [lang xx] indicator still be required? just wondering

ronaldtse commented 2 months ago

I discussed this syntax with @opoudjis a few days ago.

There are basically 2 ways we can embed AsciiDoc content in an external plain text field:

  1. As an individual AsciiDoc document.
  2. As an AsciiDoc "snippet" (i.e. without attributes and title).

In addition, the header line of [format=asciidoc] is also unsupported in AsciiDoc. We currently don't have a plain text syntax that differentiates between AsciiDoc and other rich text formats. We need to come up with something?

Individual AsciiDoc document

Sample:

= Content (title is unnecessary)

[lang=ja]
== Content (clause heading is unnecessary)

(Japanese content)

[lang=en]
== Content (clause heading is unnecessary)

(English content)

AsciiDoc snippet

Sample

[lang=ja]
--
(Japanese content)
--

[lang=en]
--
(English content)
--
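
To illustrate how the snippet form could be read, here is a rough sketch in plain Ruby. The function parse_lang_blocks is hypothetical and is not Coradoc's grammar; it assumes each language is given as a "[lang=xx]" line followed by an open block delimited by "--" lines, exactly as in the sample.

```ruby
# Hypothetical parser for the snippet form above; it is not Coradoc's grammar.
# It only recognizes "[lang=xx]" followed by a "--" delimited open block.
def parse_lang_blocks(text)
  blocks = {}
  lang = nil
  buffer = nil
  text.each_line(chomp: true) do |line|
    if buffer                            # currently inside an open block
      if line == "--"                    # closing delimiter
        blocks[lang] = buffer.join("\n")
        lang = buffer = nil
      else
        buffer << line
      end
    elsif (match = line.match(/\A\[lang=([A-Za-z_-]+)\]\z/))
      lang = match[1]
    elsif line == "--" && lang           # opening delimiter after a lang line
      buffer = []
    end
  end
  blocks
end

snippet = <<~ADOC
  [lang=ja]
  --
  (Japanese content)
  --

  [lang=en]
  --
  (English content)
  --
ADOC

p parse_lang_blocks(snippet)
```

The result is a hash keyed by language code, so picking one language is a plain lookup.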

Identifier for encoding of rich text

We need to come up with a "magic format recognizing phrase" that differentiates AsciiDoc from other rich text formats.

If we use approach 1, then the initial = ... can differentiate against # ... (Markdown).

The other approach being used is the YAML front matter format that is used by Jekyll, i.e.

---
format: asciidoc
---
= blah
...
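
Both recognition ideas can be sketched together. The following is only an illustration under stated assumptions: detect_format and its return values are made up for this example, the front matter key "format" follows the sample above, and the heading check is a heuristic that can misfire on content that does not start with a heading.

```ruby
require "yaml"

# Hypothetical detector: the name and return values are illustrative only.
# It checks for Jekyll-style front matter first, then falls back to the
# first heading marker ("=" for AsciiDoc, "#" for Markdown).
def detect_format(text)
  if (match = text.match(/\A---\s*\n(.*?)\n---\s*\n/m))
    meta = YAML.safe_load(match[1])
    return meta["format"].to_s if meta.is_a?(Hash) && meta["format"]
    text = match.post_match              # front matter without a format key
  end

  first = text.each_line.map(&:strip).find { |line| !line.empty? }
  case first
  when /\A=+\s/ then "asciidoc"
  when /\A#+\s/ then "markdown"
  else "unknown"
  end
end

puts detect_format("---\nformat: asciidoc\n---\n= blah\n")
puts detect_format("# Title\n\nSome text\n")
```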

Next

Approach 1 is definitely easier to implement because the grammar is well defined. This is the "give me all" approach, and for elements that are not supported in the user's case (e.g. if this is to be inserted in a table cell), the user will have to tailor the content to fit.

Approach 2 requires defining an "AsciiDoc profile", possibly via #74, but enumerating the possible values can be difficult. However, it will save the user the trouble of filtering out unsupported elements.

I think both Approach 1 and 2 tell us that #74 is absolutely necessary for this task, either way.

Thoughts @hmdne ?

hmdne commented 2 months ago

We could add a filter system to Coradoc that transforms a Coradoc tree before the result is output. We could then use it on the command line like so: --filter=language:pt_BR, which would filter out all nodes whose language is not pt_BR. Similarly, we could provide an API, for instance by extending Coradoc::Converter: https://github.com/metanorma/coradoc/blob/main/lib/coradoc/converter.rb . This should be straightforward.
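
A rough sketch of the filtering idea, independent of Coradoc's real element classes. The Node struct and filter_language are hypothetical names for illustration, and keeping nodes that carry no language tag is an assumption about the desired behavior.

```ruby
# Hypothetical node type; Coradoc's real element classes are not used here.
Node = Struct.new(:type, :lang, :children, :text, keyword_init: true)

# Returns a copy of the tree without nodes tagged with a different language.
# Nodes without a language tag are kept (assumed to be language-neutral).
def filter_language(node, lang)
  return nil if node.lang && node.lang != lang

  kept = (node.children || []).map { |child| filter_language(child, lang) }.compact
  Node.new(type: node.type, lang: node.lang, children: kept, text: node.text)
end

tree = Node.new(type: :document, children: [
  Node.new(type: :section, lang: "ja", text: "(Japanese content)"),
  Node.new(type: :section, lang: "en", text: "(English content)"),
  Node.new(type: :image, text: "images/041.webp.png")
])

filtered = filter_language(tree, "en")
puts filtered.children.map(&:text)
```

Because the filter builds a new tree instead of mutating the input, the same parsed document can be filtered once per language.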

> We currently don't have a plain text syntax that differentiates between AsciiDoc and other rich text formats

We have file extensions and/or MIME types. Anything more, I'm afraid, will not be used by the majority of documents. This kind of feels like a UTF-8 BOM to me. If we want to deduce the type of some embedded text, I'd rather suggest extending the embedding format to contain a MIME type.

opoudjis commented 2 months ago

> > We currently don't have a plain text syntax that differentiates between AsciiDoc and other rich text formats
>
> We have file extensions and/or MIME types. Anything more, I'm afraid, will not be used by the majority of documents. This kind of feels like a UTF-8 BOM to me. If we want to deduce the type of some embedded text, I'd rather suggest extending the embedding format to contain a MIME type.

We recently removed it as unused, but the Relaton grammar used to carry MIME types for the text it would find (it never used them). Metanorma XML and AsciiDoc, of course, are custom formats.

{ ( "text/plain" | "text/html" | "application/docbook+xml" |
    "application/tei+xml" | "text/x-asciidoc" | "text/markdown" |
    "application/x-metanorma+xml" | text ) }
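
As an illustration of embedding text together with its MIME type, here is a small sketch. The EmbeddedText struct and its field names are assumptions; only the MIME type strings are taken from the grammar fragment above. Since the grammar also allows arbitrary text as a type, an unlisted type is reported as unknown rather than rejected.

```ruby
# Hypothetical envelope for embedded text; the field names are assumptions.
# The MIME types are the ones listed in the grammar fragment above.
KNOWN_TEXT_TYPES = [
  "text/plain", "text/html", "application/docbook+xml", "application/tei+xml",
  "text/x-asciidoc", "text/markdown", "application/x-metanorma+xml"
].freeze

EmbeddedText = Struct.new(:mime_type, :body) do
  # True when the body should be handed to an AsciiDoc parser.
  def asciidoc?
    mime_type == "text/x-asciidoc"
  end

  # True when the type is one of the enumerated values.
  def known_type?
    KNOWN_TEXT_TYPES.include?(mime_type)
  end
end

note = EmbeddedText.new("text/x-asciidoc", "== Content")
puts note.asciidoc?
```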