metanorma / coradoc

Coradoc is the Core AsciiDoc Parser used by Metanorma
MIT License

HTML document work #61

Closed hmdne closed 1 month ago

hmdne commented 1 month ago

This branch aims to convert an HTML file from metanorma/reverse_adoc#90.

Metanorma PR checklist

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 97.44246%, with 10 lines in your changes missing coverage. Please review.

Project coverage is 98.46%. Comparing base (defb04a) to head (d8963e8). Report is 13 commits behind head on main.

Files                                           Patch %   Lines
lib/coradoc/reverse_adoc/html_converter.rb      87.50%    8 Missing
lib/coradoc/reverse_adoc/converters/table.rb    97.97%    2 Missing

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    metanorma/reverse_adoc#61    +/-   ##
==========================================
+ Coverage   96.67%   98.46%   +1.78%
==========================================
  Files          42       46       +4
  Lines        1054     1306     +252
==========================================
+ Hits         1019     1286     +267
+ Misses         35       20      -15
```


ronaldtse commented 1 month ago

These tasks will be necessary for the task:

hmdne commented 1 month ago

I use AsciiDoctor to round-trip a document. This is one of the first issues I found that turned out to be an issue with AsciiDoctor actually (unless I am mistaken and this is not possible in AsciiDoc):

https://github.com/asciidoctor/asciidoctor/issues/4595

Anyway, the document round trips successfully at this point, though there are still a lot of issues remaining.

ronaldtse commented 1 month ago

That's fine. We will need to ensure we test Coradoc against AsciiDoctor behavior.

Coradoc is meant to be a replacement to AsciiDoctor:

ronaldtse commented 1 month ago

I use AsciiDoctor to round-trip a document. This is one of the first issues I found that turned out to be an issue with AsciiDoctor actually (unless I am mistaken and this is not possible in AsciiDoc):

asciidoctor/asciidoctor#4595

A normal AsciiDoctor table cell is plain text only. To allow an image in a table cell, you need to mark the cell as an "AsciiDoc table cell" by using the a| prefix:

[cols="1,1"]
|===
|cell1
a|image::images/004.webp["",200,100]
|===

hmdne commented 1 month ago

I just realized this was a bogus issue report, and it's an issue on our side actually.

ronaldtse commented 1 month ago

Let's gather up any questions within Coradoc first and the team will answer any questions so we don't affect others' repositories.

cc @opoudjis @Intelligent2013 @manuelfuenmayor @anermina

hmdne commented 1 month ago

6c4a059 makes it so that tables are now computed correctly (mostly, still in testing).

With this change, the following fragment:

image

is round-tripped into:

image

The apparent difference is in the column widths: I add an attribute such as cols="3*" to the table, which gives the resulting HTML predefined column widths, whereas the original document relies on the web browser to deduce them. I have found no way to disable this behavior.

Another difference is the lack of BGCOLOR. Should I pass this attribute along? Perhaps only when some setting is enabled?

hmdne commented 1 month ago

After this commit, the document is mostly readable in my opinion. There are still some crucial issues that I can see, but the document is now, let's say, testable.

Note: I still haven't implemented the --split-sections option, so only a single .adoc file is output.

Below is an archive containing an adoc file created using this branch, together with the HTML file that results from processing that file with AsciiDoctor: document.tar.gz

ronaldtse commented 1 month ago

Thanks @hmdne , this is respectable progress!

The only thing is that the document is to be tested using Metanorma, not AsciiDoctor. The sample document for that is in the mn-samples-plateau repository (001-v3 is the v3 of this document, the new HTML version is 001-v4)

This HTML document was developed to adhere to Metanorma styling.

hmdne commented 1 month ago

@ronaldtse Thanks for the clarification. I will take a deeper look at how they compare. For now, I need to work a little more on tables, so that we reliably produce correct AsciiDoc output.

hmdne commented 1 month ago

@ronaldtse A question: this document is not necessarily semantic HTML; it sometimes relies on styling instead. For instance:

Instead of <h2> it uses <div class="subtitledata">. Instead of <th> it uses <td BGCOLOR="#dddddd">.

Creating a proper document won't be possible with that in mind. We can't add exceptions like this to reverse_adoc logic, since this is internal to just this document and its styling (or should we? I think the purpose of reverse_adoc is to be agnostic to formats). Otherwise, we will need to add a script to preprocess it and perhaps even postprocess it if Metanorma-compatible content is desired. Can you perhaps provide us some hints on that? (As in, is it a scope of this task, in which repo should such pre/postprocessors land, etc.)

hmdne commented 1 month ago

The last commit was quite a challenge, but it provides us a way to handle tables like this:

image

The marked cell... doesn't exist, so we need to add it. Beyond that, AsciiDoctor in particular complained about various tables having wrong colspan/rowspan values (spans extending past the end of the table), thereby losing some data. Browsers, meanwhile, display such tables as intended, assuming a mistake and transparently correcting it.

This particular table is now compiled to the following: image

And rendered as such: image

There are still some minor issues, but we are much closer to the final stages now.
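
The table repair described above could be sketched roughly as follows. This is only an illustration in plain Ruby, not the code from the commit: the Cell struct and the normalize_table helper are hypothetical names, and the column count is assumed to be known in advance. It clamps spans that run past the table edge and appends empty cells for grid positions the HTML left uncovered.

```ruby
# Hypothetical cell model; not part of Coradoc or reverse_adoc.
Cell = Struct.new(:text, :colspan, :rowspan)

# Returns new rows in which every span stays inside the table and every
# grid position of each row is covered by some cell.
def normalize_table(rows, column_count)
  occupied = Array.new(rows.size) { Array.new(column_count, false) }

  rows.each_with_index.map do |row, r|
    fixed = []
    col = 0
    row.each do |cell|
      # Skip positions already taken by a rowspan from a row above.
      col += 1 while col < column_count && occupied[r][col]
      break if col >= column_count # no room left; a real converter might warn

      # Clamp the spans to the table edge, then stop before any taken slot.
      wanted = [[cell.colspan, 1].max, column_count - col].min
      width = (col...(col + wanted)).take_while { |c| !occupied[r][c] }.size
      height = [[cell.rowspan, 1].max, rows.size - r].min

      (r...(r + height)).each do |rr|
        (col...(col + width)).each { |cc| occupied[rr][cc] = true }
      end
      fixed << Cell.new(cell.text, width, height)
      col += width
    end

    # Whatever is still free in this row is a cell missing from the source.
    occupied[r].each_index do |c|
      next if occupied[r][c]

      occupied[r][c] = true
      fixed << Cell.new("", 1, 1)
    end
    fixed
  end
end
```

Because cells are placed left to right into the first free position, any position still free at the end of a row lies to the right of the last placed cell, so appending the filler cells keeps the cell order valid.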

ronaldtse commented 1 month ago

Creating a proper document won't be possible with that in mind. We can't add exceptions like this to reverse_adoc logic, since this is internal to just this document and its styling (or should we? I think the purpose of reverse_adoc is to be agnostic to formats). Otherwise, we will need to add a script to preprocess it and perhaps even postprocess it if Metanorma-compatible content is desired. Can you perhaps provide us some hints on that? (As in, is it a scope of this task, in which repo should such pre/postprocessors land, etc.)

I wonder if we can provide a configuration for "header mapping" using CSS selection syntax, e.g.:

convert({
  clause_level_1: '.h1',
  clause_level_2: 'div.subtitledata',
  # ...
})

The problem right now is that there are no semantically encoded clauses (==, ===, ...).
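
A minimal sketch of how such a mapping might be applied, assuming a deliberately tiny selector matcher. Everything here is hypothetical (HtmlNode, HEADER_MAPPING, simple_selector_match?, heading_for); a real implementation would presumably delegate selector matching to the HTML library, and only "tag", ".class" and "tag.class" selectors are handled.

```ruby
# Hypothetical stand-in for an HTML element.
HtmlNode = Struct.new(:name, :classes, :text)

# Illustrative mapping mirroring the configuration proposed above.
HEADER_MAPPING = {
  clause_level_1: ".h1",
  clause_level_2: "div.subtitledata",
}.freeze

# Supports only "tag", ".class" and "tag.class" selectors.
def simple_selector_match?(node, selector)
  tag, css_class = selector.split(".", 2)
  (tag.empty? || tag == node.name) &&
    (css_class.nil? || node.classes.include?(css_class))
end

# Returns an AsciiDoc heading line, or nil if the node is not a mapped header.
def heading_for(node, mapping = HEADER_MAPPING)
  key, = mapping.find { |_, selector| simple_selector_match?(node, selector) }
  return nil unless key

  level = key.to_s[/\d+\z/].to_i
  "#{'=' * (level + 1)} #{node.text}" # clause level 1 becomes "==", level 2 "==="
end
```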

ronaldtse commented 1 month ago

@hmdne have you seen https://github.com/metanorma/mn-samples-plateau/tree/main/sources/001-v3 ? This is the v3 version of the document discussed here (the link given to you is v4).

You can see how the document is encoded and structured.

In any case, there are 2 layers of implementation:

  1. Generic HTML => AsciiDoc/Metanorma
  2. Additional conversion routines for this particular document (there are two documents: 001-v4, 002-v4)

I think we need to allow overrides to the conversion process, such as hooks, so users can convert their documents properly.

hmdne commented 1 month ago

I wonder if we can provide a configuration for "header mapping" using CSS selection syntax, e.g.:

convert({
  clause_level_1: '.h1',
  clause_level_2: 'div.subtitledata',
  # ...
})

The problem right now is that there are no semantically encoded clauses (==, ===, ...).

For some documents this will likely suffice, but others may need a Turing-complete environment. I will shortly push a new commit with my proposal.

ronaldtse commented 1 month ago

(Generic AsciiDoc) Treatment of href links without reference text

Now:

https://www.geospatial.jp/iur/codelists/3.0/Appearance_mimeType.xml[https://www.geospatial.jp/iur/codelists/3.0/Appearance_mimeType.xml]

Should just be:

https://www.geospatial.jp/iur/codelists/3.0/Appearance_mimeType.xml

In AsciiDoc, a bare web URL automatically becomes a link.
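
A small sketch of that rule, assuming a hypothetical helper named asciidoc_link (not an existing reverse_adoc method). It emits the bare URL when the link text is empty or identical to the target, and only treats http/https targets as safe to leave bare.

```ruby
# Hypothetical helper: renders an HTML link as AsciiDoc.
def asciidoc_link(href, text = nil)
  label = text.to_s.strip
  web_url = href.match?(%r{\Ahttps?://})

  # A bare web URL is auto-linked, so repeating it as link text adds nothing.
  return href if web_url && (label.empty? || label == href)
  return "#{href}[#{label}]" if web_url

  # Other targets (e.g. relative paths) need the explicit link macro.
  "link:#{href}[#{label}]"
end
```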

(Generic AsciiDoc) Clause encoding

Now

[[toc4_22]]
4.22 アピアランスモデルの応用スキーマ

[[toc4_22_04]]
4.22.4 アピアランスモデルで使用するコードリストと列挙型

Should be:

[[toc4_22]]
=== アピアランスモデルの応用スキーマ

[[toc4_22_04]]
==== アピアランスモデルで使用するコードリストと列挙型
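
One way to derive the heading level is from the clause number itself. The sketch below is an assumption about how this could be done, not the converter's actual logic; clause_heading is a hypothetical name, and it should only be applied to lines already known to be clause titles, since any line that starts with a number would match.

```ruby
# Leading dotted clause number, whitespace, then the title.
CLAUSE_NUMBER = /\A(\d+(?:\.\d+)*)\s+(.+)\z/

# "4.22 Title" has two number components, so it becomes a "===" heading.
# Returns nil for titles without a leading clause number.
def clause_heading(line)
  match = CLAUSE_NUMBER.match(line.strip)
  return nil unless match

  depth = match[1].split(".").size
  "#{'=' * (depth + 1)} #{match[2]}"
end
```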

(This document) TOC should be removed

Now:

CONTENTS

<<toc0_01>>
はじめに

<<toc0_02>>
改定の概要

<<toc1>>
1 概覧

<<toc1_01>>
1.1 製品仕様の作成情報

<<toc1_02>>
1.2 目的

Should be stripped because it is supposed to be dynamically generated.

(Generic AsciiDoc) List levels and list embedding

Screenshot 2024-05-24 at 2 44 23 PM

Now:

[[toc0_02]]
改定の概要
...

2023/4/7発行 3D都市モデル標準製品仕様書 第3.0版

* 2022年度は、以下の観点により2021年度の製品仕様を拡張し、標準製品仕様を改定した。

1. 地物の拡充

* 都市空間の地物の網羅性を高めるため、「鉄道」、「徒歩道」、「広場」、「航路」、「橋梁」、「トンネル」、「その他の構造物」、「地下埋設物」、「地下街」、「水部」及び「区域」を追加した。

2. LOD(Level Of Detail:詳細度)の拡大及び精緻化

* 「建築物」のLOD4を追加した。なお、LOD4の定義はBIMモデルの国際標準であるIFCとの整合させた。

* 各地物のLODの定義の記述方法を統一し、その内容を精緻化した(定義自体に変更はない)。

3. 引用する仕様(i-UR)の更新

* 2022年度の標準製品仕様は、i-UR第3.0版(i-UR3.0)を採用する。i-UR3.0は、2022年度のProject PLATEAUの検討成果が反映され、改定されたものである。

Should be:

[[toc0_02]]
== 改定の概要

...

2023/4/7発行 3D都市モデル標準製品仕様書 第3.0版

* 2022年度は、以下の観点により2021年度の製品仕様を拡張し、標準製品仕様を改定した。

.. 地物の拡充

*** 都市空間の地物の網羅性を高めるため、「鉄道」、「徒歩道」、「広場」、「航路」、「橋梁」、「トンネル」、「その他の構造物」、「地下埋設物」、「地下街」、「水部」及び「区域」を追加した。

.. LOD(Level Of Detail:詳細度)の拡大及び精緻化

*** 「建築物」のLOD4を追加した。なお、LOD4の定義はBIMモデルの国際標準であるIFCとの整合させた。

*** 各地物のLODの定義の記述方法を統一し、その内容を精緻化した(定義自体に変更はない)。

.. 引用する仕様(i-UR)の更新

*** 2022年度の標準製品仕様は、i-UR第3.0版(i-UR3.0)を採用する。i-UR3.0は、2022年度のProject PLATEAUの検討成果が反映され、改定されたものである。
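
The marker scheme in the expected output (the marker character repeated once per nesting level) can be sketched as below. ListItem and render_list are hypothetical names for illustration and do not reflect Coradoc's list model.

```ruby
# Hypothetical list model: ordered items use ".", unordered items use "*".
ListItem = Struct.new(:text, :ordered, :children)

# Returns the AsciiDoc lines for a nested list; the marker is repeated
# once per nesting level, as in the expected output above.
def render_list(items, depth = 1)
  items.flat_map do |item|
    marker = (item.ordered ? "." : "*") * depth
    ["#{marker} #{item.text}", *render_list(item.children || [], depth + 1)]
  end
end
```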

hmdne commented 1 month ago

I have looked at the example and will try to make the converter produce similar output. For now, let me send an updated generated document:

document.tar.gz

ronaldtse commented 1 month ago

(Generic AsciiDoc) Table headers

Screenshot 2024-05-24 at 2 46 23 PM

Now:

[[toc1_04]]
1.4 引用規格等

標準製品仕様書は、以下の規格、規程及び仕様書を引用する。

表 1-1 標準製品仕様書が引用する規格等

[cols=2*]
|===
a| 

文書名

a| 

URL

a| 

Data Encoding Specification of i-Urban Revitalization -Urban Planning ADE- ver.3.0(内閣府地方創生推進事務局)

a| 

https://www.chisou.go.jp/tiiki/toshisaisei/itoshisaisei/iur/index.html[https://www.chisou.go.jp/tiiki/toshisaisei/itoshisaisei/iur/index.html]

Should be:

[[toc1_04]]
=== 引用規格等

標準製品仕様書は、以下の規格、規程及び仕様書を引用する。

.標準製品仕様書が引用する規格等
[cols=2*]
|===
a| 文書名 a| URL

a|
Data Encoding Specification of i-Urban Revitalization -Urban Planning ADE- ver.3.0(内閣府地方創生推進事務局)

a|
https://www.chisou.go.jp/tiiki/toshisaisei/itoshisaisei/iur/index.html

hmdne commented 1 month ago

Thank you for the review @ronaldtse

I found something interesting in this document: it actually contains data for column widths, which we could make use of.

In any case, I'm uploading the newest version of the document: document.tar.gz
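
As a rough illustration of using those widths, the sketch below reduces them to the smallest whole-number ratio for a cols attribute. The cols_attribute helper is hypothetical, and it assumes every width uses the same unit; if any width is missing, it falls back to equal columns.

```ruby
# Hypothetical helper: builds a cols attribute from HTML column widths
# such as ["450", "200"] or ["90px", "40px"] (same unit assumed throughout).
def cols_attribute(widths)
  return nil if widths.empty?

  ints = widths.map { |w| w.to_s[/\d+/].to_i }
  value =
    if ints.any?(&:zero?)
      "#{widths.size}*" # a width is missing or unusable: equal columns
    else
      divisor = ints.reduce(:gcd)
      ints.map { |w| w / divisor }.join(",")
    end
  "[cols=\"#{value}\"]"
end
```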

hmdne commented 1 month ago

For the preprocessor part, I am thinking about adding a hook system: something like associating a proc with a Nokogiri node, to be called before/after the Nokogiri->Coradoc stage, which would allow using AsciiDoc features that have no apparent equivalent in HTML.

ronaldtse commented 1 month ago

Providing Procs when encountering a node in a tree traversal will work.

It's a separate issue, but it would be good if we could generalize the XML adapter (for example, based on the XML adapters extracted from Plurimath). Maybe we should just have a generalized XML interface (a separate gem) to allow us to use different XML adapters.
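
The idea of calling procs during tree traversal could look roughly like this. It is a plain-Ruby sketch with hypothetical names (TreeNode, HookedConverter, before, after); it does not use Nokogiri and is not the plugin architecture implemented in this PR.

```ruby
# Hypothetical stand-in for a parsed HTML node.
TreeNode = Struct.new(:name, :text, :children)

class HookedConverter
  def initialize
    @pre_hooks = Hash.new { |hash, key| hash[key] = [] }
    @post_hooks = Hash.new { |hash, key| hash[key] = [] }
  end

  # Pre-hooks receive the node and return a (possibly replaced) node.
  def before(name, &block)
    @pre_hooks[name] << block
  end

  # Post-hooks receive the node and its converted text and return new text.
  def after(name, &block)
    @post_hooks[name] << block
  end

  def convert(node)
    node = @pre_hooks[node.name].reduce(node) { |n, hook| hook.call(n) }
    output = default_convert(node)
    @post_hooks[node.name].reduce(output) { |out, hook| hook.call(node, out) }
  end

  private

  # Placeholder conversion: concatenates the node text and its children.
  def default_convert(node)
    "#{node.text}#{(node.children || []).map { |child| convert(child) }.join}"
  end
end
```

With this shape, a document-specific plugin could, for example, register a pre-hook that rewrites a styled div into a heading node and a post-hook that emits the heading markup.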

hmdne commented 1 month ago

Providing Procs when encountering a node in a tree traversal will work.

It's a separate issue, but it would be good if we could generalize the XML adapter (for example, based on the XML adapters extracted from Plurimath). Maybe we should just have a generalized XML interface (a separate gem) to allow us to use different XML adapters.

I have commented in the past on the issue metanorma/coradoc#90: the abstraction in Plurimath is meant to make Oga behave like Ox, which does some weird and hacky things (regarding whitespace handling) that Plurimath depends on. If we were to do the same here, we would need to do the reverse, as ReverseAdoc relies on Nokogiri semantics (which should be the same as Oga semantics). I missed one thing in that comment, though: Plurimath works with XML while reverse_adoc works with HTML, so Ox (an XML parser) won't be a good fit here, except for XHTML documents.

Today, I think I will rename Processor to Plugin, allowing multiple plugins to be used in a single conversion, with a DSL to help with common cases. I will also add the first iteration of the hooks architecture, which I need in order to complete this task.

ronaldtse commented 1 month ago

@hmdne indeed regarding the difference of parsers between HTML and XML.

The interesting thing is that we plan for Coradoc to have HTML and XML output, so this notion of an adapter will fit anyway. It will also benefit our other projects, like Plurimath, that utilize XML and HTML for parsing. Shale has a notion of XML adapters, which we could also reference for this new extracted gem.

hmdne commented 1 month ago

@ronaldtse Handling lists was very tricky, but it's ready now. I have also uncovered something like a definition list in 7.2.4, but since the markup used for it (.text2data, .text3data) is not consistent, I can't reliably detect such lists.

What I can see as remaining tasks to be done in this PR:

hmdne commented 1 month ago

To make things easier, I'm uploading the current version of the document generated:

document.tar.gz

I plan to continue development tomorrow (Sunday) at 4-6 AM GMT+2.

hmdne commented 1 month ago

We have generated a section tree at this point, so we can now split sections into individual files. I am not entirely sure this approach will translate correctly to all documents, rather than only the one we are working on.
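
Splitting a section tree into files might be sketched as below. This is a guess at the general shape, not the implemented behavior: Section, split_sections and render_section are hypothetical, and the numeric argument is assumed to mean "every section at this level or above, except the root, goes into its own file, with deeper sections kept inline".

```ruby
# Hypothetical section model; level 1 is the top-level clause.
Section = Struct.new(:id, :title, :level, :body, :children)

# Returns [main_text, files], where files maps a relative path to its content.
def split_sections(root, split_level, dir = "sections")
  files = {}
  [render_section(root, split_level, dir, files), files]
end

def render_section(section, split_level, dir, files)
  lines = [
    "[[#{section.id}]]",
    "#{'=' * (section.level + 1)} #{section.title}",
    "",
    section.body,
  ]
  (section.children || []).each do |child|
    child_text = render_section(child, split_level, dir, files)
    if child.level <= split_level
      # Move the child into its own file and pull it in with an include.
      path = "#{dir}/#{child.id}.adoc"
      files[path] = child_text
      lines << "" << "include::#{path}[]"
    else
      lines << "" << child_text
    end
  end
  lines.join("\n")
end
```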

hmdne commented 1 month ago

Thanks to a suggestion from @xyz65535, I have handled indentation in the document with [none] unordered lists. This should preserve as much of the semantics of the incoming document as possible.

In addition, I have finalized the plugin implementation. It is now possible to plug in at any meaningful stage of AsciiDoc generation. I suppose this could be used to add something like a Metanorma plugin that would, for instance, try to extract and produce data that is meaningful to Metanorma but not necessarily part of the AsciiDoc standard. The plugin architecture should support using multiple plugins in any conversion.

hmdne commented 1 month ago

Here's an example from 7.1.2.4:

Original document:

image

Our document:

image

AsciiDoc for that fragment:

image

ronaldtse commented 1 month ago

@hmdne the ideal AsciiDoc encoding:

==== 変換規則

===== スキーマ変換規則

* スキーマ変換規則は、i-UR3.0及びCityGML2.0に従う。
* なお、標準製品仕様書は、応用スキーマクラス図及びこれに対応するXMLSchemaを新規に作成するのではなく、i-UR3.0及びCityGML2.0から必要な部分のみを選択し、使用している。
* 応用スキーマクラス図に示す、クラス名、属性名及び関連役割名は、i-UR3.0及びCityGML2.0において定義されたタグに一致させている。
* また、複数の名前空間から選択しているため、全てのクラス名に、i-UR3.0又はCityGML2.0名前空間の接頭辞を付ける。

===== インスタンス変換規則

GMLに準拠する。

* オブジェクト識別子(gml:id)
+
--
データ製品に含まれる全ての地物には、gml:idによる識別可能な値を与えることとし、その値には[接頭辞]_[UUID]を使用する。

[接頭辞]は、CityGML及びi-URの各パッケージに与えられた接頭辞(表7-4)を使用する。

[UUID]は、Universally Unique Identifier(UUID)[2]とする。UUIDとは、ソフトウェア上でオブジェクトを一意に識別するための識別子であり、128ビット(16バイト)の値で表す。先頭から4ビットごとに16進数の値(0~f)に変換し、8桁-4桁-4桁-4桁-12桁に切って表現する。
--

* 集成の実装
+
--
応用スキーマに示された地物間の集成は、部品となるオブジェクトを、全体となるオブジェクトの子要素として記述する。

この時、部品となるオブジェクトの識別子(gml:id)を、全体となるオブジェクト以外のオブジェクトが参照してもよい。
--

* 空間参照系の識別
+
--
幾何オブジェクトに適用される空間参照系は、都市モデル(core:CityModel)に挿入されるEnvelop要素の属性srsNameにおいて、以下のEPSGコードを挿入することにより識別する。

[cols="9,4"]
|===
| 空間参照系の名称 | srsNameに挿入する値

| 日本測地系2011における経緯度座標系と東京湾平均海面を基準とする標高の複合座標参照系
| http://www.opengis.net/def/crs/EPSG/0/6697
|===
--

* schemaLocationの指定
+
i-URの符号化仕様は、3D都市モデル内のschemasフォルダ(7.2.4)に格納したXMLSchemaファイルへの相対パスによりschemaLocationを指定する。

The interesting thing about the PLATEAU documents is they use the clause scheme like this:

Screenshot 2024-05-27 at 5 53 35 PM

So the Level 4 and Level 5 are actually not lists, they are clauses (sections).

hmdne commented 1 month ago

The last clause level is not something we can extract programmatically, as the only class we have available is "text2data" - all we can deduce from that is that the author intended a "level 2 indentation". This class is used a lot in the document, for instance the underlined parts are also "text2data":

image

We handle this particular example specially, as per your request, so it is compiled into a numbered list. In other parts of the document, however, these are also "text2data":

image

I see no way to programmatically interpret "text2data" as anything other than "level 2 indentation", and that is what I try to express with lists.

ronaldtse commented 1 month ago

@hmdne there is always a balance between automated processing and manual processing, and I do agree that there are some portions we have to fix up manually after automated processing. As long as we know what work remains (ping @metanorma/editors ), that's fine.

hmdne commented 1 month ago

I have completed the last task on this issue. This will still need some testing, but other than that, I don't see any more remaining problems with conversion.

Below is the (hopefully) final version of the document, ready for review:

document.tar.gz

hmdne commented 1 month ago

@ronaldtse There was a minor fix uncovered by the test suite, but it doesn't affect the document. I think this PR is ready.

ronaldtse commented 1 month ago

@hmdne can you let me know how you've tested the feature?

This is what I used.

$ bundle exec reverse_adoc -rcoradoc/reverse_adoc/plugins/plateau --split-sections 2 --external-images -o plateau/index.adoc index.html

I have additional issues that I will file separately now.

ronaldtse commented 1 month ago

The remaining issues are at: