Convert "Handbook of 3D City Models: Standard Data Product Specification for 3D City Model" into Metanorma

ronaldtse commented 10 months ago

This work is done under the MLIT Plateau project.

The "Handbook of 3D City Models: Standard Data Product Specification for 3D City Model" appears to be published in the Metanorma HTML format. However, a closer look reveals that it was built with Nuxt and merely styled to look like Metanorma!

HTML: https://www.mlit.go.jp/plateaudocument (extracted here index.html.zip)
PDF: https://www.mlit.go.jp/plateau/file/libraries/doc/plateau_doc_0001_ver03.pdf (also uploaded here for speed reasons plateau_doc_0001_ver03.pdf)

A new flavor will be developed for MLIT / Plateau, so the encoding syntax will be subject to change.

We will need to do the following:

Extract all the images into static files (a script acting on the HTML will probably work).
Extract all the text content (text, tables) into AsciiDoc using reverse_adoc.

Font: It also uses the "Tokyo CityFont Cond StdN M" ("Tokyo CityFont Condensed M"): TokyoCityFontCondStdN-R.1c4f41e.otf.zip

The font page is here: https://typeproject.com/en/fonts/tokyocityfont . This is clearly a paid font, and the document only comes with the "Regular" style. So we need to create a private Fontist repository for this font.

manuelfuenmayor commented 9 months ago

extract all the text stuff (text, tables) into AsciiDoc using reverse_adoc.

reverse_adoc takes forever to finish converting this document. I couldn't get any output.

ronaldtse commented 9 months ago

So strange though... maybe @HassanAkbar can have a look at reverse_adoc?

manuelfuenmayor commented 9 months ago

Probably due to the length of the document (more than 800 pages). @anermina, could you try converting this document with reverse_adoc to confirm this behavior?

ronaldtse commented 9 months ago

@manuelfuenmayor I’ve started trying it, so @anermina there’s no need to try. Thanks!!

ronaldtse commented 9 months ago

Some challenges with this document:

The size is too large to even load in one HTML file (50 MB).
There is too much content for reasonable navigation.

ronaldtse commented 9 months ago

@manuelfuenmayor reverse_adoc worked on my computer, but I don't know how long it took. I've pushed the result, but it's a 20 MB file because the images are all inlined. The first thing we have to do is extract the images into separate files, and there are a lot of them. Maybe some 'grep' command would be able to extract all the images... unless we update the reverse_adoc gem to export images individually.

manuelfuenmayor commented 9 months ago

Thanks @ronaldtse. After removing the images, I was able to get an output from reverse_adoc. I've extracted the images using grep (along with base64), as you suggested.
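For reference, the same kind of extraction can be sketched in Python. This is not the grep/base64 command actually used here. It is a minimal sketch that assumes the images are inlined as `src="data:image/...;base64,..."` attributes; the function name `extract_inline_images` and the `image-0001.png` naming scheme are hypothetical.

```python
import base64
import re
from pathlib import Path

# Assumption: images are inlined as src="data:image/<type>;base64,<payload>".
DATA_URI = re.compile(
    r'src="data:image/(?P<ext>[a-zA-Z0-9.+-]+);base64,(?P<data>[A-Za-z0-9+/=\s]+)"'
)


def extract_inline_images(html: str, out_dir: Path, prefix: str = "image") -> str:
    """Write each inlined image to out_dir and return the HTML with the
    data URIs replaced by paths to the written files."""
    out_dir.mkdir(parents=True, exist_ok=True)
    counter = 0

    def replace(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        ext = match.group("ext").lower()
        # Map MIME subtypes to conventional file extensions.
        ext = {"jpeg": "jpg", "svg+xml": "svg"}.get(ext, ext)
        path = out_dir / f"{prefix}-{counter:04d}.{ext}"
        # b64decode ignores embedded whitespace by default.
        path.write_bytes(base64.b64decode(match.group("data")))
        return f'src="{path.as_posix()}"'

    return DATA_URI.sub(replace, html)
```

The returned HTML is much smaller, since each image is now only a file reference, and can then be passed to reverse_adoc.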

manuelfuenmayor commented 9 months ago

There is this case of text format: format1

It seems more like a sub-section than a numbered list.

ronaldtse commented 9 months ago

I agree. Let’s make it into a subsection.

ronaldtse commented 9 months ago

@manuelfuenmayor is this work completed?

manuelfuenmayor commented 9 months ago

@ronaldtse this work is really far from being completed.

This document has more than 800 not-so-simple tables (with images embedded) that I have to correct manually because reverse_adoc doesn't encode them 100% correctly (this is not its fault; the document's HTML is complex).

I've been tweaking the reverse_adoc code to see if I can ease the workload a little. I've done a couple of things already.

I estimate a delivery time of one to two weeks.

ronaldtse commented 9 months ago

@manuelfuenmayor then maybe we should really fix up reverse_adoc to make it work…

manuelfuenmayor commented 8 months ago

Document encoded in https://github.com/metanorma/mn-samples-mlit/pull/2

metanorma / mn-samples-plateau

Convert "Handbook of 3D City Models: Standard Data Product Specification for 3D City Model" into Metanorma #1