metanorma / coradoc

Coradoc is the Core AsciiDoc Parser used by Metanorma

coradoc: partition the output .adoc file based on heading/style levels not working #133

Status: Closed (ReesePlews closed 1 month ago)

ReesePlews commented 1 month ago

hello @hmdne

i am working with an HTML file created by loading a correctly styled .docx (from an SDO) into LibreOffice then "save as" into .html. additionally from this "save as" 43 images are output along with the single .html file.

the current Gemfile contains

# gem "coradoc"
# 20240920
gem "coradoc", git: "https://github.com/metanorma/coradoc"
gem "fontist", git: "https://github.com/hmdne/fontist", ref: "patch-1"

coradoc confirmation:

PS E:\github\iso-22726-1\conversion> bundle info coradoc
  * coradoc (1.1.2 0c1dbc3)
        Summary: AsciiDoc parser for metanorma
        Homepage: https://www.metanorma.org
        Source Code: https://github.com/metanorma/coradoc
        Path: C:/tools/ruby33/lib/ruby/gems/3.3.0/bundler/gems/coradoc-0c1dbc37e764
        Reverse Dependencies:
                metanorma-plugin-lutaml (0.7.9) depends on coradoc (~> 1.1.1)

the command line information shows the following:

PS E:\github\iso-22726-1\conversion> bundle exec reverse_adoc --help
Usage: reverse_adoc [options] <file>
    -m, --mathml2asciimath           Convert MathML to AsciiMath
    -o, --output=FILENAME            Output file to write to
    -e, --external-images            Export images if data URI
    -u [pass_through, drop, bypass, raise],
        --unknown_tags               Unknown tag handling (default: pass_through)
    -r, --require RUBYMODULE         Require additional Ruby file
        --track-time                 Track time spent on each step
        --split-sections LEVEL       Split sections up to LEVEL
    -v, --version                    Version information
    -h, --help                       Prints this help

using this command:

bundle exec reverse_adoc --split-sections 1 --external-images -o ./index.adoc ./infile.html

only index.adoc is being created along with 20 renumbered images in the "images" folder. there are no messages displayed during the conversion process.

i am wondering what i am doing wrong with the --split-sections 1 parameter? i have also tried --split-sections=1 but there is no change in output.

is there a hidden parameter that shows processing messages?

thank you.

hmdne commented 1 month ago

First of all, as I noted, I'd recommend converting .docx directly to .html (using the coradoc command, not reverse_adoc, which assumes HTML input). Under the hood it will use LibreOffice to do the necessary conversion, but it also contains a couple of filters to increase the success rate.

> is there a hidden parameter that shows processing messages?

Yes, --track-time

It is likely that your document, after conversion, doesn't contain H1, H2, H3, etc. tags, so the sections can't be deduced (and therefore can't be split). At this moment I have no further ideas on how to increase the success rate beyond the hint mentioned above, so unless I can get access to the document, I won't be able to help further.
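A quick way to test that hypothesis is to count the heading tags in the exported HTML. The following is a minimal sketch, not part of coradoc: `heading_counts` is a hypothetical helper, and it uses a regular expression rather than a real HTML parser only to stay dependency-free.

```ruby
# Rough check: tally <h1>..<h6> opening tags in an HTML string.
# NOTE: heading_counts is a hypothetical helper for illustration, not a coradoc API.
# A regex is not an HTML parser; this only gives a first impression.
def heading_counts(html)
  html.scan(/<h([1-6])\b/i)           # capture the level digit of each opening tag
      .flatten
      .map { |level| "h#{level}" }
      .tally                          # e.g. "h2" => number of <h2> tags
      .sort
      .to_h
end

sample = <<~HTML
  <h1>Scope</h1>
  <p>Text</p>
  <h2>General</h2>
  <h2>Terms</h2>
  <p class="heading-like"><b>Annex A</b></p>
HTML

counts = heading_counts(sample)
puts counts.inspect
```

To run it against a real export, something like `heading_counts(File.read("infile.html"))` should work. If a level is missing from the result, splitting on that level has nothing to work with. Paragraphs that are merely styled to look like headings, such as the last line of the sample, are not counted.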

ReesePlews commented 1 month ago

hello @hmdne thank you for the reply. i have uploaded my output from Libreoffice here i am thinking you may be able to access that.

in your earlier comment you suggested to convert .docx directly to .html using coradoc. i tried that but it seems html is not accepted as an output format; the help lists only "Possible values: adoc, coradoc_tree_debug". am i misunderstanding something here?

> bundle exec coradoc convert -I docx ./infile.docx -O html -o ./infile_coradoc_cvt.html
Expected '--output-format' to be one of adoc, coradoc_tree_debug; got html
Deprecation warning: Thor exit with status 0 on errors. To keep this behavior, you must define `exit_on_failure?` in `Coradoc::CLI`
You can silence deprecations warning by setting the environment variable THOR_SILENCE_DEPRECATION.

am i doing something wrong on the command line?

i think i am interpreting the command line parameters as shown here:

> bundle exec coradoc --help convert
Usage:
  coradoc convert [FILE]

Options:
  -o, [--output=OUTPUT]                                                         # Output file to write
  -I, [--input-format=INPUT_FORMAT]                                             # Define input format (defaults to input file extension)
                                                                                # Possible values: adoc, html, docx
  -O, [--output-format=OUTPUT_FORMAT]                                           # Define output format (defaults to output file extension)
                                                                                # Possible values: adoc, coradoc_tree_debug
  -r, [--require=REQUIRE]                                                       # Require additional Ruby file (eg. to load a plugin)
  -e, [--external-images], [--no-external-images], [--skip-external-images]     # Extract images from input document
  -u, [--unknown-tags=UNKNOWN_TAGS]                                             # Unknown tag handling
                                                                                # Default: pass_through
                                                                                # Possible values: pass_through, drop, bypass, raise
  -m, [--mathml2asciimath], [--no-mathml2asciimath], [--skip-mathml2asciimath]  # Convert MathML to AsciiMath
      [--track-time], [--no-track-time], [--skip-track-time]                    # Track time spent on each step
      [--split-sections=LEVEL]                                                  # Split sections into separate files up to a provided level
                                                                                # Default: 0

Required At Least One:
  --output  --output_format

any advice is helpful. thank you.

hmdne commented 1 month ago

@ReesePlews Output format must be adoc. We are not doing two steps here.

hmdne commented 1 month ago

I can't access your file. Also, it would be better if I got the docx file rather than an intermediary. Only then will I be able to assess the possibilities.

ReesePlews commented 1 month ago

thank you @hmdne it is now clear that the output must be .adoc.

also the LibreOffice SDK seems to not be used in this process. instead the entire LibreOffice install seems to be required. that has been installed on the metanorma machine and added to the PATH. a sections folder is being created, but only 2 files are output, no matter what the level i use.

it's really confusing; there seem to be two programs embedded together, coradoc and reverse_adoc. after development is complete, please update the documentation.

during the conversion, this message appears on the screen: C:/Users/admin/.gem/ruby/3.3.0/gems/premailer-1.11.1/lib/premailer/adapter/nokogiri.rb:68: warning: [DEPRECATION] positional arguments are deprecated use keyword instead.

i will ask @ronaldtse to give you access to the repo where the files are.

if they are helpful in testing these issues, please use them. thank you.

ronaldtse commented 1 month ago

@hmdne you should have access to the files here:

ronaldtse commented 1 month ago

@ReesePlews there is no more “reverse_adoc”, the functionality is incorporated into “coradoc” now.

ReesePlews commented 1 month ago

thanks @ronaldtse i will not test with the reverse_adoc any more.

an update on my checking is shown here:

in reference_docs/conversion, infile.docx (is the SDO published document)

using this command:

> bundle exec coradoc convert -I docx ./infile.docx --split-sections=1 -O adoc -o ./index.adoc

an index.adoc is created along with a sections folder containing only two files, section-01.adoc and section-02.adoc

i would have expected 1 section for each clause and annex, etc.

re-running with:

> bundle exec coradoc convert -I docx ./infile.docx --split-sections=2 -O adoc -o ./index.adoc

produces no visible change from --split-sections=1

thanks again for the help.

hmdne commented 1 month ago

I took a brief look at this document and found numerous issues that currently prevent us from converting it correctly. I will work over the following days to fix some of them.

ReesePlews commented 1 month ago

thank you @hmdne appreciate you taking time to check. please discuss with @ronaldtse about this.

so my understanding is:

are there any other options, other than a manual process?

thanks again for checking

hmdne commented 1 month ago

@ReesePlews

> coradoc will now only translate from docx to adoc.

No. It will internally use Libreoffice to translate docx to html and apply a couple of fixes. Which is why I suggest not doing that manually.

> what happened to the html to adoc option, is that still there or is it no longer working?

It is still present and is working.


I have made a couple of fixes to the process (PR will follow shortly). It now creates a wider tree of files and corrects some of the issues we had, but there's a significant issue I wasn't able to solve. Let me try to explain it in layman's terms:

We generally assume that titles (which is what we use to detect sections) are at the top level of the document. Let's assume a document is a flat tree that semantically looks like this (I will use AsciiDoc to show the semantics; of course we are dealing with a Word document, but showing DOCX code will help no one understand the issue):

== Heading

Content

=== Heading

|===
| some table
|===

This is fine. The document can be structured in HTML like, let's say:

<h1>Heading</h1>

<p>Content</p>

<h2>Heading</h2>

<table>(...)some table(...)</table>

This means that each element is at the top level. We support that and can split sections accordingly.
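To make the idea concrete, here is a simplified sketch of splitting a flat, top-level sequence of nodes at headings. It is not coradoc's actual implementation: the `Node` struct and `split_sections` are hypothetical names, and the node model is reduced to a kind, an optional heading level, and some text.

```ruby
# Hypothetical flat node model, for illustration only (not coradoc's classes).
Node = Struct.new(:kind, :level, :text)

# Start a new section at every top-level heading whose level is <= max_level.
def split_sections(nodes, max_level)
  sections = [[]]
  nodes.each do |node|
    # A qualifying heading opens a new section; everything else joins the current one.
    sections << [] if node.kind == :heading && node.level <= max_level
    sections.last << node
  end
  sections.reject(&:empty?)
end

doc = [
  Node.new(:heading, 1, "Heading"),
  Node.new(:para, nil, "Content"),
  Node.new(:heading, 2, "Heading"),
  Node.new(:table, nil, "some table"),
]

level1 = split_sections(doc, 1)
level2 = split_sections(doc, 2)
puts level1.length
puts level2.length
```

The sketch only looks at nodes in the top-level sequence. A heading nested inside another element (a list item, for example) never appears in that sequence, so it cannot start a section.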

The incoming document, though, has something that can't be represented in AsciiDoc, so let me write some HTML:

<ol start='3'>
  <li>
    <h1>Section three</h1>
  </li>
</ol>

<p>Section content</p>

So, basically, the author of this document embedded a title inside a one-element numbered list starting at number 3 (to add the number 3, of course).

I managed to write some exception code, which will extract a title from such a fragment. So that's no longer an issue.
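For illustration, the extraction can be sketched roughly as below. This is not the actual exception code in coradoc: `unwrap_headings` and `WRAPPED_HEADING` are hypothetical names, the sketch works on the HTML string with a regular expression instead of a parsed tree, and it assumes the heading contains plain text only (no nested tags).

```ruby
# Matches an ordered list holding exactly one item that holds exactly one heading.
# Hypothetical sketch: assumes the heading text contains no nested tags.
WRAPPED_HEADING = %r{
  <ol\b[^>]*>\s*
    <li\b[^>]*>\s*
      (<h([1-6])\b[^>]*>[^<]*</h\2>)\s*
    </li>\s*
  </ol>
}mix

# Replace each such list with the heading it wraps.
def unwrap_headings(html)
  html.gsub(WRAPPED_HEADING) { Regexp.last_match(1) }
end

sample = <<~HTML
  <ol start="3">
    <li>
      <h1>Section three</h1>
    </li>
  </ol>

  <p>Section content</p>
HTML

puts unwrap_headings(sample)
```

Lists with more than one item do not match the pattern and are left untouched, which keeps ordinary numbered lists intact.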

But then, likely because of that structure, in some fragments of the document LibreOffice outputs invalid HTML:

<ol start='4'>
  <li>
    <h1>Section four</h1>
  </li>
  <center>  <!-- What? Center can't happen inside OL! -->
    <table>(...)</table>
  </center></li>  <!-- LI is closed? But it wasn't even open! -->
  <li>
    <h1>Section five</h1>
  </li>
</ol>

<p>Section content</p>

I don't think I can do much about this form without possibly breaking other documents.

It is not that the incoming document is incorrect. I have inspected both DOCX and ODT versions of that file and they are correct. It is likely a bug in LibreOffice HTML converter.
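A rough way to spot the stray closing tag in such an export is to walk the tags with a stack. The sketch below is an illustration only: `nesting_problems` is a hypothetical helper, it assumes every non-void element has an explicit closing tag (valid HTML may omit some, which would produce false reports), and it does not check which elements are allowed inside which.

```ruby
# Elements that never have a closing tag (partial list, enough for this sketch).
VOID_TAGS = %w[br hr img meta link input col].freeze

# Walk the tags in document order and report closing tags that do not match
# the innermost open element. Hypothetical helper, not a coradoc API.
def nesting_problems(html)
  stack = []
  problems = []
  html.scan(%r{<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>}) do |slash, name|
    tag = name.downcase
    next if VOID_TAGS.include?(tag)

    if slash.empty?
      stack.push(tag)                       # opening tag
    elsif stack.last == tag
      stack.pop                             # properly nested closing tag
    else
      problems << "unexpected </#{tag}> inside <#{stack.last || 'document root'}>"
      # Recover: unwind to the matching open tag if there is one, else ignore the stray close.
      idx = stack.rindex(tag)
      stack.slice!(idx..) if idx
    end
  end
  problems
end

sample = <<~HTML
  <ol start='4'>
    <li>
      <h1>Section four</h1>
    </li>
    <center>
      <table></table>
    </center></li>
    <li>
      <h1>Section five</h1>
    </li>
  </ol>
HTML

puts nesting_problems(sample)
```

On the sample, only the unmatched `</li>` is reported. The `<center>` placed directly inside `<ol>` is a content-model violation, which this kind of balance check does not detect.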

What helps, is to open the document in LibreOffice, find each level 1 heading (there are 7 of them) and click the "Toggle ordered list" button in toolbar (the one highlighted in the middle of that image):

[image: LibreOffice toolbar with the "Toggle ordered list" button highlighted]

Then save that document as >>HTML<<.

Note that this is contrary to what I suggested earlier (to use plain DOCX). There is a reason for that.

We use the WordToMarkdown gem to automatically convert DOCX to HTML using LibreOffice. This gem also does some postprocessing on the generated HTML, which I assumed might be helpful. I just found out it has a bug: it converts superscript to first-level headings, which I think we can agree is incorrect. Each such heading is then treated as a section title, so the document gets split into further sections at footnote references, which is not what we want.

Also note, that by level 1 heading I mean things like "Scope" or "Normative References". Sections:

Don't generate a heading when exported to HTML. So those you will have to handle manually.

ReesePlews commented 1 month ago

@hmdne and @webdev778 thank you both for working on this modification.

i appreciate the detailed information. there are a lot of "intricate points" related to what is best and what formats are useful to pursue, etc. i cannot say i understand them to any deep level. what i do know is, the input document is one year old. i do not know about the internal production of this file by the SDO. because it was a published version i would have expected the styles to be uniform and conforming to ms-word. i dont know why such a strange structure would be used for specific parts like you showed.

from Jan-2025 the SDO will make an implementation change for document authoring. i dont know if similar .docx files will be distributed from this SDO, or if the format will be changed again. for now, this is working well and suitable for my use. i am sure other users could benefit too from this update. i will mention these things to @ronaldtse.

today i have converted my document using the earlier command:

bundle exec coradoc convert -I docx ./infile.docx --split-sections=1 -O adoc -o ./index.adoc

and i have experimented with higher level numbers. i am not sure if there is merit to more sections or not. after the conversion the goal is to revise the document. that revision could see a restructure of the document too. the project team is not sure at this point. more smaller sections are probably easier to work with. i will consider.

do you think there is merit loading the .docx into ODT and making an html export as you describe above? i am using LibreOffice 24.8. if you would like me to test i can try, just let me know.

thank you both again for working on this fix!

hmdne commented 1 month ago

> i dont know why such a strange structure would be used for specific parts like you showed.

I wouldn't say it's their fault. Likely MS Word just did that kind of formatting, and then the HTML exporter has a bug.

> do you think there is merit loading the .docx into ODT and making an html export as you describe above? i am using LibreOffice 24.8. if you would like me to test i can try, just let me know.

I have tried that on that document. No change.