PLATEAU Document 02 work

hmdne commented 5 months ago

Tasks that need to be done for the document 02 to be parsed (it will grow over time):

Breaking:

[x] handle documents with empty tables (#98)
[x] handle weird tables that need to be appended from the left
[x] handle documents that reference missing images
[x] handle different section conditions

Semantic:

[x] handle aside and table headers
[x] handle code block
[x] handle sections
[x] handle Annexes somehow

Minor bugs:

[x] in 1.4.2: line break character and list item issue
[x] in 1.4.2.4 (and further): numbered list made with CSS markup based on unnumbered list
[x] missing images fix introduced a bug that floods document with FIXME

Manual fixes needed:

Missing image (search for // FIXME)
in 1.4.2.2: 4th list level issue (using full width spaces and ・ to denote a 4th level list item)
in 1.4.2.2: list misaligned
in 4.4.1: list misaligned
in 5.4.5: lists to correct
some code blocks aren't distinguished from aside blocks. this needs to be manually corrected
in C.3.1.3: some extraneous \
in H.2.2: HTML tags
in general - inspect the lists. they are kinda hacky, since the document is hacky.

cc: @ronaldtse @ReesePlews

Metanorma PR checklist

[ ] Breaking changes (list related PRs)
[ ] Documentation update required (create task for this)
[ ] External dependency introduced (documentation update need)
[ ] Gem with native library introduced

hmdne commented 5 months ago

First question: this document starts with section 0. Should I remove the numbers, and if yes, then how to handle that section 0?

codecov[bot] commented 5 months ago

Codecov Report

Attention: Patch coverage is 96.15385% with 1 line in your changes missing coverage. Please review.

Project coverage is 98.34%. Comparing base (defb04a) to head (78225b1). Report is 67 commits behind head on main.

Files	Patch %	Lines
lib/coradoc/reverse_adoc/postprocessor.rb	85.71%	1 Missing :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #101 +/- ## ========================================== + Coverage 96.67% 98.34% +1.66% ========================================== Files 42 46 +4 Lines 1054 1327 +273 ========================================== + Hits 1019 1305 +286 + Misses 35 22 -13 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

hmdne commented 5 months ago

I have decided to set section 0's style to "abstract".

Then we have a number of annexes. Those are lettered. It would make sense to set them as "appendix", but then:

          when "改訂履歴" # Document history

is also appendix.

hmdne commented 5 months ago

So, I handled annexes as appendices. But document history is also an appendix for now.

hmdne commented 5 months ago

This is ready. I can find no more issues that can be fixed automatically. I am including the document 02 below:

document.tar.gz

ReesePlews commented 5 months ago

hello @hmdne thank you for starting work on document two.

the issues about "strange tables" may either be mistakes or just editorial changes to the document to help mitigate page breaks, etc. they could be restyled manually to make them more normal, then the conversion tool would handle them, and metanorma would compensate for things like page breaks.

what would you like me to do to help with this?

hmdne commented 5 months ago

@ReesePlews The problem in general is much more complicated unfortunately. There are different semantics in HTML and AsciiDoc, for instance, in AsciiDoc you should define a number of columns up front (in some cases it's mandatory, in some not?). Then there are colspans and rowspans - if only they were applied correctly, we wouldn't have an issue, but the document often has tables with colspans and rowspans that escape the table boundary. After all, it's not a real error in HTML - the table renders correctly, but in AsciiDoc we basically have no delimiter for rows like in HTML (at least if the number of columns is specified) and it's more strict from what I experienced.

Either way, I have fixed those issues and both 01 and 02 documents do handle all tables correctly. So in some cases we just adjust the colspans/rowspans, in some we add empty cells that are missing in a row. You can take a look at the test [1], for the cases that I have covered, though I'm sure there will be more problems in the future.

In case of document 02, there was one table which I had to add cells in front of, not at the end of the row.

Anyway, to be clear, the document is now rendered correctly, mostly, with the issues I have mentioned above.

[1] https://github.com/hmdne/coradoc/blob/6f24ceb124b1804420439bc1b35641bc354c2075/spec/reverse_adoc/lib/reverse_adoc/converters/table_ensure_spec.rb

ReesePlews commented 5 months ago

@hmdne perhaps we should add this to the mn-samples-plateau repo, so we can work with that more easily? something like 002-v4 ?

hmdne commented 5 months ago

@ReesePlews I have created a PR to this repository, feel free to merge.

ReesePlews commented 5 months ago

@hmdne thank you for the explanation. if i am understanding correctly there were some cases in the HTML for both doc1 and doc2 that had to be corrected manually, is that correct? and those manual corrections you made are in the html files currently found in reference-docs folder in the repo? if so, should we change the name so we know they are modified?

hmdne commented 5 months ago

@ReesePlews

No. I made no corrections by hand. I added a verbatim output from reverse_adoc and marked some places (in the first post) where I found hand corrections will be needed.

Anyway, while the tables were quite hard to do, this I was able to solve with algorithms. The bigger problem are lists (and indentation). In the incoming document, they tend to be... without any apparent order. Those are made using one or more of:

CSS styles
hand written (ie. using 1., 1), other symbols)
wide space (the CJK character) indentation

I was able to just handle the CSS styles, which is why sometimes there are fragments like:

[none]
**** List item

hmdne commented 5 months ago

To be clear, I used this document as input, also verbatim:

https://github.com/metanorma/coradoc/issues/77

(The second HTML)

hmdne commented 5 months ago

where I found hand corrections will be needed.

In AsciiDoc though, not in HTML.

ReesePlews commented 5 months ago

@hmdne thank you for the information. many things are new to me about this conversion process, so there is some confusion. for both docs 1 and 2 once a converted set of asciidoc files have been produced, i dont think we will be using the original html anymore, as the metanorma tool chain would be generating all of the doc types (html, doc, pdf). of course we want the converter to handle as much as possible but when the input html data is so messed up, there is only so much it can do.

if i am understanding correctly, the manual revisions you have made now (in the original html for both documents) and is enough to get highly usable asciidoc bases ready for further work.

please do the merge to the main branch. i am still somewhat hesitant about doing that, especially when so much manual work has been done i dont want to cause a problem. thank you

hmdne commented 5 months ago

@ReesePlews As I said before, no manual modifications have been done to the original HTML document (nor to AsciiDoc). All modifications are done automatically, on the HTML document during the conversion process.

If you use reverse_adoc (current branch) on the input document, you will have exactly the same AsciiDoc output as I have committed to the -samples repo.

Do you want me to merge this PR, or https://github.com/metanorma/mn-samples-plateau/pull/23 ?

ReesePlews commented 5 months ago

@hmdne thank you for clarifying and updating the conversion tool to handle more unique import cases.

yes, please the PR so that doc2 will appear in the sources directory. later today i will try and run the converter on doc2 using the instructions from our work with doc1.

thank you again for making these updates to the conversion tool.

ReesePlews commented 5 months ago

@hmdne a question...

i have tried to modify my .gitignore file so that my working tests do not get pushed to the main repo. however github desktop is asking me to submit these to "main" should i do that, for the following two files?

hmdne commented 5 months ago

Gemfile surely shouldn't be gitignored in upstream code. Also, instead of using .gitignore, you may use a local exclude:

.git/info/exclude

It has the same syntax as .gitignore, but is local to your repository. It works alongside .gitignore.

ReesePlews commented 5 months ago

thank you @hmdne i will remove that line from gitignore and read about "exclude"

ReesePlews commented 5 months ago

hello @hmdne what should i do with the edit to my Gemfile? should i commit back to the server or stash that?

hmdne commented 5 months ago

@ReesePlews IIRC you just uncommented "sassc". It should be safe to commit.

ReesePlews commented 5 months ago

hello @hmdne i committed the Gemfile but it stopped with that same --no-no-install-fonts error can you check that when you have time. thank you. https://github.com/metanorma/mn-samples-plateau/actions/runs/9416095705/job/25938475324

ronaldtse commented 5 months ago

@ReesePlews I will update the workflow. It's a separate issue.

ronaldtse commented 5 months ago

@ReesePlews the fixes are here: https://github.com/metanorma/mn-samples-plateau/pull/24

ReesePlews commented 5 months ago

thanks @ronaldtse, does each repo need to be updated?

ronaldtse commented 5 months ago

@ReesePlews yes.

FYI @CAMOBAP the impact of not updating v1 is real.

CAMOBAP commented 5 months ago

@ronaldtse I understand, but I will still try to defend the approach to not do it.

GitHub Actions delivery is not much different from delivering other software, we have to update major on breaking changes, this is what happens with install-fonts flag rename

If we decide to update v1 instead of moving to the next one, why do we use versioning for actions at all?

Versioning is not always about comfort but about consistency too.

We have some abandoned flavors that are working with old version of metanorma, v1 update will break them

If one day metanorma will introduce breaking changes i suppose we will not update previous one, so why should we do this way with GHA?

ronaldtse commented 5 months ago

I don’t disagree with the principles you stated but I have two comments.

The “ruby/setup-ruby” action supports all versions of Ruby, and has been at v1 forever. This is good for ensuring that users who adopted it previously will continue to have their jobs working.
If an action only works for a limited set of versions of Metanorma, we should lock the action only for those versions.

From the users’ perspective, doing an upgrade on an action step, when they have no idea what has even changed (and clearly they won’t care about the difference in no-install or no-no-install), is just an annoyance and leaves a bad taste.

In any case, we cannot have an action version that fails by default.

ronaldtse commented 5 months ago

And this is probably the moment we think about allowing users to choose Metanorma versions in the action!

CAMOBAP commented 5 months ago

In any case, we cannot have an action version that fails by default.

Good point, let me implement the fix that will handle flags based on a version of metanorma

And this is probably the moment we think about allowing users to choose Metanorma versions in the action!

Users already have control over this, by choosing the version in:

docker container setup
actions-mn/setup

metanorma / coradoc

PLATEAU Document 02 work #101

Metanorma PR checklist

Codecov Report