semanticClimate / cma3-test

CMA 3 test: CSS needed for document typesetting and automation of manifest and ToCs

Generating HTML tags with CSS - is it a good idea? Will it work? #2

Open mrchristian opened 10 months ago

mrchristian commented 10 months ago

Question: I need to ask for a bit of a sanity check on how we are combining HTML and CSS for a publishing pipeline we are putting together to demo semantic publication of UNFCCC publications. We're at a crossroads and need some advice.

Tagging to see if you can comment please @johanneswilm and @MurakamiShinyu

We have an HTML file that has been generated by the text and data mining software pyamihtml. The HTML currently contains only generated Divs and Spans - see example: 1_4_CMA_3_decis.html. The HTML file was generated as a conversion from this PDF file: 1_4_CMA_3.pdf.

The question is whether it's a good idea to use classes on the Divs and Spans to generate different HTML tags, which we'd then also use as CSS type selectors - such as sections, H1-6, anchor links, paragraphs, lists, etc.

Using the pyamihtml software we can reprocess the HTML file to add classes to the Divs and Spans, or alternatively we can add the HTML tags needed.
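To make the second option concrete, here is a minimal sketch of turning class-tagged Divs and Spans into real HTML tags. It uses only the Python standard library rather than pyamihtml itself, and the class names in `CLASS_TO_TAG` and the sample markup are made up for illustration; they are not what pyamihtml actually emits.

```python
import xml.etree.ElementTree as ET

# Hypothetical class-to-tag mapping (illustrative names only).
CLASS_TO_TAG = {
    "title": "h1",
    "section-heading": "h2",
    "para": "p",
    "section": "section",
}

def promote_classes(root):
    """Rename div/span elements in place based on their class attribute."""
    for el in root.iter():
        if el.tag not in ("div", "span"):
            continue
        for cls in el.get("class", "").split():
            if cls in CLASS_TO_TAG:
                el.tag = CLASS_TO_TAG[cls]
                break  # first matching class wins
    return root

# Made-up sample fragment; assumes well-formed (XHTML-style) input.
sample = (
    '<div class="section">'
    '<div class="section-heading">Example heading</div>'
    '<div class="para">Some <span class="emph">text</span>.</div>'
    '</div>'
)
root = promote_classes(ET.fromstring(sample))
print(ET.tostring(root, encoding="unicode"))
```

The classes are kept on the renamed elements, so CSS can target either `h2` (type selector) or `.section-heading` (class selector).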

Is this just going to be bad practice?

Will we hit problems with Vivliostyle rendering?

Is this a problem for using W3C Publication Manifest, for generating various ToCs, or for other software that can use W3C Publication Manifest, such as https://www.npmjs.com/package/epubjs-cli?
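As a rough illustration of why real heading tags matter here, the sketch below collects `h1`-`h6` elements into a flat ToC and writes a minimal manifest. The file name `cma3.html`, the sample markup, and the helper names are assumptions for the example, and the manifest contains only a few basic W3C Publication Manifest fields, not a complete or validated one.

```python
import html
import json
import xml.etree.ElementTree as ET

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

def collect_headings(root):
    """Return (level, id, text) for every heading that carries an id."""
    entries = []
    for el in root.iter():
        if el.tag in HEADINGS and el.get("id"):
            text = "".join(el.itertext()).strip()
            entries.append((int(el.tag[1]), el.get("id"), text))
    return entries

def toc_nav(entries, href):
    """Render a flat ToC; nesting by level is left out of this sketch."""
    items = "".join(
        f'<li><a href="{href}#{anchor}">{html.escape(text)}</a></li>'
        for _level, anchor, text in entries
    )
    return f'<nav role="doc-toc"><ol>{items}</ol></nav>'

def build_manifest(title, href):
    """Minimal manifest with a single-document reading order."""
    return {
        "@context": ["https://schema.org", "https://www.w3.org/ns/pub-context"],
        "type": "Book",
        "name": title,
        "readingOrder": [{"url": href}],
    }

# Made-up sample document and hypothetical file name.
doc = ET.fromstring(
    '<body><h1 id="top">Title</h1>'
    '<div><h2 id="s1">Section one</h2><p>Text</p></div></body>'
)
entries = collect_headings(doc)
nav = toc_nav(entries, "cma3.html")
pub = build_manifest("CMA 3 (TEST)", "cma3.html")
print(nav)
print(json.dumps(pub, indent=2))
```

With only Divs and Spans, `collect_headings` would have to rely on class names instead, which works but ties the ToC step to one tool's naming.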

Here you can see a work in progress of the sample publication - CMA 3: FCCC/PA/CMA/2021/10/Add.1 (TEST)

Our end objective is to be able to automate the whole pipeline for a specific corpus, transforming PDFs into a package of HTML, W3C Publication Manifest, and other resources to create publications. In addition, we will be processing the HTML to add semantic markup and classes.

johanneswilm commented 10 months ago

The existing output format, with lots of left/right/top/bottom placement instructions, will likely make reflowing difficult. It seems to be meant for recreating an existing design exactly, which is not a good approach if the output is to be reused in other ways.

Of course, you may just be able to throw those out if the content is in the right order (document order).

Inline styles (as used here) are not a good idea if one wants to use a coherent style on lots of content. Again - these could be stripped out or maybe even be replaced by classes based on the current content (such as all elements with red background styling get a class called "red-background", etc.).
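A minimal sketch of both steps, assuming well-formed input: it removes the inline `style` attribute (positioning included) and adds a class wherever a known declaration was present. The `RULES` table and the sample markup are invented for the example, and note that any inline style not listed in `RULES` is simply dropped.

```python
import xml.etree.ElementTree as ET

# Placement properties that only make sense for the fixed PDF layout.
POSITION_PROPS = {"left", "right", "top", "bottom", "position"}

# Hypothetical mapping from a (property, value) pair to a class name.
RULES = {("background-color", "red"): "red-background"}

def parse_style(style):
    """Split 'a: b; c: d' into a dict of lower-cased property names."""
    decls = {}
    for part in style.split(";"):
        if ":" in part:
            name, value = part.split(":", 1)
            decls[name.strip().lower()] = value.strip()
    return decls

def clean_styles(root, rules=RULES):
    """Drop inline styles; add classes for declarations listed in rules."""
    for el in root.iter():
        style = el.attrib.pop("style", None)
        if style is None:
            continue
        decls = {k: v for k, v in parse_style(style).items()
                 if k not in POSITION_PROPS}
        classes = el.get("class", "").split()
        for (prop, value), cls in rules.items():
            if decls.get(prop) == value and cls not in classes:
                classes.append(cls)
        if classes:
            el.set("class", " ".join(classes))
    return root

sample = '<div><span style="left: 72px; top: 100px; background-color: red">Note</span></div>'
cleaned = clean_styles(ET.fromstring(sample))
print(ET.tostring(cleaned, encoding="unicode"))
```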

In my own opinion, inline styles work well if one has some unique content that is only used once, for example to draw an image or similar, where different divs are placed in specific places to make up parts of the image.

Using H1-6, paragraphs, etc. instead of spans and divs can have advantages in that it makes the semantic meaning of content clearer when a machine is reading the output. Screen readers, etc. will probably also pick up on that more easily. Whether that is a concern depends a bit on whether end users will use the output or whether they will only see a version that has already been printed (after going through Vivliostyle).

Alternatively, you may also be able to achieve the same or a similar result by assigning ARIA roles to the divs/spans.

See

[1] https://www.boia.org/blog/accessibility-tips-using-the-div-and-span-elements

[2] https://developer.mozilla.org/en-US/docs/Web/Accessibility/ARIA/ARIA_Techniques
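As a sketch of the ARIA alternative, the divs/spans can stay as they are and receive `role` (and, for headings, `aria-level`) attributes based on their classes. The class names and sample markup below are hypothetical; only the role values (`heading`, `list`, `listitem`) come from WAI-ARIA.

```python
import xml.etree.ElementTree as ET

# Hypothetical class names mapped to standard ARIA roles.
ROLE_FOR_CLASS = {
    "title": {"role": "heading", "aria-level": "1"},
    "section-heading": {"role": "heading", "aria-level": "2"},
    "item-list": {"role": "list"},
    "item": {"role": "listitem"},
}

def add_aria_roles(root):
    """Add role attributes to div/span elements that do not have one yet."""
    for el in root.iter():
        if el.tag not in ("div", "span") or el.get("role"):
            continue
        for cls in el.get("class", "").split():
            attrs = ROLE_FOR_CLASS.get(cls)
            if attrs:
                for name, value in attrs.items():
                    el.set(name, value)
                break  # first matching class wins
    return root

sample = (
    '<div>'
    '<div class="section-heading">Example heading</div>'
    '<div class="item-list"><div class="item">First</div></div>'
    '</div>'
)
tagged = add_aria_roles(ET.fromstring(sample))
print(ET.tostring(tagged, encoding="unicode"))
```

This helps assistive technology, but CSS type selectors such as `h2` will not match these elements; attribute selectors like `[role="heading"]` would be needed instead.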

mrchristian commented 10 months ago

Hi Johannes,

Thanks for helping out here, appreciated.

The exact positioning coordinates and the inline styles are there as artifacts of the Text Data Mining processing and conversion from PDF to HTML. Both of these sets of data can be discarded in further rounds of document processing.

The coordinates data will first be used to anchor the footnotes and then we'll discard or ignore that data.

Inline styles are there because the PDF to HTML conversion only allows us to capture font name, font size, and font characteristic, and each generated piece of CSS is specific to one DIV or SPAN. These are then reprocessed and normalised down from 1000+ classes to under 20 or so. In the end we'll move these out from being inline into a CSS file, or have them overridden in an external CSS file.
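Roughly, the normalisation could work like the sketch below. This is an illustration under assumptions, not the actual pyamihtml step: it keys each inline style on font family, rounded font size, and weight, gives each distinct key a short generated class name (`s1`, `s2`, ...), and emits the matching CSS rules. The sample styles are made up.

```python
import re

def style_key(style):
    """Reduce an inline style string to (family, rounded size, weight)."""
    decls = {}
    for part in style.split(";"):
        if ":" in part:
            name, value = part.split(":", 1)
            decls[name.strip().lower()] = value.strip()
    size = decls.get("font-size", "")
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(px|pt)", size)
    if match:
        # Rounding merges near-identical sizes such as 9.96px and 10.02px.
        size = f"{round(float(match.group(1)))}{match.group(2)}"
    return (decls.get("font-family", ""), size, decls.get("font-weight", "normal"))

def build_classes(styles):
    """Assign one generated class name per distinct style key."""
    class_for = {}
    for style in styles:
        key = style_key(style)
        if key not in class_for:
            class_for[key] = f"s{len(class_for) + 1}"
    return class_for

def to_css(class_for):
    """Emit one CSS rule per class, skipping empty values."""
    rules = []
    for (family, size, weight), name in class_for.items():
        props = [("font-family", family), ("font-size", size), ("font-weight", weight)]
        body = " ".join(f"{prop}: {value};" for prop, value in props if value)
        rules.append(f".{name} {{ {body} }}")
    return "\n".join(rules)

styles = [
    "font-family: Times; font-size: 9.96px",
    "font-family:Times;font-size:10.02px; left: 5px",
    "font-family: Times; font-size: 10px; font-weight: bold",
]
class_for = build_classes(styles)
css = to_css(class_for)
print(css)
```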

aria-role: I'll have a look.

@rqpe have you seen aria role?

FYI - this is the pipeline we're building to convert PDF to HTML https://github.com/petermr/pyamihtml/discussions/5

I'll update you on our progress, but the points you mention really help us down the line.

rqpe commented 10 months ago

Hi Simon,

I haven't seen aria role.

(Replying by email to Simon Worthington's comment of Mon, 11 Dec 2023, 09:55.)
