metanorma / obp-access

Data access on ISO OBP

Output NISO STS XML format from an ISO OBP HTML #7

Open ronaldtse opened 7 months ago

ronaldtse commented 7 months ago

From:

The ISO OBP HTML is actually rendered from data of an XML format called "NISO STS" (the ISO flavor of it).

Instead of just the HTML output, we also want to output the NISO XML format.

Use case

Some ISO authors have to start documents from the ISO website as they are unable to obtain the STS files.

On the ISO OBP, informative content, such as vocabulary, is freely available. The best approach is to give these authors an automated way to extract this data.

Mechanism

The steps shall be as follows:

  1. Download the index.html for a particular URN
  2. Convert index.html into an STS XML document (using the code in PR #6), then use the new sts gem to write it out as STS XML.

This is a Ruby script that partially parses index.html; it is not yet complete. It is provided in:

CLI:

$ bundle exec exe/obp-access -o output iso:std:iso:5598:ed-3:v1:en
$ bundle exec exe/obp2sts output/index.html

=> writes out:

Library:

stshtml = Obp::StsHtml.new('output/index.html')
stsxmltext = stshtml.clean.to_xml
sts = Sts::NisoSts::Standard.from_xml(stsxmltext)
puts sts.to_xml(pretty: true)

Work to be done

roberthopman commented 6 months ago

With ruby 3.3.0 on branch rt-obp2sts, running bundle exec exe/obp-access -o output iso:std:iso:5598:ed-3:v1:en and then bundle exec exe/obp2sts output/index.html produces index.html.sts.xml containing a single line with <standard/>. Is this the expected output at this moment? @ronaldtse

ronaldtse commented 5 months ago

@roberthopman sorry for the delay in replying!

one line with <standard/>

No, it is supposed to provide content. Right now, the output is incorrect. The task is to fix the output.

So there are 3 steps:

  1. The input is output/index.html, which is the raw HTML fetched using the obp-access command. It is correct.
  2. The intermediary file is output/index.html.xml, which is "supposed" to be parseable by the sts gem. It does contain content, but is apparently incorrect and hence cannot be parsed by the sts gem.
  3. The final file is output/index.html.sts.xml, which is generated from output/index.html.xml. It is empty because it cannot read the intermediary file.
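
To see which of the three stages breaks on a given run, a small check of the output directory can help. The following is only a sketch: the helper name `stage_status` and the `STAGES` constant are hypothetical and not part of obp-access, and the "empty" test is a simple heuristic that treats a lone self-closing root element (such as <standard/>) as carrying no content.

```ruby
# Hypothetical diagnostic helper, not part of obp-access.
# File names follow the three steps described above.
STAGES = {
  input: "index.html",
  intermediate: "index.html.xml",
  final: "index.html.sts.xml"
}.freeze

# Matches a file body that is just one self-closing root element,
# optionally preceded by an XML declaration, e.g. "<standard/>".
EMPTY_ROOT = %r{\A(<\?xml[^>]*\?>\s*)?<[\w:-]+[^<>]*/>\z}

# Returns a hash of stage => :missing, :empty, or :has_content.
def stage_status(dir)
  STAGES.transform_values do |name|
    path = File.join(dir, name)
    next :missing unless File.file?(path)

    body = File.read(path).strip
    body.empty? || body.match?(EMPTY_ROOT) ? :empty : :has_content
  end
end
```

Calling `stage_status("output")` after running both commands should then show the first stage that is `:missing` or `:empty`, which is where the conversion needs attention.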

ronaldtse commented 5 months ago

This is now updated in #6, which now creates a document structure using the sts gem.

It already does a reasonable transform of the HTML file into STS by declarative building.

There are a number of TODOs in the code:

  1. The mixed content elements, such as <p> and <sec>, do not yet carry their full content: e.g. if you had <std-id> or <i> inside the content, they will be lost. This is a general issue with Sts::Mapper, because I don't know how to use it properly. (ping @HassanAkbar )
  2. The "Terms and definitions" section needs additional treatment, see the sample document.
  3. The "Normative references" section needs treatment like the bibliography, which currently works to some extent (it does build a proper <ref-list> from ISO 5598, but I haven't tested it against other documents).
  4. Annexes are not handled right now.
  5. (Important for @HassanAkbar ) I can't get the sts.to_xml method to generate XML content except for <standard .../>. Please help.
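
To illustrate what "mixed content" means in TODO 1, here is a minimal sketch using REXML (shipped with Ruby as a bundled gem) rather than Sts::Mapper, whose proper usage is the open question above. The helper `mixed_content_parts` is hypothetical. It only shows that text and inline elements such as <std-id> and <i> are interleaved children of <p>, and that reading just the element's text drops the inline parts.

```ruby
require "rexml/document"

# Hypothetical helper, not part of the sts gem: flattens a mixed-content
# element into an ordered list of [name, value] pairs, using "#text" as
# the name for plain text runs.
def mixed_content_parts(xml_fragment)
  root = REXML::Document.new(xml_fragment).root
  root.children.map do |node|
    case node
    when REXML::Text then ["#text", node.value]
    when REXML::Element then [node.name, node.text.to_s]
    end
  end.compact # comments and other node types are skipped
end

# Returns only the first text child, which is how inline content gets lost.
def first_text_only(xml_fragment)
  REXML::Document.new(xml_fragment).root.text
end

sample = "<p>See <std-id>ISO 5598</std-id> for <i>details</i>.</p>"
puts mixed_content_parts(sample).inspect
puts first_text_only(sample).inspect
```

Whatever the eventual Sts::Mapper configuration looks like, it needs to preserve this ordering of text runs and child elements for <p> and <sec> to round-trip.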

We're getting there.

HassanAkbar commented 1 week ago

@ronaldtse I have a few questions related to this

  1. There is no mapping for the <a> tag in the sts-ruby gem. Should we map them to ext-link or somewhere else? There are also some internal references in <a> tags; where should we map those?
  2. I was unable to find a mapping for entailedTerm-num in the sts-ruby gem. Where should we map those?

Are there any guidelines or a mapping available from all the HTML classes to sts-ruby classes? Or are there some example documents that I can use as a reference for what the expected output should be for the output/index.html.xml and output/index.html.sts.xml files?
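
For question 1, one possible convention (an assumption for discussion, not something confirmed in this thread) is to split on whether the href is a same-document fragment: fragments become <xref> with a rid attribute, everything else becomes <ext-link> with xlink:href. The sketch below uses plain hashes and a hypothetical `map_anchor` helper; it does not call any sts-ruby classes.

```ruby
# Hypothetical mapping rule, not confirmed by the maintainers:
#   href="#some-id"  -> <xref rid="some-id">
#   anything else    -> <ext-link xlink:href="...">
def map_anchor(href)
  if href.start_with?("#")
    { element: "xref", attributes: { "rid" => href.delete_prefix("#") } }
  else
    { element: "ext-link", attributes: { "xlink:href" => href } }
  end
end

puts map_anchor("#sec_3.1").inspect
puts map_anchor("https://www.iso.org/obp").inspect
```

Keeping the rule in one small function makes it easy to change once the intended mapping is agreed.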