Open ronaldtse opened 7 months ago
With Ruby 3.3.0 on branch `rt-obp2sts`, running `bundle exec exe/obp-access -o output iso:std:iso:5598:ed-3:v1:en` and then `bundle exec exe/obp2sts output/index.html` produces `index.html.sts.xml` containing one line with `<standard/>`. Is this the expected output at this moment? @ronaldtse
@roberthopman sorry for the delay in replying!
> one line with `<standard/>`
No, it is supposed to provide content. Right now, the output is incorrect. The task is to fix the output.
So there are 3 steps:

1. `output/index.html`, which is the raw HTML fetched using the `obp-access` command. It is correct.
2. `output/index.html.xml`, which is "supposed" to be parseable by the `sts` gem. It does contain content, but is apparently incorrect and hence cannot be parsed by the `sts` gem.
3. `output/index.html.sts.xml`, which is generated from `output/index.html.xml`. It is empty because the `sts` gem cannot read the intermediary file.

This is now updated in #6, which now creates a document structure using the `sts` gem.
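The empty-output symptom from step 3 can be detected mechanically by checking whether the generated file is just a bare `<standard/>` root. This is a minimal sketch using REXML rather than the `sts` gem; the helper name `empty_standard?` is made up for illustration.

```ruby
require "rexml/document"

# Hypothetical helper: true when the root is <standard> and it has neither
# child elements nor non-blank text (the "one line with <standard/>" case).
def empty_standard?(xml)
  root = REXML::Document.new(xml).root
  return false unless root && root.name == "standard"

  root.elements.empty? && root.texts.all? { |t| t.value.strip.empty? }
end
```

It could be called as `empty_standard?(File.read("output/index.html.sts.xml"))` after running `obp2sts`.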
It already does a reasonable transform of the HTML file into STS by declarative building.
There are a number of TODOs in the code:
- Elements such as `<p>` and `<sec>` do not fully contain proper content, e.g. if you had `<std-id>` or `<i>` inside the content, they will be lost. This is a general issue about `Sts::Mapper` because I don't know how to actually use it properly. (ping @HassanAkbar)
- The `bibliography`, which currently works to some extent (it does build a proper `<ref-list>` from ISO 5598, but I haven't tested against other documents).
- The `sts.to_xml` method does not generate XML content except for `<standard .../>`. Please help.

We're getting there.
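To illustrate the first TODO, here is a small sketch of how inline elements disappear when only the text of a `<p>` is read, and how walking the children keeps them. It uses REXML, not `Sts::Mapper` (whose actual API is not shown in this thread), and `mixed_content` is a hypothetical helper.

```ruby
require "rexml/document"

# Hypothetical helper: returns the direct children of an element in document
# order, keeping both text nodes and inline elements (recursively).
def mixed_content(element)
  element.children.map do |child|
    case child
    when REXML::Text    then [:text, child.value]
    when REXML::Element then [child.name.to_sym, mixed_content(child)]
    end
  end.compact # drops comments and other node types
end

para = REXML::Document.new(
  "<p>See <std-id>ISO 5598</std-id> for <i>terms</i>.</p>"
).root

first_text_only = para.text          # only the leading text node; <std-id> and <i> are missing
with_inlines    = mixed_content(para) # text and inline elements, in order
```

Whatever the mapper ends up doing, the converter would need the second kind of traversal for `<p>` and `<sec>` content to survive.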
@ronaldtse I have a few questions related to this:

- The `<a>` tag in the `sts-ruby` gem: should we map it to `ext-link` or somewhere else? There are some internal references in `a` tags as well; where should we map those?
- `entailedTerm-num` in the `sts-ruby` gem: where should we map those?

Are there any guidelines or a mapping available from all the HTML classes to `sts-ruby` classes? Or are there example documents that I can use as a reference for what the expected output should be for the `output/index.html.xml` and `output/index.html.sts.xml` files?
From:
The ISO OBP HTML is actually rendered from data of an XML format called "NISO STS" (the ISO flavor of it).
Instead of just the HTML output, we also want to output the NISO XML format.
Use case
Some ISO authors have to start documents from the ISO website as they are unable to obtain the STS files.
On the ISO OBP, informative content, such as vocabulary, is freely available. The best way is to give them an automated way to extract this data.
Mechanism
The steps shall be as follows:
1. Fetch `index.html` for a particular URN.
2. Given `index.html`, convert it into an STS XML document (using the code in PR #6), and then use the new `sts` gem to write it as STS XML.

This is a Ruby script that somewhat parses `index.html`; it's not yet complete. It is provided in #6.
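The two steps can be sketched as a small plan of commands and expected files. The commands and file names are the ones quoted in this thread; the helper `obp_pipeline` and its `out_dir:` parameter are hypothetical, and nothing is executed unless the commented `system` line is enabled.

```ruby
# Hypothetical helper: builds the two commands and lists the files they are
# expected to write. It does not run anything by itself.
def obp_pipeline(urn, out_dir: "output")
  html = File.join(out_dir, "index.html")
  {
    fetch:   ["bundle", "exec", "exe/obp-access", "-o", out_dir, urn],
    convert: ["bundle", "exec", "exe/obp2sts", html],
    outputs: [html, "#{html}.xml", "#{html}.sts.xml"]
  }
end

plan = obp_pipeline("iso:std:iso:5598:ed-3:v1:en")
# To actually run it from the repository root (assumption: both executables exist there):
# system(*plan[:fetch]) && system(*plan[:convert])
```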
CLI:
=> writes out:
- `output/index.html.xml`: STS XML file generated by `obp2sts`
- `output/index.html.sts.xml`: STS XML file generated by the `sts` gem given `output/index.html.xml` as input

Library:
Work to be done
- The `StsHtml` class completely converts all content from HTML to STS
- The output of `StsHtml#to_xml` is properly parseable by the `sts` gem (main branch)
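A first sanity check toward the second item could be confirming that the intermediate XML is at least well-formed before handing it to the `sts` gem. This sketch uses REXML and a hypothetical helper `well_formed?`; well-formedness is necessary but does not by itself guarantee that the `sts` gem accepts the document.

```ruby
require "rexml/document"

# Hypothetical helper: true when the string parses as XML and has a root element.
def well_formed?(xml)
  doc = REXML::Document.new(xml)
  !doc.root.nil?
rescue REXML::ParseException
  false
end
```

For example, `well_formed?(File.read("output/index.html.xml"))` would flag unbalanced tags in the intermediary file early.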