Create scraper for ISO OBP site

The ISO OBP (Online Browsing Platform) at https://www.iso.org/obp/ui provides a viewer to preview ISO standards.

It is a single-page application (SPA) built on Vaadin, so page content is fetched through AJAX and rendered using JavaScript.

The Ruby library to be used for scraping is:

https://github.com/rubycdp/vessel

When you search for a standard, the site takes you to a page whose URL ends in a URN, e.g.:

https://www.iso.org/obp/ui#iso:std:iso:5598:ed-3:v1:en

Here, iso:std:iso:5598:ed-3:v1:en is the URN of the standard to load.
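
As a minimal sketch of the URL scheme just described, the page address can be derived from the URN by appending it as the fragment. The helper name obp_url and the loose URN pattern are assumptions, inferred only from the single example above:

```ruby
# Builds the OBP page address for a URN by appending it as the URL fragment.
# The pattern is a loose shape check inferred from one example URN,
# not an official grammar for ISO URNs.
def obp_url(urn, base: "https://www.iso.org/obp/ui")
  pattern = /\Aiso:std:[a-z0-9\-]+(?::[a-z0-9\-]+)+\z/i
  raise ArgumentError, "not an ISO standard URN: #{urn.inspect}" unless urn.match?(pattern)

  "#{base}##{urn}"
end

puts obp_url("iso:std:iso:5598:ed-3:v1:en")
# prints https://www.iso.org/obp/ui#iso:std:iso:5598:ed-3:v1:en
```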

Once on that page, you will see 3 pieces of content:

Title (above the two content panes)
Table of contents (as the left nav bar) (we don't need it)
Main content (main pane)

The URN iso:std:iso:5598:ed-3:v1:en brings you to the "ISO 5598:2019(en)" standard.

In the main content pane, you have to scroll to the bottom bit by bit in order to get all the content rendered. The main content can also contain figures, which we need to store.

Once all content is loaded, the page is ready for scraping.
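
The incremental scrolling could be sketched roughly as follows. This assumes a page object that responds to evaluate(js) the way Ferrum::Page does (Vessel is built on Ferrum). The selector of the scrollable pane is not given in this issue, so it is a required argument, and the step size, pause and settle count are arbitrary guesses:

```ruby
require "json"

# Scrolls the element matching `selector` down in small steps until it has
# stayed at the bottom for `settle` consecutive checks, so that content
# rendered late (which grows scrollHeight) resets the counter.
# Returns the number of scroll steps taken.
def scroll_to_bottom(page, selector:, step: 600, pause: 0.3, settle: 3, max_steps: 2000)
  js = <<~JS
    (function () {
      var el = document.querySelector(#{selector.to_json});
      if (!el) return null;
      el.scrollTop = el.scrollTop + #{step};
      return [el.scrollTop + el.clientHeight, el.scrollHeight];
    })()
  JS

  at_bottom = 0
  max_steps.times do |i|
    result = page.evaluate(js)
    raise ArgumentError, "no element matches #{selector}" if result.nil?

    bottom, height = result
    # Allow 1px of slack for fractional scroll positions.
    at_bottom = bottom >= height - 1 ? at_bottom + 1 : 0
    return i + 1 if at_bottom >= settle

    sleep pause
  end
  raise "gave up after #{max_steps} scroll steps"
end
```

Requiring several consecutive "at bottom" readings is one way to cope with content that appears after a delay; waiting for network idle would be an alternative, but whether that works on this site is untested here.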

The goal is to write a Ruby script that can be invoked like this:

$ obp-access.rb -o iso-5598 iso:std:iso:5598:ed-3:v1:en

Here the -o option specifies the output directory. The contents should look like:

iso-5598/
+- images/
   +- figure1.png
   +- figure2.png
   +- ...
+- metadata.yaml
+- index.html

Where:

images/ is the directory of all images scraped in the main content pane
metadata.yaml stores the scrape date, the document identifier, the document title and the URN
index.html is the content of the main content pane, starting at the element with the class sts-standard.
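
A possible starting point for the command line and the output layout above, using only Ruby's standard library. The helper names and the --output long form are assumptions; the issue only specifies -o:

```ruby
require "optparse"
require "fileutils"

# Parses `-o OUTPUT_DIR URN` and returns { output: ..., urn: ... }.
def parse_args(argv)
  options = { output: nil }
  parser = OptionParser.new do |opts|
    opts.banner = "Usage: obp-access.rb -o OUTPUT_DIR URN"
    opts.on("-o", "--output DIR", "Directory to write the scraped standard to") do |dir|
      options[:output] = dir
    end
  end

  rest = parser.parse(argv) # non-destructive; returns the non-option arguments
  raise ArgumentError, "missing -o OUTPUT_DIR" if options[:output].nil?
  raise ArgumentError, "expected exactly one URN" unless rest.size == 1

  options.merge(urn: rest.first)
end

# Creates OUTPUT_DIR/images and returns the paths the scraper writes to.
def prepare_output_dir(dir)
  images = File.join(dir, "images")
  FileUtils.mkdir_p(images)
  { images: images,
    metadata: File.join(dir, "metadata.yaml"),
    index: File.join(dir, "index.html") }
end

# Path of the n-th figure (1-based), following the figureN.png naming above.
def figure_path(images_dir, number)
  File.join(images_dir, "figure#{number}.png")
end
```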

metadata.yaml looks like this:

---
scrape_date: 2024-04-06T00:00:00Z
identifier: ISO 5598:2019(en)
title: Fluid power systems and components — Vocabulary
urn: iso:std:iso:5598:ed-3:v1:en
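
Writing this file could look roughly like the sketch below, which uses Ruby's bundled YAML library. The function name write_metadata is hypothetical. Note that the YAML emitter may quote the timestamp string, so the generated file can differ cosmetically from the sample above while carrying the same values:

```ruby
require "yaml"
require "time"

# Writes metadata.yaml with the four fields shown in the sample above
# and returns the hash that was written.
def write_metadata(path, identifier:, title:, urn:, scraped_at: Time.now)
  metadata = {
    "scrape_date" => scraped_at.getutc.iso8601, # e.g. 2024-04-06T00:00:00Z
    "identifier" => identifier,
    "title" => title,
    "urn" => urn
  }
  File.write(path, YAML.dump(metadata))
  metadata
end
```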

index.html looks like this:

<div xmlns="http://www.w3.org/1999/xhtml" class="sts-standard">
  <div class="sts-section" id="toc_iso_std_iso_5598_ed-3_v1_en_sec_foreword">
    <h1 class="sts-sec-title">Foreword</h1>
    <div class="sts-p">ISO (the International Organization for Standardization) is a worldwide federation of national
      standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out
      through ISO technical committees. Each member body interested in a subject for which a technical committee
  <!-- ... -->
  <div class="sts-section" id="toc_iso_std_iso_5598_ed-3_v1_en_sec_intro">
    <h1 class="sts-sec-title">Introduction</h1>
    <div class="sts-p">In fluid power systems, power is transmitted and controlled through a fluid (liquid or gas) under
      pressure within a circuit.</div>
    <div class="sts-p">The purpose of this vocabulary is</div>
    <div class="list">
      <ul style="list-style-type: none">
        <li>
          <div class="sts-p"><span class="sts-label">—</span> to provide pertinent terms having a specific meaning in
            fluid power technology,</div>
        </li>
  <!-- ... -->
  <div class="sts-copyright">
    <div>©&nbsp;2019&nbsp;ISO — All rights reserved</div>
  </div>
  <div class="commentable" location="note"></div>
</div>
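
Once the page is fully rendered, saving the main pane could be sketched as below. As before, this assumes a page object with a Ferrum-style evaluate(js) method, and the name save_main_content is hypothetical. The browser's outerHTML serialization may not match the sample byte for byte (for example, the xmlns attribute or the indentation), and image src attributes are left untouched here:

```ruby
require "json"

# Serializes the first element matching `selector` (by default the
# sts-standard element) and writes it to `index_path`.
# Returns the number of bytes written.
def save_main_content(page, index_path, selector: ".sts-standard")
  js = <<~JS
    (function () {
      var el = document.querySelector(#{selector.to_json});
      return el ? el.outerHTML : null;
    })()
  JS

  html = page.evaluate(js)
  raise "no element matches #{selector}" if html.nil?

  File.write(index_path, html)
end
```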
