metanorma / obp-access

Data access on ISO OBP

Create scraper for ISO OBP site #1

Closed ronaldtse closed 7 months ago

ronaldtse commented 7 months ago

The ISO OBP (Online Browsing Platform) at https://www.iso.org/obp/ui provides a viewer to preview ISO standards.

It is a single-page application (SPA) built on Vaadin, so page content is fetched through AJAX and rendered using JavaScript.

The Ruby library to be used for scraping is:

When you search for a standard, it goes to a page with a URN, e.g.:

Here, iso:std:iso:5598:ed-3:v1:en is the URN of the standard to load.

Once on that page, you will see 3 pieces of content:

  1. Title (above the two content panes)
  2. Table of contents (as the left nav bar) (we don't need it)
  3. Main content (main pane)

The URN iso:std:iso:5598:ed-3:v1:en brings you to the "ISO 5598:2019(en)" standard, which looks like this:

[Image: Screenshot 2024-04-06 at 13 08 28]

In the main content pane, you have to scroll to the bottom bit by bit in order to get all the content rendered. The main content can also contain figures, which we need to store.

Once all content is loaded, the page is ready for scraping.
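The "scroll bit by bit until everything is rendered" step could be sketched as a small, driver-independent loop. This is only a sketch under assumptions: `scroll_until_loaded` and its parameters are hypothetical names, and `scroll_step` stands for any callable that scrolls the main pane down one step and returns some measure of how much content is rendered (for example a scroll height or a count of rendered sections). How that callable talks to the browser depends on the scraping library chosen.

```ruby
# Hypothetical helper: keep scrolling until the amount of rendered content
# stops changing for a few consecutive steps.
#
# scroll_step   - callable; scrolls one step and returns a comparable measure
#                 of rendered content (an assumption, not a real driver API)
# max_steps     - safety limit so the loop cannot run forever
# settle_rounds - how many unchanged readings count as "fully loaded"
# pause         - seconds to wait between steps so AJAX content can arrive
def scroll_until_loaded(scroll_step, max_steps: 500, settle_rounds: 3, pause: 0.5)
  last_measure = nil
  unchanged = 0

  max_steps.times do
    measure = scroll_step.call
    if measure == last_measure
      unchanged += 1
      return measure if unchanged >= settle_rounds
    else
      unchanged = 0
      last_measure = measure
    end
    sleep(pause)
  end

  raise "content did not settle after #{max_steps} scroll steps"
end
```

Treating "no change for several steps" as the stop condition is a heuristic; if the viewer exposes a more reliable signal that the bottom has been reached, that would be preferable.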

The goal is to write a Ruby script that accepts this:

$ obp-access.rb -o iso-5598 iso:std:iso:5598:ed-3:v1:en

Here the -o option specifies the output directory. The contents of the output directory should look like:

iso-5598/
+- images/
   +- figure1.png
   +- figure2.png
   +- ...
+- metadata.yaml
+- index.html
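A minimal sketch of the command-line handling and directory setup for the layout above, using only Ruby's standard library. The helper names `parse_args` and `prepare_output_dir` are hypothetical, and treating both `-o` and the URN as required is an assumption.

```ruby
require "optparse"
require "fileutils"

# Parses arguments of the form `-o OUTPUT_DIR URN`.
# Returns [output_dir, urn]; raises ArgumentError if either is missing.
def parse_args(argv)
  output_dir = nil
  parser = OptionParser.new do |opts|
    opts.banner = "Usage: obp-access.rb -o OUTPUT_DIR URN"
    opts.on("-o DIR", "Output directory") { |dir| output_dir = dir }
  end

  # `parse` (unlike `parse!`) leaves argv untouched and returns the
  # remaining non-option arguments.
  rest = parser.parse(argv)
  urn = rest.first
  raise ArgumentError, parser.banner if output_dir.nil? || urn.nil?

  [output_dir, urn]
end

# Creates OUTPUT_DIR and its images/ subdirectory; returns the images path.
def prepare_output_dir(output_dir)
  images_dir = File.join(output_dir, "images")
  FileUtils.mkdir_p(images_dir)
  images_dir
end
```

In a script, this would typically be called as `output_dir, urn = parse_args(ARGV)` followed by `prepare_output_dir(output_dir)`.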

Where:

metadata.yaml looks like this:

---
scrape_date: 2024-04-06T00:00:00Z
identifier: ISO 5598:2019(en)
title: Fluid power systems and components — Vocabulary
urn: iso:std:iso:5598:ed-3:v1:en
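Writing this file could look roughly like the following sketch, which uses Ruby's bundled `yaml` library. The helper names `build_metadata` and `write_metadata` are hypothetical, and the identifier and title are assumed to have already been extracted from the page.

```ruby
require "yaml"
require "time" # provides Time#iso8601

# Builds the metadata hash with the four keys shown in the sample.
def build_metadata(identifier:, title:, urn:, scraped_at: Time.now)
  {
    "scrape_date" => scraped_at.getutc.iso8601, # e.g. 2024-04-06T00:00:00Z
    "identifier" => identifier,
    "title" => title,
    "urn" => urn
  }
end

# Serializes the hash to OUTPUT_DIR/metadata.yaml; returns the file path.
def write_metadata(output_dir, metadata)
  path = File.join(output_dir, "metadata.yaml")
  File.write(path, metadata.to_yaml)
  path
end
```

Because `scrape_date` is stored as a string here, the YAML emitter is likely to wrap it in quotes rather than emit it bare as in the sample; the value itself is unchanged.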

index.html looks like this:

<div xmlns="http://www.w3.org/1999/xhtml" class="sts-standard">
  <div class="sts-section" id="toc_iso_std_iso_5598_ed-3_v1_en_sec_foreword">
    <h1 class="sts-sec-title">Foreword</h1>
    <div class="sts-p">ISO (the International Organization for Standardization) is a worldwide federation of national
      standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out
      through ISO technical committees. Each member body interested in a subject for which a technical committee
  <!-- ... -->
  <div class="sts-section" id="toc_iso_std_iso_5598_ed-3_v1_en_sec_intro">
    <h1 class="sts-sec-title">Introduction</h1>
    <div class="sts-p">In fluid power systems, power is transmitted and controlled through a fluid (liquid or gas) under
      pressure within a circuit.</div>
    <div class="sts-p">The purpose of this vocabulary is</div>
    <div class="list">
      <ul style="list-style-type: none">
        <li>
          <div class="sts-p"><span class="sts-label">—</span> to provide pertinent terms having a specific meaning in
            fluid power technology,</div>
        </li>
  <!-- ... -->
  <div class="sts-copyright">
    <div>©&nbsp;2019&nbsp;ISO — All rights reserved</div>
  </div>
  <div class="commentable" location="note"></div>
</div>
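
In the sample above, the section ids appear to be the URN with colons replaced by underscores, prefixed with `toc_` and suffixed with `_sec_` plus a section name. If that pattern holds (it is inferred from only the two ids shown, so treat it as an assumption), a hypothetical helper such as `section_id` could be used to locate sections while scraping:

```ruby
# Hypothetical helper: derives a section element id from the URN, following
# the pattern seen in the sample ids (an inference, not a documented rule).
def section_id(urn, section)
  "toc_#{urn.tr(':', '_')}_sec_#{section}"
end

section_id("iso:std:iso:5598:ed-3:v1:en", "foreword")
# => "toc_iso_std_iso_5598_ed-3_v1_en_sec_foreword"
```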

Joshuikrish commented 7 months ago

I think I could work on it. If you are interested, then let's collaborate on it.