Here, iso:std:iso:5598:ed-3:v1:en is the URN of the standard to load.
Once on that page, you will see 3 pieces of content:
Title (above the two content panes)
Table of contents (the left navigation bar; not needed)
Main content (main pane)
The URN iso:std:iso:5598:ed-3:v1:en brings you to the "ISO 5598:2019(en)" standard, which looks like this:
In the main content pane, you have to scroll to the bottom step by step so that all of the content gets rendered. The main content can also contain figures, which need to be stored.
Once all content is loaded, the page is ready for scraping.
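The step-by-step scrolling described above could be structured roughly as in the sketch below. It is library-agnostic on purpose: pane is a hypothetical adapter object (not a class from any real gem) that wraps whatever browser driver is used and responds to scroll_by(pixels) and at_bottom?. The step size, pause and round counts are assumed values that would need tuning against the real page.

```ruby
# Scrolls a content pane down in small steps until it has stayed at the
# bottom for several consecutive checks, so lazily rendered content has a
# chance to load.
#
# `pane` is a hypothetical adapter (an assumption of this sketch) that must
# respond to:
#   scroll_by(pixels) - scroll the main content pane down by `pixels`
#   at_bottom?        - true when the pane cannot be scrolled any further
#
# Returns true once the pane has settled at the bottom, false if it gave up.
def scroll_to_bottom(pane, step: 800, pause: 0.5, settle_rounds: 3, max_rounds: 1000)
  settled = 0

  max_rounds.times do
    pane.scroll_by(step)
    sleep(pause) # give pending AJAX requests time to finish and render

    if pane.at_bottom?
      settled += 1
      return true if settled >= settle_rounds
    else
      # New content extended the pane, so start counting again.
      settled = 0
    end
  end

  false # still not settled after max_rounds
end
```

Requiring several consecutive "at bottom" checks, rather than stopping at the first one, guards against the case where reaching the bottom triggers another batch of content.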
The goal is to write a Ruby script that accepts this:
images/ is the directory of all images scraped in the main content pane
metadata.yaml stores the document number, the document title and the URN
index.html is the content of the main content pane, starting at the element with the class sts-standard.
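As a rough illustration of the interface and output layout described here, the sketch below parses a URN argument plus an -o output directory option and prepares the three output paths. The script name scrape_obp.rb and the helper names parse_args and prepare_output_dir are made up for this example; only Ruby's standard optparse and fileutils libraries are used.

```ruby
require "optparse"
require "fileutils"

# Parses arguments of the form: URN -o OUTPUT_DIR
# Returns [urn, output_dir]; raises ArgumentError if either is missing.
def parse_args(argv)
  output_dir = nil
  parser = OptionParser.new do |opts|
    # The script name below is hypothetical.
    opts.banner = "Usage: scrape_obp.rb URN -o OUTPUT_DIR"
    opts.on("-o", "--output DIR", "Output directory") { |dir| output_dir = dir }
  end

  rest = parser.parse(argv) # non-destructive; returns the non-option arguments
  raise ArgumentError, parser.banner unless rest.length == 1 && output_dir

  [rest.first, output_dir]
end

# Creates the output directory (including images/) and returns the paths
# the scraper is expected to write to.
def prepare_output_dir(output_dir)
  images_dir = File.join(output_dir, "images")
  FileUtils.mkdir_p(images_dir)

  {
    images: images_dir,
    metadata: File.join(output_dir, "metadata.yaml"),
    index: File.join(output_dir, "index.html"),
  }
end
```

Keeping argument parsing and directory setup separate from the scraping itself makes both easy to check without opening a browser.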
metadata.yaml looks like this:
---
scrape_date: 2024-04-06T00:00:00Z
identifier: ISO 5598:2019(en)
title: Fluid power systems and components — Vocabulary
urn: iso:std:iso:5598:ed-3:v1:en
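A file with these four keys could be produced with Ruby's standard yaml library, as in the sketch below. The helper name build_metadata_yaml and its keyword arguments are illustrative assumptions. Note that Psych may emit the date string in quotes, so the output is not guaranteed to match the sample above byte for byte.

```ruby
require "yaml"
require "time"

# Builds the text of metadata.yaml from values scraped off the page.
# Keys are inserted in the same order as in the sample file.
def build_metadata_yaml(identifier:, title:, urn:, scraped_at: Time.now)
  metadata = {
    "scrape_date" => scraped_at.getutc.iso8601, # e.g. 2024-04-06T00:00:00Z
    "identifier" => identifier,
    "title" => title,
    "urn" => urn,
  }

  YAML.dump(metadata)
end
```

The returned string can then be written to the metadata.yaml path inside the output directory with File.write.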
The ISO OBP (Online Browsing Platform) at https://www.iso.org/obp/ui provides a viewer to preview ISO standards.
It is a single-page application (SPA) built on Vaadin, so page content is fetched through AJAX and rendered using JavaScript.
The Ruby library to be used for scraping is:
When you search for a standard, the site navigates to a page identified by a URN, e.g.:
Here the -o option specifies the output directory. The contents should look like:
index.html looks like this: