glossarist / iev

Ruby gem for fetching IEV Electropedia content in a structured manner
BSD 2-Clause "Simplified" License

Use IEV OpenData interface rather than scraping #5

Open ronaldtse opened 6 years ago

ronaldtse commented 6 years ago

Load IEV areas: (this is in JSON)

curl --header "Content-Type: application/json"  https://opendata-api.iec.ch/v1/opendata/areas

=>

[
  {"dateCreated":"","id":"101","description":"Mathematics"}, 
  {"dateCreated":"","id":"102","description":"Mathematics - General concepts and linear algebra"},
  ...
]
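
As a rough sketch of how the gem could consume this response, the Ruby snippet below parses the two sample entries above into an id-to-description lookup. The `parse_areas` helper and the `AREAS_JSON` constant are hypothetical names, and the payload is inlined rather than fetched, so nothing here depends on the API being reachable.

```ruby
require "json"

# Sample payload copied from the /areas response above (first two entries only).
AREAS_JSON = <<~JSON
  [
    {"dateCreated":"","id":"101","description":"Mathematics"},
    {"dateCreated":"","id":"102","description":"Mathematics - General concepts and linear algebra"}
  ]
JSON

# Hypothetical helper: build an { area id => description } lookup.
def parse_areas(json)
  JSON.parse(json).to_h { |area| [area["id"], area["description"]] }
end

areas = parse_areas(AREAS_JSON)
puts areas["101"]
```

Keying the lookup by the string id keeps it aligned with the `id` attribute used in the XML responses.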

Load 101 Mathematics: (this is in XML)

curl --header "Content-Type: application/xml"  https://opendata-api.iec.ch/v1/opendata/iev/101/{yourKey}

=>

<?xml version="1.0" encoding="UTF-8"?>
<subjectarea version="1.0beta" id="101">
<concept ievref="101-12-01">
<lang-set lang-id="en">
<term-name>information</term-name>
<definition>knowledge concerning objects, such as facts, events, things, processes, or ideas, including concepts, that within a certain context has a particular meaning</definition>
<pubdate>1998-04</pubdate>
<source>ISO/IEC 2382-1, 01.01.01, 701-01-01 MOD</source>
...

ronaldtse commented 6 years ago

The IEV OpenData API doesn't quite work out of the box for our use case, where we import only one term at a time. The reasons are below (also submitted as feedback to the IEC Terminology team).

  1. Inconsistency with loading formats.

The query for /areas is in JSON, but the term results are returned in XML. In particular, the term results endpoint does not support JSON (it returns XML regardless of the format requested).

  2. Inability to load a particular term entry.

When using IEV terms in a standard document, it is most convenient to refer to a term by its “unique ID” (i.e. the IEV term ID, like 101-12-09).

Currently, the OpenData API only provides a method to load all entries within an area, such as:

https://opendata-api.iec.ch/v1/opendata/iev/101/{yourKey}

This request returns every term in the 101 area; the response is very long, and most of it is irrelevant when only one term is needed.

We hope there will be an additional endpoint like https://opendata-api.iec.ch/v1/opendata/iev/101/12-09/{yourKey} that returns a single term (with all its associated languages, or with the option to load only one language).
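
Until such an endpoint exists, a client has to download the whole area and filter locally. The following is a minimal sketch of that workaround, assuming the response shape shown in this issue and using REXML, which ships with standard Ruby installs. `find_lang_sets` is a hypothetical helper, and the sample XML is a trimmed copy (definitions omitted) of the 101-12-01 entries quoted in this thread.

```ruby
require "rexml/document"

# Trimmed sample of an area response (definitions omitted for brevity).
AREA_XML = <<~XML
  <subjectarea version="1.0beta" id="101">
    <concept ievref="101-12-01">
      <lang-set lang-id="en">
        <term-name>information</term-name>
        <pubdate>1998-04</pubdate>
        <source>ISO/IEC 2382-1, 01.01.01, 701-01-01 MOD</source>
      </lang-set>
    </concept>
    <concept ievref="101-12-01">
      <lang-set lang-id="fr">
        <term-name>information</term-name>
        <attribute>f</attribute>
        <pubdate>1998-04</pubdate>
        <source>ISO/CEI 2382-1, 01.01.01, 701-01-01 MOD</source>
      </lang-set>
    </concept>
  </subjectarea>
XML

# Hypothetical helper: return the lang-set elements for one IEV reference,
# optionally restricted to a single language code.
def find_lang_sets(area_xml, ievref, lang: nil)
  doc = REXML::Document.new(area_xml)
  doc.get_elements("//concept")
     .select { |concept| concept.attributes["ievref"] == ievref }
     .flat_map { |concept| concept.get_elements("lang-set") }
     .select { |set| lang.nil? || set.attributes["lang-id"] == lang }
end

french = find_lang_sets(AREA_XML, "101-12-01", lang: "fr")
puts french.first.elements["source"].text
```

Filtering on the client still pays the cost of the full area download, which is why a per-term endpoint would be preferable.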

  3. Grouping of concepts

The response currently returns a separate “concept” element for each language. However, the language versions of a term should be grouped under the same “concept”.

Currently it is:

<concept ievref="101-12-01">
<lang-set lang-id="en">
<term-name>information</term-name>
<definition>knowledge concerning objects, such as facts, events, things, processes, or ideas, including concepts, that within a certain context has a particular meaning</definition>
<pubdate>1998-04</pubdate>
<source>ISO/IEC 2382-1, 01.01.01, 701-01-01 MOD</source>
</lang-set>
</concept>

<concept ievref="101-12-01">
<lang-set lang-id="fr">
<term-name>information</term-name>
<attribute>f</attribute>
<definition>connaissance concernant un objet tel qu'un fait, un événement, une chose, un processus ou une idée, y compris une notion, et qui, dans un contexte déterminé, a une signification particulière</definition>
<pubdate>1998-04</pubdate>
<source>ISO/CEI 2382-1, 01.01.01, 701-01-01 MOD</source>
</lang-set>
</concept>
…

It would be better as:

<concept ievref="101-12-01">

<lang-set lang-id="en">
<term-name>information</term-name>
<definition>knowledge concerning objects, such as facts, events, things, processes, or ideas, including concepts, that within a certain context has a particular meaning</definition>
<pubdate>1998-04</pubdate>
<source>ISO/IEC 2382-1, 01.01.01, 701-01-01 MOD</source>
</lang-set>

<lang-set lang-id="fr">
<term-name>information</term-name>
<attribute>f</attribute>
<definition>connaissance concernant un objet tel qu'un fait, un événement, une chose, un processus ou une idée, y compris une notion, et qui, dans un contexte déterminé, a une signification particulière</definition>
<pubdate>1998-04</pubdate>
<source>ISO/CEI 2382-1, 01.01.01, 701-01-01 MOD</source>
</lang-set>

</concept>

…
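
The regrouping requested above can also be done on the client side. This is a sketch only, assuming the flat per-language structure shown under "Currently it is"; `group_concepts` is a hypothetical helper, and the sample XML keeps only the term name and attribute from the entries quoted above.

```ruby
require "rexml/document"

# Flat response: one <concept> element per language, as the API returns it.
FLAT_XML = <<~XML
  <subjectarea version="1.0beta" id="101">
    <concept ievref="101-12-01">
      <lang-set lang-id="en"><term-name>information</term-name></lang-set>
    </concept>
    <concept ievref="101-12-01">
      <lang-set lang-id="fr"><term-name>information</term-name><attribute>f</attribute></lang-set>
    </concept>
  </subjectarea>
XML

# Hypothetical helper: merge per-language concepts into
# { ievref => { lang-id => { term:, attribute: } } }.
def group_concepts(area_xml)
  grouped = Hash.new { |hash, ref| hash[ref] = {} }
  REXML::Document.new(area_xml).get_elements("//concept").each do |concept|
    ref = concept.attributes["ievref"]
    concept.get_elements("lang-set").each do |set|
      grouped[ref][set.attributes["lang-id"]] = {
        term: set.elements["term-name"]&.text,
        attribute: set.elements["attribute"]&.text
      }
    end
  end
  grouped
end

concepts = group_concepts(FLAT_XML)
puts concepts["101-12-01"].keys.inspect
```

Because the helper iterates over every lang-set inside each concept, it should handle the grouped layout proposed above as well as the current flat one.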

ronaldtse commented 6 years ago

The IEV database structure is defined in IEC Directives Supplement Annex SK (http://www.iec.ch/members_experts/refdocs/iec/isoiecdir-iecsup%7Bed11.0%7Den.pdf)

In the following descriptions, references are provided to the IEC Supplement, Annex SK, which gives the rules for the structure and content of the Electropedia data, e.g. "[SK.3.1.2]".
version is the version of the XML schema
subject area is the title of the subject area (or IEV part)
concept is a container for one language version of the concept
id is the number of the subject area (or IEV part) [SK.2.1.3; SK.2.1.5]
lang-id is the ISO alpha-2 language code [SK.2.1.4]
ievref is the reference of the concept in the Electropedia [SK.2.1.5]
<term-name> is the preferred term designating the concept [SK.3.1.3]
<attribute> contains any attributes to the term [SK.3.1.3.4.2, SK.3.1.3.5.5, SK.3.1.3.5.6, SK.3.1.3.6]
<symbol> contains any symbols representing the concept [SK.3.1.2, SK.3.1.3]
<synonyms> is a container; a concept can contain up to 3 synonyms. Each synonym has an id, and is defined by its name, its attribute and a status (Preferred, Admitted or Deprecated) [SK.3.1.3.4]
<definition> is the definition of the concept [SK.3.1.4]
<example> contains an example of the concept; it has an id, a label and content [SK.3.1.6]
<note> contains additional information that supplements the terminological data (e.g. information regarding the units applicable to a quantity, provisions relating to the use of a term, an explanation of the reasons for selecting an abbreviated form as preferred term). It has an id, a label, and content [SK.3.1.7]
<source> contains the source reference from which a concept has been repeated, together with information about any modifications made [SK.3.1.8]
<pubdate> is the publication date of the concept
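
One way to picture these fields in Ruby is a pair of plain Structs. This is only a sketch: `LangSet`, `Synonym` and `valid?` are hypothetical names, not part of the gem or of the IEC schema, and the check encodes just the two synonym constraints stated above (at most 3 synonyms, each with a status of Preferred, Admitted or Deprecated).

```ruby
# Statuses a synonym may carry, per the field description above.
SYNONYM_STATUSES = %w[Preferred Admitted Deprecated].freeze
MAX_SYNONYMS = 3

# Hypothetical value objects mirroring the described XML fields.
Synonym = Struct.new(:id, :name, :attribute, :status, keyword_init: true)

LangSet = Struct.new(
  :lang_id, :term_name, :attribute, :symbol, :synonyms, :definition,
  :examples, :notes, :source, :pubdate,
  keyword_init: true
) do
  # True when the synonym constraints described above hold.
  def valid?
    list = synonyms || []
    list.size <= MAX_SYNONYMS &&
      list.all? { |synonym| SYNONYM_STATUSES.include?(synonym.status) }
  end
end

# Values taken from the French 101-12-01 entry quoted earlier in this issue.
french = LangSet.new(
  lang_id: "fr", term_name: "information", attribute: "f",
  pubdate: "1998-04", source: "ISO/CEI 2382-1, 01.01.01, 701-01-01 MOD"
)
puts french.valid?
```

Fields that are absent from an entry (here `symbol`, `synonyms`, `examples` and `notes`) simply stay `nil`.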

ronaldtse commented 9 months ago

The "opendata-api.iec.ch" host is gone. We need to ask IEC for an alternative.