glossarist / iev-data

IEV data anomaly (20201217): SOURCE contains document titles #62

Open ronaldtse opened 3 years ago

ronaldtse commented 3 years ago
[RAW] ISO/IEC Guide 99:2007, <i>International vocabulary of metrology – Basic and general concepts and associated terms (VIM)</i>, 4.5, modified – The definition, and the term in English, have been replaced to adapt to usage in control technology.
{:source_ref=>"JCGM VIM",
 :clause=>"4.5",
 :relation_type=>
  {:type=>:modified,
   :modification=>
    "The definition, and the term in English, have been replaced to adapt to usage in control technology."}}

Processing term 351-57-01 (eng)... extract_source_clause 'ISO/IEC Guide 51:1999, <i>Safety aspects – Guidelines for their inclusion in standards</i>, 3.5'
[RAW] ISO/IEC Guide 51:1999, <i>Safety aspects – Guidelines for their inclusion in standards</i>, 3.5
{:source_ref=>"ISO/IEC Guide 51:1999",
 :clause=>"3.5",
 :relation_type=>{:type=>:identical}}

I don't think we're supposed to have document titles inside the SOURCE field.

ronaldtse commented 3 years ago

For example:

"ISO/IEC/IEEE 24765:2010, Systems and software engineering – Vocabulary, 3.234 (2), modified by omission of "type of" or other relevant words and addition of Notes 1 and 2”

Using the semantic approach I don’t think we need to maintain the document title since it is obtainable from the referred document itself.

We should create the following structure to accommodate this:

Sought clarification from IEC.

ronaldtse commented 3 years ago

From IEC:

It does not matter now, but [...] would use different terms for what I have put in blue.

We will have to change the terms "document", "clause", "relationship" and "modification" later on.

On the contrary, ISO 10241-1:2011 contains some examples of source references with titles

[images: image003, image005]

although I agree that for a standard, one would not normally include the title as described in ISO 10241-1:2011, 6.8, but this is only a recommendation: “The indication of the source should be in coded form and a link or reference to a standard bibliographic description provided.”

Meanwhile, since the rules do not prohibit the inclusion of a title, we should not either.

Maybe you should allow for a short form (without a title) and a long form (with a title) of an xref, where the short form is the default?

So we should allow for a short form (without title) and also a long form (with title, even in the case of a standard), depending on user preference.

Technically, we should allow entering references in ISO 690 format because ISO 10241-1 accepts only the ISO 690 bibliographic format...

This needs to be dealt with in the concept-model and Glossarist.

By the way, are you aware of the following rule? [image: image008]

This rule applies when a SOURCE only applies to a single language term. We will need to deal with it in Glossarist.
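
The short form and long form suggested above could be rendered along these lines. This is only a sketch: render_source and the keys :ref, :title and :clause are assumed names for illustration, not the schema used by iev-data or Glossarist.

```ruby
# Hypothetical helper: render a source reference in short or long form.
# The short form (no title) is the default, as suggested in the thread.
def render_source(source, form: :short)
  parts = [source[:ref]]
  # Include the title only in the long form, and only if one was recorded.
  parts << "<i>#{source[:title]}</i>" if form == :long && source[:title]
  parts << source[:clause] if source[:clause]
  parts.join(", ")
end

source = {
  ref: "ISO/IEC Guide 51:1999",
  title: "Safety aspects – Guidelines for their inclusion in standards",
  clause: "3.5",
}

puts render_source(source)               # short form
puts render_source(source, form: :long)  # long form, with the title
```

Keeping the title as its own field means the choice between the two forms can be left to the renderer and to user preference.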

ronaldtse commented 3 years ago

@skalee this means we need to retain the original title in these entries in the resulting data file and rendering. Can you help with this? Thanks.

skalee commented 3 years ago

@ronaldtse The question is how to detect that title. Anything between <i> and </i>, perhaps? Or anything that is not a ref, a clause, or a modification comment? The latter may be polluted with additional text.

Please also note that we already do that to some degree, as the original field contains the unparsed SOURCE column value. For example:

eng:
  id: 351-57-01
  authoritative_source:
  - ref: ISO/IEC Guide 51:1999
    clause: '3.5'
    link: https://www.iso.org/standard/32893.html
    relationship:
      type: identical
    original: ISO/IEC Guide 51:1999, <i>Safety aspects – Guidelines for their inclusion
      in standards</i>, 3.5

ronaldtse commented 3 years ago

@skalee <i> occurs in more places in the SOURCE than just titles (it is also used for symbols inside the "modified" note), but if we select only the <i>...</i> spans whose content is longer than 3 characters, it should work.

I don't know how many of the document titles provided are not enclosed in <i>, though.

Let's take a narrow approach that we only extract document titles, not the "anything that is not x or y" approach. Thanks!
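
A minimal sketch of that narrow approach, assuming the title is the first <i>...</i> span whose content is longer than 3 characters. extract_title and MIN_TITLE_LENGTH are illustrative names, not existing iev-data code, and the second input string below is made up for the example.

```ruby
# "Longer than 3 characters" means a length of at least 4.
MIN_TITLE_LENGTH = 4

# Return the first italic span long enough to be a title, or nil if none.
def extract_title(source)
  source.scan(%r{<i>(.*?)</i>}m)                      # each match is a one-element array
        .flatten
        .map { |text| text.gsub(/\s+/, " ").strip }   # undo line wrapping
        .find { |text| text.length >= MIN_TITLE_LENGTH }
end

raw = "ISO/IEC Guide 51:1999, <i>Safety aspects – Guidelines for their " \
      "inclusion in standards</i>, 3.5"
p extract_title(raw)

# Short italic spans, such as symbols in a "modified" note, are skipped.
p extract_title("IEC 60050-351, modified – symbol <i>x</i> added")
```

This does not catch titles that are not enclosed in <i>, and an italic phrase of four or more characters inside a modification note would be mistaken for a title if no real title precedes it.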

ronaldtse commented 3 years ago

@skalee how's this issue going? To move forward, let's collect the entries of the form "original: {docidentifier}, ... , {clause number}" and find out what the ... is. Then we can extract the title, probably as ref_title:.
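
One way to survey what the ... is, sketched under the assumption that each entry exposes its original string along with the parsed ref and clause, as in the YAML example earlier in the thread. middle_of is a hypothetical helper name.

```ruby
# Return the text between "{ref}," and ", {clause}" in the original string,
# or nil when the string does not have that shape.
def middle_of(original, ref, clause)
  text = original.gsub(/\s+/, " ") # undo YAML line folding
  pattern = /\A#{Regexp.escape(ref)},\s*(.*?),\s*#{Regexp.escape(clause)}(?=\z|[,\s])/
  match = text.match(pattern)
  match && match[1]
end

original = "ISO/IEC Guide 51:1999, <i>Safety aspects – Guidelines for their " \
           "inclusion in standards</i>, 3.5"
middle = middle_of(original, "ISO/IEC Guide 51:1999", "3.5")
p middle

# A candidate ref_title, with the italic markup stripped.
p middle.gsub(%r{</?i>}, "") if middle
```

Entries where the parsed ref differs from the text at the start of original (for example ref "JCGM VIM" for a SOURCE that starts with "ISO/IEC Guide 99:2007") return nil here and would need to be reviewed separately.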