glossarist / iev-data


Parentheses in TERMATTRIBUTES #137

Open skalee opened 3 years ago

skalee commented 3 years ago

Examples:

IEVREF           LANGUAGE         TERM                            TERMATTRIBUTE
---------------  ---------------  ------------------------------  ----------------------------------------
721-13-08        en               telewriter                      (2)
723-03-57        en               synchronized network            (sound broadcasting)
723-06-03        en               distortion of short duration    (of a picture)
723-06-69        en               crosstalk                       (in television)
723-08-28        en               luminous intensity              (of a source, in one direction)

How should values like these be interpreted? How should they be represented in the YAML files?

cc @ronaldtse

ronaldtse commented 3 years ago

I have sought clarification from IEC.

Note that for 721-13-08, "(2)" probably marks a second definition, because there is also 721-12-13, which is likewise "telewriter" but with TERMATTRIBUTE set to "(1)".

@skalee can you collect all the TERMATTRIBUTE values that contain parentheses and post them here? Thanks.

skalee commented 3 years ago

There are hundreds of them!

Also, the parentheses often contain something parseable, e.g. (adj) or (US). Obviously these must be handled separately.
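
For illustration, pulling the parenthesized groups out of a TERMATTRIBUTE value might look like the minimal Python sketch below. The function name `parenthesized_groups` is hypothetical and not part of iev-data, and the sketch assumes the groups are never nested, which has not been verified against the full data set.

```python
import re

# Matches one non-nested parenthesized group such as "(adj)" or "(of a picture)".
# Assumption: parentheses are never nested inside a TERMATTRIBUTE value.
PAREN_GROUP = re.compile(r"\(([^()]*)\)")


def parenthesized_groups(term_attribute: str) -> list[str]:
    """Return the inner text of every "(...)" group, in order of appearance."""
    return [inner.strip() for inner in PAREN_GROUP.findall(term_attribute)]
```

Applied to "(of a source, in one direction)" this yields a single group, so commas inside a group are kept as part of the text rather than treated as separators.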

ronaldtse commented 3 years ago

Hmmm. So I guess we have to differentiate the coded data vs plain text inside parentheses.

skalee commented 3 years ago

3000 or so: https://gist.github.com/skalee/711596f60b8afb2fe682019f2130bcab

> Hmmm. So I guess we have to differentiate the coded data vs plain text inside parentheses.

I guess everything that cannot be recognized should be stored in some kind of free-text field. Sadly, we'll most likely overlook some attributes that should be parsed and misinterpret them as free text.

ronaldtse commented 3 years ago

> I guess everything that cannot be recognized should be stored in some kind of free-text field. Sadly, we'll most likely overlook some attributes that should be parsed and misinterpret them as free text.

As long as we don't lose any data, it's good enough. I think we can do a pretty good job at recognising these 3000 entries.
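
A rough Python sketch of the approach discussed in this thread: recognize what can be recognized and keep everything else verbatim as free text, so that no data is lost. All names here (`KNOWN_CODES`, `Attribute`, `classify`) are hypothetical, the allow-list holds only the two codes mentioned above, and treating a bare number as a definition index is the unconfirmed reading of "(2)" pending clarification from IEC.

```python
import re
from typing import NamedTuple

# Hypothetical allow-list; only the two codes mentioned in this thread.
KNOWN_CODES = {"adj", "US"}


class Attribute(NamedTuple):
    kind: str   # "index", "code", or "free_text"
    value: str


def classify(term_attribute: str) -> Attribute:
    """Classify one TERMATTRIBUTE value, falling back to free text."""
    raw = term_attribute.strip()
    inner = raw[1:-1].strip() if raw.startswith("(") and raw.endswith(")") else raw
    if re.fullmatch(r"\d+", inner):
        # Assumption: a bare number distinguishes definitions of the same term.
        return Attribute("index", inner)
    if inner in KNOWN_CODES:
        return Attribute("code", inner)
    # Unrecognized values are kept exactly as given, so nothing is lost.
    return Attribute("free_text", raw)
```

With this fallback, a value that should have been parsed but was not recognized still survives unchanged in the free-text field and can be reclassified later once the list of codes grows.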