interscript / interscript-ruby

Interoperable script conversion systems (ISCS) with the `interscript` gem

Make the metadata format consistent across all systems #725

Closed webdev778 closed 11 months ago

webdev778 commented 3 years ago

Description

A survey of the existing systems shows that metadata properties, value types, and formats vary slightly from system to system. A few systems have special keys that the others lack: special_rules, original_description, original_notes, and implementation_notes. There are also two conventions for description. One is

description: {en: "Hello", ru: "Здраствуйте"}

another is

  description: "Hello"
  original_description: "Здраствуйте"

Further, the format of description itself is not consistent. Some systems use a block scalar and others a list:

  description: |
    The BGN/PCGN system for Armenian was designed for use in romanizing
    names written in the Armenian alphabet. The Roman letters and letter
    combinations shown as equivalents to the Armenian characters reflect
    the eastern variety of Armenian, i.e. the language spoken in the
    Republic of Armenia.

  description:
    - This system is commonly used for the transliteration of place names or
      person's names in Hong Kong, as pronounced in Cantonese. There will be more
      than one legitimate transliteration for the same syllable, or sometimes even
      for the same character. For example, the character 仔 can be transcribed as
      Chai or Tsai in this system. Some of the choice is context-depenedent (e.g.
      the same character in the place name 灣仔 is almost always Chai, but more likely
      to be Tsai elsewhere). There will be more variations and unpredictabilities in
      person's names, and these conventions need to be hard-coded.

which produce different results on the JavaScript side:

"description": "[\"This system is commonly used for the transliteration of place names or person's names in Hong Kong, as pronounced in Cantonese. There will be more than one legitimate transliteration for the same syllable, or sometimes even for the same character. For example, the character 仔 can be transcribed as Chai or Tsai in this system. Some of the choice is context-depenedent (e.g. the same character in the place name 灣仔 is almost always Chai, but more likely to be Tsai elsewhere). There will be more variations and unpredictabilities in person's names, and these conventions need to be hard-coded.\"]",

"description": "The BGN/PCGN system for Armenian was designed for use in romanizing\nnames written in the Armenian alphabet. The Roman letters and letter\ncombinations shown as equivalents to the Armenian characters reflect\nthe eastern variety of Armenian, i.e. the language spoken in the\nRepublic of Armenia.\n",

This inconsistency is causing several display issues in the interscript.org repo. It was originally noticed in PR #724, while exposing metadata.json to the JavaScript side.
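
For illustration, here is a minimal Ruby sketch of how the two YAML shapes can end up as the two JSON strings shown above. It assumes the exporter coerces every value with to_s before dumping to JSON; that is a guess at the mechanism, not the actual interscript code, and the export helper and the shortened descriptions are hypothetical.

```ruby
require "yaml"
require "json"

# Block-scalar form: loads as a String with embedded newlines.
block_scalar = YAML.safe_load(<<~YAML)
  description: |
    The BGN/PCGN system for Armenian was designed for use in romanizing
    names written in the Armenian alphabet.
YAML

# List form: loads as an Array containing one String.
list_form = YAML.safe_load(<<~YAML)
  description:
    - This system is commonly used for the transliteration of place names.
YAML

# Hypothetical export step (assumption): stringify each value, then dump JSON.
def export(metadata)
  metadata.transform_values(&:to_s).to_json
end

puts export(block_scalar) # description keeps its "\n" line breaks
puts export(list_form)    # description becomes a stringified array: "[\"...\"]"
```

Under that assumption, the list form reaches JavaScript as a string that merely looks like an array, which is why the two descriptions cannot be displayed the same way.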

the issue(37)

For example, the mext-jpn-Hrkt-Latn-1954 map uses a different format for the description property than the others:

  description:
    jp: |
      国語を書き表わす場合に用いるローマ字のつづり方を次のように定める。

      まえがき
      1 一般に国語を書き表わす場合は、第1表に掲げたつづり方によるものとする。
      2 国際的関係その他従来の慣例をにわかに改めがたい事情にある場合に限り、第2表に掲げたつづり方によつてもさしつかえない。
      3 前二項のいずれの場合においても、おおむねそえがきを適用する。
    en: |
      The spelling method for Roman characters used when writing Japanese language is as follows.

      Preface
      1. In general, when the language is written, the spelling shown in Table 1 shall be used.
      2. The spelling methods listed in Table 2 can be used only when there is a situation that is difficult to change due to international relations or other conventional practices.
      3. In either case of the preceding two paragraphs, the general introduction will apply.

while hk-yue-Hani-Latn-1888 uses a list:

  description:
    - This system is commonly used for the transliteration of place names or
      person's names in Hong Kong, as pronounced in Cantonese. There will be more
      than one legitimate transliteration for the same syllable, or sometimes even
      for the same character. For example, the character 仔 can be transcribed as
      Chai or Tsai in this system. Some of the choice is context-depenedent (e.g.
      the same character in the place name 灣仔 is almost always Chai, but more likely
      to be Tsai elsewhere). There will be more variations and unpredictabilities in
      person's names, and these conventions need to be hard-coded.

What's the Standard?

Here is a draft standard, open to discussion:

  STANDARD_STRING_KEYS = %i{authority_id id
  language source_script destination_script
  name url creation_date adoption_date description
  character source confirmation_date}

  STANDARD_ARRAY_KEYS = %i{notes}

  NONSTANDARD_KEYS = %i{special_rules original_description original_notes
    implementation_notes}

  NECESSARY_KEYS = %i{name language source_script destination_script}
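
To make the draft concrete, here is a rough sketch of a checker built on those key sets. It is not the actual interscript validator: the MetadataCheck module and its warnings method are hypothetical, the message wording is modeled on the report in this issue, and it assumes that metadata keys are symbols and that one warning is emitted per non-String element in notes.

```ruby
# Hypothetical checker; the key sets are copied from the draft standard above.
module MetadataCheck
  STRING_KEYS = %i[authority_id id language source_script destination_script
                   name url creation_date adoption_date description
                   character source confirmation_date].freeze
  ARRAY_KEYS = %i[notes].freeze
  NECESSARY_KEYS = %i[name language source_script destination_script].freeze

  # Returns an Array of warning strings for one system's metadata Hash.
  def self.warnings(system_id, metadata)
    out = []

    NECESSARY_KEYS.each do |key|
      next if metadata.key?(key)
      out << "[#{system_id}] Necessary key #{key} wasn't defined. Defaulting to an empty string"
    end

    metadata.each do |key, value|
      if STRING_KEYS.include?(key)
        next if value.is_a?(String)
        out << "[#{system_id}] Metadata key #{key} expects a String, but #{value.class} was given"
      elsif ARRAY_KEYS.include?(key)
        if value.is_a?(Array)
          # Assumption: one warning per offending element.
          value.each do |item|
            next if item.is_a?(String)
            out << "[#{system_id}] Metadata key #{key} expects all Array elements to be String"
          end
        else
          # This message does not appear in the report; added for completeness.
          out << "[#{system_id}] Metadata key #{key} expects an Array, but #{value.class} was given"
        end
      else
        # Assumption: any key outside the two standard sets counts as non-standard.
        out << "[#{system_id}] Metadata key #{key} is non-standard"
      end
    end

    out
  end
end

# Made-up metadata for a made-up system id, purely for demonstration.
meta = { language: "und", source_script: "Latn", destination_script: "Latn",
         description: ["a list instead of a string"], special_rules: "x" }
puts MetadataCheck.warnings("example-system", meta)
```

With this input the sketch reports the missing name, the Array-valued description, and the non-standard special_rules key.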

Problematic Systems

Fortunately, only a small number of systems have problems:

[az-aze-Cyrl-Latn-1939] Necessary key name wasn't defined. Defaulting to an empty string
[az-aze-Cyrl-Latn-1958] Necessary key name wasn't defined. Defaulting to an empty string
[bgnpcgn-fas-Arab-Latn-1956] Metadata key special_rules is non-standard
[ua-ukr-Cyrl-Latn-2010] Metadata key url expects a String, but Array was given
[iso-ell-Grek-Latn-843-1997-t1] Metadata key notes expects all Array elements to be String
[elot-ell-Grek-Latn-743-2001-tl] Metadata key notes expects all Array elements to be String
[iso-ell-Grek-Latn-843-1997-t2] Metadata key notes expects all Array elements to be String
[elot-ell-Grek-Latn-743-2001-ts] Metadata key notes expects all Array elements to be String
[hk-yue-Hani-Latn-1888] Metadata key description expects a String, but Array was given
[iso-ara-Arab-Latn-233-1984] Metadata key url expects a String, but Array was given
[mext-jpn-Hrkt-Latn-1954] Metadata key description expects a String, but Hash was given
[mext-jpn-Hrkt-Latn-1954] Metadata key notes expects all Array elements to be String
[mext-jpn-Hrkt-Latn-1954] Metadata key notes expects all Array elements to be String
[mext-jpn-Hrkt-Latn-1954] Metadata key notes expects all Array elements to be String
[mext-jpn-Hrkt-Latn-1954] Metadata key notes expects all Array elements to be String
[mext-jpn-Hrkt-Latn-1954] Metadata key notes expects all Array elements to be String
[mext-jpn-Hrkt-Latn-1954] Metadata key notes expects all Array elements to be String
[lshk-yue-Hani-Latn-jyutping-1993] Metadata key description expects a String, but Array was given
[sasm-mon-Mong-Latn-general-1978] Metadata key original_description is non-standard
[sasm-mon-Mong-Latn-general-1978] Metadata key original_notes is non-standard
[sasm-mon-Mong-Latn-general-1978] Metadata key implementation_notes is non-standard
[sasm-mon-Mong-Latn-phonetic-1978] Metadata key original_description is non-standard
[sasm-mon-Mong-Latn-phonetic-1978] Metadata key original_notes is non-standard
[sasm-mon-Mong-Latn-phonetic-1978] Metadata key implementation_notes is non-standard
[un-ell-Grek-Latn-1987-phonetic] Metadata key notes expects all Array elements to be String
[un-ell-Grek-Latn-1987-phonetic] Metadata key notes expects all Array elements to be String
[un-ell-Grek-Latn-1987-phonetic] Metadata key notes expects all Array elements to be String
[un-ell-Grek-Latn-1987-phonetic] Metadata key notes expects all Array elements to be String
[un-mon-Mong-Latn-general-2013] Metadata key implementation_notes is non-standard
[un-mon-Mong-Latn-phonetic-2013] Metadata key implementation_notes is non-standard

ronaldtse commented 3 years ago

We should develop a data model in LutaML for this: https://github.com/lutaml/lutaml-uml

i.e.

  STANDARD_STRING_KEYS = %i{authority_id id
  language source_script destination_script
  name url creation_date adoption_date description
  character source confirmation_date}

  STANDARD_ARRAY_KEYS = %i{notes}

  NONSTANDARD_KEYS = %i{special_rules original_description original_notes
    implementation_notes}

  NECESSARY_KEYS = %i{name language source_script destination_script}

=> in LutaML syntax:

class SystemMetadata {
  definition {
    Describes metadata of a script conversion system.
  }

  authority_id: String {
    definition {
      Authority identifier.
    }
  }
  id: String {
    definition {
      Identifier of this system.
    }
  }
  language: Iso639Code[0..*] {
    definition {
      Language that the system processes expressed as an ISO 639 code.
    }
  }
  sourceScript: Iso15924Code {
    definition {
      Script system of the input text, expressed as an ISO 15924 code.
    }
  }
  destinationScript: Iso15924Code {
    definition {
      Script system of the output text, expressed as an ISO 15924 code.
    }
  }
  name: LocalizedStrings {
    definition {
      Name of the system.
    }
  }
  url: String {
    definition {
      URL of the source document.
    }
  }
  creationDate: Iso8601Date {
    definition {
      Date on which this system was first created.
    }
  }
  adoptionDate: Iso8601Date {
    definition {
      Date on which this system was adopted.
    }
  }
  confirmationDate: Iso8601Date {
    definition {
      Date on which this system was last confirmed.
    }
  }
  description: LocalizedStrings[0..*] {
    definition {
      Description of this system.
    }
  }
  notes: LocalizedStrings[0..*] {
    definition {
      Notes about this system.
    }
  }

  originalDescription: LocalizedStrings[0..*] {
    definition {
      Description from source document.
    }
  }

  originalNotes: LocalizedStrings[0..*] {
    definition {
      Notes from source document.
    }
  }

  implementationNotes: LocalizedStrings[0..*] {
    definition {
      Implementation notes.
    }
  }

  // TODO: what are these?
  character: String
  source: String
  specialRules: String
}

class LocalizedStrings {
  definition {
    String in multiple languages.
  }
  content: LocalizedString[0..*]
}

class LocalizedString {
  definition {
    String in a particular language.
  }
  languageCode: Iso639Code
  scriptCode: Iso15924Code
  string: String
}
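
As a rough idea of how the LocalizedString part of this model could absorb the description styles seen in this issue (plain string, per-language hash, list), here is a small Ruby sketch. The localized_strings helper, the Struct, and the default language "en" are assumptions for illustration, not part of interscript or LutaML; script_code is left nil because none of the existing formats carry it.

```ruby
# Hypothetical Ruby counterpart of the LocalizedString class above.
LocalizedString = Struct.new(:language_code, :script_code, :string, keyword_init: true)

# Normalizes a String, a {language => text} Hash, or an Array of either
# into a flat Array of LocalizedString values.
def localized_strings(value, default_language: "en")
  case value
  when String
    [LocalizedString.new(language_code: default_language, script_code: nil, string: value)]
  when Hash
    value.map do |lang, text|
      LocalizedString.new(language_code: lang.to_s, script_code: nil, string: text)
    end
  when Array
    value.flat_map { |item| localized_strings(item, default_language: default_language) }
  else
    raise ArgumentError, "unsupported metadata value: #{value.class}"
  end
end

p localized_strings("Hello").map(&:language_code)                       # => ["en"]
p localized_strings({ en: "Hello", ru: "Привет" }).map(&:language_code) # => ["en", "ru"]
```

Normalizing at load time like this would let the JavaScript side receive one shape for description regardless of how the map author wrote it.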

ronaldtse commented 2 years ago

@webdev778 can you help fix/convert the erroneous metadata format of those systems listed? Thanks!

webdev778 commented 11 months ago

This issue has been fixed by the commit interscript/maps@78ca96cfd907a208bb3e91400ce1bfb1372804a2

and this commit https://github.com/interscript/interscript-ruby/commit/873318113627bc577fec2ffafd37d5b536e59420