interscript / interscript-ruby

Interoperable script conversion systems (ISCS) with the `interscript` gem

Investigate Interscript maps for ISO 24229 spelling system codes #687

Open ronaldtse opened 3 years ago

ronaldtse commented 3 years ago

ISO 24229 was the basis for Interscript system codes.

With the progress of Interscript we are now aware that the original code "system", i.e. the pattern "{authority code}-{lang}-{source script}-{target-script}-{id}", doesn't quite work.
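To make the limitation concrete, here is a minimal sketch that splits a system code into the five fields of that pattern. `SystemCode` and `parse_system_code` are hypothetical names used only for illustration; they are not part of the `interscript` gem.

```ruby
# Hypothetical helper (not part of the interscript gem): splits a system code
# into the five fields of the original pattern.
SystemCode = Struct.new(:authority, :language, :source_script, :target_script, :id)

def parse_system_code(code)
  # Limit the split to 5 parts so that ids containing hyphens stay intact.
  SystemCode.new(*code.split("-", 5))
end

code = parse_system_code("bis-ben-Beng-Latn-13194-1991")
puts code.language       # the single language slot
puts code.target_script  # the single target script slot
puts code.id
```

Each code carries exactly one language and one target script, so a system covering several languages, or two systems that both emit Latn with different diacritics, cannot be told apart by these fields alone.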

We now know there are systems that span multiple languages, or support different character sets (same script code, e.g. Latn, but different characters, e.g. diacritics).

So we need to introduce a new "spelling system code" for all languages and transliteration systems.

We need to:

The input and output spelling systems can be identified by the defined character sets (in an interscript map, the rule keys are the input character set, the rule values are the output character set).
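As a rough illustration of that idea, the sketch below derives both sets from a plain hash of rules. The method name `character_sets` and the inline `rules` hash are assumptions made for the example; a real map would be loaded from its YAML file.

```ruby
# A minimal sketch: `rules` stands in for the rules of one map,
# with source strings as keys and target strings as values.
def character_sets(rules)
  {
    input: rules.keys.uniq,             # rule keys form the input character set
    output: rules.values.compact.uniq   # rule values form the output character set
  }
end

# Sample rules in the style of a Cyrillic-to-Latin map.
rules = { "Ё" => "E", "Е" => "E", "Э" => "E", "Ч" => "CH" }
sets = character_sets(rules)
puts sets[:input].join(",")
puts sets[:output].join(",")
```

Several input characters may share one output value, so the output set can be smaller than the input set.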

This task is to do this.

First, we need to make a script to define the character input and output sets of the interscript systems.

bilashsaha commented 3 years ago

@ronaldtse I have written a script to extract the character sets of all the mappings we have. The script is as follows:

#!/usr/bin/env ruby

require 'yaml'

def process_inheritance(inheritance,input_character_set=[],output_character_set=[])
   parent_map_files = inheritance.kind_of?(Array) ? inheritance : [inheritance]
   parent_map_files.each do |map_file|
      parent_map_file = YAML.load_file("maps/#{map_file}.yaml")
      input_character_set << parent_map_file["map"]["characters"]&.keys
      output_character_set << parent_map_file["map"]["characters"]&.values
      unless parent_map_file["map"]["inherit"].nil?
         process_inheritance(parent_map_file["map"]["inherit"],input_character_set,output_character_set)
      end
   end
   return input_character_set, output_character_set
end

maps = Dir["maps/*.yaml"]
output = []

maps.each do |system_file|
   map_file = YAML.load_file(system_file)
   output_hash = {}
   output_hash["mapping_name"] = File.basename(system_file, ".yaml")
   output_hash["id"] = map_file["id"]
   output_hash["authority_id"] = map_file["authority_id"]
   output_hash["language"] = map_file["language"]
   output_hash["source_script"] = map_file["source_script"]
   output_hash["destination_script"] = map_file["destination_script"]
   output_hash["input_character_set"] = map_file["map"]["characters"]&.keys || []
   output_hash["output_character_set"] = map_file["map"]["characters"]&.values || []

   unless map_file["map"]["inherit"].nil?
      input_character_set,output_character_set = process_inheritance(map_file["map"]["inherit"])
      output_hash["input_character_set"] << input_character_set
      output_hash["output_character_set"] << output_character_set
   end

   output_hash["input_character_set"] = output_hash["input_character_set"].flatten.join(',')
   output_hash["output_character_set"] = output_hash["output_character_set"].flatten.join(',')
   output << output_hash
end

output.sort_by!{|map| map["source_script"]}

File.open('character-set/all_mapping.yaml', 'w') {|f| f.write output.to_yaml }

The output of this script is as follows:


- mapping_name: un-ben-Beng-Latn-2016
  id: 2016
  authority_id: un
  language: iso-639-2:ben
  source_script: Beng
  destination_script: Latn
  input_character_set: অ,আ,ই,ঈ,উ,ঊ,ঋ,এ,ঐ,ও,ঔ,\u09be,\u09bf,\u09c0,\u09c1,\u09c2,\u09c3,\u09c7,\u09c8,\u09cb,\u09cc,গু,রু,শু,হু,ন্তু,স্তু,রূ,হৃ,\u0982,\u0981,\u0983,\u09cd\u200c,ৎ,ক,খ,গ,ঘ,ঙ,চ,ছ,জ,ঝ,ঞ,ট,ঠ,ড,ঢ,ণ,ত,থ,দ,ধ,ন,প,ফ,ব,ভ,ম,য,র,ল,শ,ষ,স,হ,ড়,ঢ়,য়,ক্ক,ক্ট,ক্ত,ক্ন,ক্ম,ক্র,ক্ল,ক্ব,ক্ষ,ক্ষ্ন,ক্ষ্ম,ক্ষ্ব,ক্স,গ্গ,গ্দ,গ্ধ,গ্ন,গ্ম,গ্র,গ্ল,ঘ্র,ঙ্ক,ঙ্গ,চ্চ,চ্ছ,চ্ছ্ব,চ্ঞ,জ্জ,জ্জ্ব,জ্ঝ,জ্ঞ,জ্ব,ঞ্চ,ঞ্ছ,ঞ্জ,ঞ্ঝ,ট্ট,ড্ড,ণ্ট,ণ্ঠ,ণ্ড,ত্ত,ত্ত্ব,ত্থ,ত্ন,ত্ম,ত্র,ত্ল,ত্ব,দ্দ,দ্দ্ব,দ্ধ,দ্ধ্ব,দ্ন,দ্ব,দ্ভ,দ্ম,দ্র,দ্ল,ধ্র,ন্ঠ,ন্ড,ন্ক,ন্ত,ন্ত্র,ন্থ,ন্দ,ন্দ্র,ন্ধ,ন্ন,ন্ম,ন্ব,প্ন,প্ত,প্প,প্র,প্ল,ফ্র,ব্জ,ব্দ,ব্ধ,ব্ব,ব্র,ভ্র,ম্প,ম্ব,ম্ভ,ম্ভ্র,ম্ম,ম্র,ম্ল,ল্ক,ল্ট,ল্ড,ল্ম,ল্ল,শ্চ,শ্ছ,শ্ত,শ্ন,শ্ম,শ্র,শ্ল,শ্ব,ষ্ক,ষ্ট,ষ্ট্র,ষ্ঠ,ষ্ঞ,ষ্প,ষ্ফ,স্ক,স্ক্র,স্খ,স্ত,স্ন,স্ম,স্র,স্ব,হ্ন,হ্ম,হ্র,হ্ল,ক্ট্র,ক্ত্র,ক্য,ক্ষ্ণ,খ্য,খ্র,গ্‌ণ,গ্ধ্য,গ্ধ্র,গ্ন্য,গ্ব,গ্য,গ্র্য,ঘ্ন,ঘ্য,ঙ্‌ক্ত,ঙ্ক্য,ঙ্ক্ষ,ঙ্খ,ঙ্গ্য,ঙ্ঘ,ঙ্ঘ্য,ঙ্ঘ্র,ঙ্ম,চ্ছ্র,চ্ব,চ্য,জ্য,জ্র,ট্ব,ট্ম,ট্য,ট্র,ড্ব,ড্য,ড্র,ড়্গ,ঢ্য,ঢ্র,ণ্ঠ্য,ণ্ড্য,ণ্ড্র,ণ্ঢ,ণ্ণ,ণ্ব,ণ্ম,ণ্য,ৎক,ত্ত্য,ত্ম্য,ত্য,ত্র্য,ৎল,ৎস,থ্ব,থ্য,থ্র,দ্গ,দ্ঘ,দ্ভ্র,দ্য,দ্র্য,ধ্ন,ধ্ব,ধ্ম,ধ্য,ন্ট,ন্ট্র,ন্ড্র,ন্ত্ব,ন্ত্য,ন্ত্র্য,ন্থ্র,ন্দ্য,ন্দ্ব,ন্ধ্য,ন্ধ্র,ন্য,প্ট,প্য,প্র্য,প্স,ফ্ল,ব্য,ব্ল,ভ্ব,ভ্য,ম্ন,ম্প্র,ম্ফ,ম্ব্র,ম্য,য্য,র্ক,র্ক্য,র্গ্য,র্ঘ্য,র্চ্য,র্জ্য,র্ণ্য,র্ত্য,র্থ্য,র্ব্য,র্ম্য,র্শ্য,র্ষ্য,র্হ্য,র্খ,র্গ,র্গ্র,র্ঘ,র্চ,র্ছ,র্জ,র্ঝ,র্ট,র্ড,র্ণ,র্ত,র্ত্র,র্থ,র্দ,র্দ্ব,র্দ্র,র্ধ,র্ধ্ব,র্ন,র্প,র্ফ,র্ভ,র্ম,র্য,র্ল,র্শ,র্শ্ব,র্ষ,র্স,র্হ,র্ঢ্য,ল্ক্য,ল্গ,ল্প,ল্‌ফ,ল্ফ,ল্ব,ল্‌ভ,ল্য,শ্য,ষ্ক্র,ষ্ট্য,ষ্ঠ্য,ষ্ণ,ষ্প্র,ষ্ব,ষ্ম,ষ্য,স্ট,স্ট্র,স্ত্ব,স্ত্য,স্ত্র,স্থ,স্থ্য,স্প,স্প্র,স্প্‌ল,স্ফ,স্য,স্ল,হ্ণ,হ্ব,হ্য
  output_character_set: a,ā,i,ī,u,ū,ṛ,e,ai,o,au,ā,i,ī,u,ū,ṛ,e,ai,o,au,gu,ru,shu,hu,ntu,stu,rū,hṛ,ṁ,m̐,ḥ,,t,ka,kha,ga,gha,ṅa,cha,chha,ja,jha,ña,ṭa,ṭha,ḍa,ḍha,ṇa,ta,tha,da,dha,na,pa,pha,ba,bha,ma,j̱aA,ra,la,sha,ṣha,sa,ha,ṙa,ṙha,ya,kka,kṭa,kta,kna,kma,kra,kla,kva,kṣha,kṣhna,kṣma,kṣhva,ksa,gga,gda,gdha,gna,gma,gra,gla,ghra,ṅka,ṅga,chcha,chchha,chchhva,chña,jja,jjva,jjha,jña,jva,ñcha,ñchha,ñja,ñjha,ṭṭa,ḍḍa,ṇṭa,ṇṭha,ṇḍa,tta,ttva,ttha,tna,tma,tra,tla,tva,dda,ddva,ddha,ddhva,dna,dva,dbha,dma,dra,dla,dhra,nṭha,nḍa,nka,nta,ntra,ntha,nda,ndra,ndha,nna,nma,nva,pna,pta,ppa,pra,pla,phra,bja,bda,bdha,bba,bra,bhra,mpa,mba,mbha,mbhra,mma,mra,mla,lka,lṭa,lḍa,lma,lla,shcha,shchha,shta,shna,shma,shra,shla,shva,ṣhka,ṣhṭa,ṣhṭra,ṣhṭha,ṣhña,ṣhpa,ṣhpha,ska,skra,skha,sta,sna,sma,sra,sva,hna,hma,hra,hla,kṭra,ktra,kya,kṣṇa,khaj̱a,khra,gṇa,gdhya,gdhra,gnya,gva,gya,grya,ghna,ghya,ṅkata,ṅkaya,ṅkṣa,ṅkha,ṅgaya,ṅgha,ṅghya,ṅghra,ṅma,cchra,cva,cya,jya,jra,ṭva,ṭma,ṭya,ṭra,ḍva,ḍya,ḍra,ḍga,ḍhya,ḍhra,ṇṭhya,ṇḍya,ṇḍra,ṇḍha,ṇṇa,ṇva,ṇma,ṇya,tka,ttya,tmya,tya,trya,tla,tsa,thva,thya,thra,dga,dgha,dbhra,dya,draya,dhna,dhva,dhma,dya,nṭa,nṭra,nḍra,ntva,ntaya,ntraya,nthra,ndya,ndva,ndhya,ndhra,nya,pṭa,pya,praya,psa,phla,bya,bla,bhva,bhya,mna,mpra,mpha,mvra,mya,j̱aya,rka,rkya,rgya,rghya,rchya,rjya,rṇya,rtya,rthya,rvya,rmya,rshya,rṣhya,rhya,rkha,rga,rgra,rgha,rcha,rchha,rja,rjha,rṭa,rḍa,rṇa,rta,rtra,rtha,rda,rdva,rdra,rdha,rdhba,rna,rpa,rpha,rbha,rma,rya,rla,rsha,rshba,rṣha,rsa,rha,rḍhya,lkaya,lga,lpa,lpha,lpha,lba,lbha,lya,sya,ṣkra,ṣṭya,ṣṭhya,ṣṇa,ṣpra,ṣva,ṣma,ṣya,sṭa,sṭra,stva,stṣya,stra,stha,sthya,spa,spra,spala,spha,sya,sla,hṇa,hva,hya
- mapping_name: bis-ben-Beng-Latn-13194-1991
  id: 1991
  authority_id: bis
  language: iso-639-2:ben
  source_script: Beng
  destination_script: Latn
  input_character_set: অ,আ,ই,ঈ,উ,ঊ,ৠ,ঌ,এ,ঐ,ও,ঔ,ক,খ,গ,ঘ,ঙ,চ,ছ,জ,ঝ,ঞ,ট,ঠ,ড,ড়,ঢ,ঢ়,ণ,ত,ৎ,থ,দ,ধ,ন,প,ফ,ব,ভ,ম,য,য়,য়,র,ল,শ,ষ,স,হ,ঁ,ঃ
    ,ং,\u09be,\u09bf,\u09c0,\u09c1,\u09c2,\u09c3,\u09c7,\u09c8,\u09cb,\u09cc,\u09CD,्,़,।,‍
  output_character_set: a,ā,i,ī,u,ū,ṛ,ḻ,ē,ai,ŏ,au,k,kh,g,gh,ṅ,c,ch,j,jh,ñ,ṭ,ṭh,ḍ,d̂,ḍh,d̂h,ṇ,t,t,th,d,dh,n,p,ph,b,bh,m,y,ẏ,ẏ,r,l,ś,ṣ,s,h,m,ḥ,ṃ,ā,i,ī,u,ū,ṛ,ē,ai,ŏ,au,,,,.,
- mapping_name: un-asm-Beng-Latn-1972
  id: 1972
  authority_id: un
  language: iso-639-2:ben
  source_script: Beng
  destination_script: Latn
  input_character_set: অ,আ,ই,ঈ,উ,ঊ,ঋ,এ,ঐ,ও,ঔ,\u09be,\u09bf,\u09c0,\u09c1,\u09c2,\u09c3,\u09c7,\u09c8,\u09cb,\u09cc,\u0982,\u0981,\u0983,\u09cd,ক,খ,গ,ঘ,ঙ,চ,ছ,জ,ঝ,ঞ,ট,ঠ,ড,ঢ,ণ,ত,থ,দ,ধ,ন,প,ফ,ব,ভ,ম,য,ৰ,ল,ৱ,শ,ষ,স,হ,ৎ,ড়,ঢ়,য়,য়,ড়,ঢ়
  output_character_set: a,ā,i,ī,u,ū,ṛ,e,ai,o,au,ā,i,ī,u,ū,ṛ,e,ai,o,au,ṁ,m̐,ḥ,,ka,kha,ga,gha,ṅa,cha,chha,ja,jha,ña,ṭa,ṭha,ḍa,ḍha,ṇa,ta,tha,da,dha,na,pa,pha,ba,bha,ma,j̱a,ra,la,va,sha,ṣha,sa,ha,t,ṙa,ṙha,ya,ya,ṙa,ya
- mapping_name: iso-pli-Beng-Latn-15919-2001
  id: 15919-2001
  authority_id: iso
  language: iso-639-2:pli
  source_script: Beng
  destination_script: Latn
  input_character_set: ৰ,ৱ,অ,আ,ই,ঈ,উ,ঊ,ঋ,ৠ,ঌ,ৡ,এ,ঐ,ও,ঔ,া,ি,ী,ু,ূ,ৃ,ৄ,ৢ,ৣ,ে,ৈ,ো,ৌ,ং,ঁ,ঃ,ক,খ,গ,ঘ,ঙ,চ,ছ,জ,ঝ,ঞ,ট,ঠ,ড,ঢ,ণ,ত,থ,দ,ধ,ন,প,ফ,ব,ভ,ম,য,র,ল,শ,ষ,স,হ,ড়,ঢ়,য়,যঁ,রঁ,লঁ,বঁ,ৎ,্,১,২,৩,৪,৫,৬,৭,৮,৯,০
  output_character_set: va,ra,a,ā,i,ī,u,ū,ṛ,ṝ,ḻ,ḹ,e,ai,o,au,ā,i,ī,u,ū,ṛ,ṝ,ḻ,ḹ,e,ai,o,au,ṁ,m̐,ḥ,ka,kha,ga,gha,ṅa,ca,cha,ja,jha,ña,ṭa,ṭha,ḍa,ḍha,ṇa,ta,tha,da,dha,na,pa,pha,ba,bha,ma,ya,ra,la,śa,ṣa,sa,ha,ṙa,ṙha,ẏa,m̐ya,m̐ra,m̐la,m̐va,t,,1,2,3,4,5,6,7,8,9,0
- mapping_name: iso-ben-Beng-Latn-15919-2001
  id: 15919-2001
  authority_id: iso
  language: iso-639-2:ben
  source_script: Beng
  destination_script: Latn
  input_character_set: অ,আ,ই,ঈ,উ,ঊ,ঋ,ৠ,ঌ,ৡ,এ,ঐ,ও,ঔ,া,ি,ী,ু,ূ,ৃ,ৄ,ৢ,ৣ,ে,ৈ,ো,ৌ,ং,ঁ,ঃ,ক,খ,গ,ঘ,ঙ,চ,ছ,জ,ঝ,ঞ,ট,ঠ,ড,ঢ,ণ,ত,থ,দ,ধ,ন,প,ফ,ব,ভ,ম,য,র,ল,শ,ষ,স,হ,ড়,ঢ়,য়,যঁ,রঁ,লঁ,বঁ,ৎ,্,১,২,৩,৪,৫,৬,৭,৮,৯,০
  output_character_set: a,ā,i,ī,u,ū,ṛ,ṝ,ḻ,ḹ,e,ai,o,au,ā,i,ī,u,ū,ṛ,ṝ,ḻ,ḹ,e,ai,o,au,ṁ,m̐,ḥ,ka,kha,ga,gha,ṅa,ca,cha,ja,jha,ña,ṭa,ṭha,ḍa,ḍha,ṇa,ta,tha,da,dha,na,pa,pha,ba,bha,ma,ya,ra,la,śa,ṣa,sa,ha,ṙa,ṙha,ẏa,m̐ya,m̐ra,m̐la,m̐va,t,,1,2,3,4,5,6,7,8,9,0
- mapping_name: alalc-ben-Beng-Latn-1997
  id: 1997
  authority_id: alalc
  language: iso-639-2:ben
  source_script: Beng
  destination_script: Latn
  input_character_set: অ,আ,ই,ঈ,উ,ঊ,এ,ঐ,ও,ঔ,ঋ,ৠ,ঌ,ক,খ,গ,ঘ,ঙ,চ,ছ,জ,ঝ,ঞ,ট,ঠ,ড,ড়,ড়,ঢ,ঢ়,ণ,ত,ৎ,থ,দ,ধ,ন,প,ফ,ব,ভ,ম,য,য়,য়,র,ল,শ,ষ,স,হ,ং,ঃ,\u0981,ऽ,\u09be,\u09bf,\u09c0,\u09c1,\u09c2,\u09c3,\u09c7,\u09c8,\u09cb,\u09cc,\u09cd,০,১,২,৩,৪,৫,৬,৭,৮,৯
  output_character_set: a,ā,i,ī,u,ū,e,ai,o,au,ṛ,ṝ,ḹ,ka,kha,ga,gha,ṅa,ca,cha,ja,jha,ña,ṭa,ṭha,ḍa,ṛa,ṛa,ḍha,ṛha,ṇa,ta,ṯa,tha,da,dha,na,pa,pha,ba,bha,ma,ya,ẏa,ẏa,ra,la,śa,sha,sa,ha,ṃ,ḥ,n̐,’,ā,i,ī,u,ū,ṛ,e,ai,o,au,,0,1,2,3,4,5,6,7,8,9
- mapping_name: iso-asm-Beng-Latn-15919-2001
  id: 15919-2001
  authority_id: iso
  language: iso-639-2:asm
  source_script: Beng
  destination_script: Latn
  input_character_set: ৰ,ৱ,অ,আ,ই,ঈ,উ,ঊ,ঋ,ৠ,ঌ,ৡ,এ,ঐ,ও,ঔ,া,ি,ী,ু,ূ,ৃ,ৄ,ৢ,ৣ,ে,ৈ,ো,ৌ,ং,ঁ,ঃ,ক,খ,গ,ঘ,ঙ,চ,ছ,জ,ঝ,ঞ,ট,ঠ,ড,ঢ,ণ,ত,থ,দ,ধ,ন,প,ফ,ব,ভ,ম,য,র,ল,শ,ষ,স,হ,ড়,ঢ়,য়,যঁ,রঁ,লঁ,বঁ,ৎ,্,১,২,৩,৪,৫,৬,৭,৮,৯,০
  output_character_set: va,ra,a,ā,i,ī,u,ū,ṛ,ṝ,ḻ,ḹ,e,ai,o,au,ā,i,ī,u,ū,ṛ,ṝ,ḻ,ḹ,e,ai,o,au,ṁ,m̐,ḥ,ka,kha,ga,gha,ṅa,ca,cha,ja,jha,ña,ṭa,ṭha,ḍa,ḍha,ṇa,ta,tha,da,dha,na,pa,pha,ba,bha,ma,ya,ra,la,śa,ṣa,sa,ha,ṙa,ṙha,ẏa,m̐ya,m̐ra,m̐la,m̐va,t,,1,2,3,4,5,6,7,8,9,0
- mapping_name: icao-ukr-Cyrl-Latn-9303
  id: 9303
  authority_id: icao
  language: iso-639-2:ukr
  source_script: Cyrl
  destination_script: Latn
  input_character_set: "',А,Б,Д,Ё,Е,Э,Ф,Г,И,Й,К,Л,М,Н,О,П,Р,С,Т,У,В,Ы,З,Ч,Я,Ю,Х,Ш,Щ,Ц,Ж,Ґ,Ў,Ѫ,Ђ,Ѕ,Ј,Љ,Њ,Һ,Џ,Є,Ї,Ѓ,І,а,б,д,ё,е,э,ф,г,и,й,к,л,м,н,о,п,р,с,т,у,в,ы,з,ч,я,ю,х,ш,щ,ц,ж,ґ,ў,ѫ,ђ,ѕ,ј,љ,њ,һ,џ,є,ї,ѓ"
  output_character_set: ",A,B,D,E,E,E,F,G,Y,I,K,L,M,N,O,P,R,S,T,U,V,Y,Z,CH,IA,IU,KH,SH,SHCH,TS,ZH,G,U,U,D,DZ,J,LJ,NJ,C,DZ,IE,I,G,I,a,b,d,e,e,e,f,g,y,i,k,l,m,n,o,p,r,s,t,,v,y,z,ch,ia,i,kh,sh,shch,ts,zh,g,,,d,dz,j,lj,nj,c,dz,ie,i,g"
....................................

The file is large, so I have pasted only a chunk of it above. Let me know how we should process it further.

ronaldtse commented 3 years ago

Thanks @bilashsaha !

For output_character_set: va,ra,a,ā,i,ī,u,ū,ṛ,ṝ,ḻ,ḹ... can we have it broken down per character (keeping diacritics attached), i.e. va,ra,a,ā,i,ī,u,ū,ṛ,ṝ,ḻ,ḹ => v,r,a,ā,i,ī,u,ū,ṛ,ṝ,ḻ,ḹ...
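One possible way to get a per-character breakdown that keeps diacritics attached is to split on grapheme clusters rather than on individual code points. The sketch below uses Ruby's `String#grapheme_clusters` (available from Ruby 2.5); the method name `split_per_character` is made up for this example.

```ruby
# A sketch only: break each output value into user-perceived characters,
# so a combining mark stays with its base letter.
def split_per_character(values)
  values
    .flat_map(&:grapheme_clusters)     # "m\u0310ya" becomes m + candrabindu, y, a
    .reject { |c| c.strip.empty? }     # drop whitespace-only clusters
    .uniq
end

values = ["va", "ra", "a", "ā", "m\u0310ya"]
puts split_per_character(values).join(",")
```

Splitting by code point instead would list a combining mark such as U+0310 as a separate entry, detached from its base letter.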

bilashsaha commented 3 years ago

@ronaldtse I have further normalized both the output and the input character sets and made them readable (i.e. converted the Unicode escapes to readable characters). The script now looks as follows:

#!/usr/bin/env ruby

require 'yaml'

def process_inheritance(inheritance,input_character_set=[],output_character_set=[])
   parent_map_files = inheritance.kind_of?(Array) ? inheritance : [inheritance]
   parent_map_files.each do |map_file|
      parent_map_file = YAML.load_file("maps/#{map_file}.yaml")
      input_character_set << parent_map_file["map"]["characters"]&.keys
      output_character_set << parent_map_file["map"]["characters"]&.values
      unless parent_map_file["map"]["inherit"].nil?
         process_inheritance(parent_map_file["map"]["inherit"],input_character_set,output_character_set)
      end
   end
   return input_character_set, output_character_set
end

def unescape_unicode(s)
   s.gsub(/\\u([\da-fA-F]{4})/) {|m| [$1].pack("H*").unpack("n*").pack("U*")}
end

maps = Dir["maps/*.yaml"]
output = []

maps.each do |system_file|
   map_file = YAML.load_file(system_file)
   output_hash = {}
   output_hash["mapping_name"] = File.basename(system_file, ".yaml")
   output_hash["id"] = map_file["id"]
   output_hash["authority_id"] = map_file["authority_id"]
   output_hash["language"] = map_file["language"]
   output_hash["source_script"] = map_file["source_script"]
   output_hash["destination_script"] = map_file["destination_script"]
   output_hash["input_character_set"] = map_file["map"]["characters"]&.keys || []
   output_hash["output_character_set"] = map_file["map"]["characters"]&.values || []

   unless map_file["map"]["inherit"].nil?
      input_character_set,output_character_set = process_inheritance(map_file["map"]["inherit"])
      output_hash["input_character_set"] << input_character_set
      output_hash["output_character_set"] << output_character_set
   end

   output_hash["input_character_set"] = output_hash["input_character_set"].flatten.compact.map{|e| unescape_unicode(e).scan(/[[:graph:]]/)}.flatten.sort.uniq.join(",")
   output_hash["output_character_set"] = output_hash["output_character_set"].flatten.compact.map{|e| e.scan(/[[:graph:]]/)}.flatten.sort.uniq.join(",")
   output << output_hash
end

output.sort_by!{|map| map["source_script"]}

File.open('character-set/all_mapping.yaml', 'w') {|f| f.write output.to_yaml }

The output of the script looks as follows:

- mapping_name: un-ben-Beng-Latn-2016
  id: 2016
  authority_id: un
  language: iso-639-2:ben
  source_script: Beng
  destination_script: Latn
  input_character_set: ঁ,ং,ঃ,অ,আ,ই,ঈ,উ,ঊ,ঋ,এ,ঐ,ও,ঔ,ক,খ,গ,ঘ,ঙ,চ,ছ,জ,ঝ,ঞ,ট,ঠ,ড,ঢ,ণ,ত,থ,দ,ধ,ন,প,ফ,ব,ভ,ম,য,র,ল,শ,ষ,স,হ,়,া,ি,ী,ু,ূ,ৃ,ে,ৈ,ো,ৌ,্,ৎ,ড়,ঢ়,য়,‌
  output_character_set: A,a,b,c,d,e,g,h,i,j,k,l,m,n,o,p,r,s,t,u,v,y,ñ,ā,ī,ū,̇,̐,̣,̱,ḍ,ḥ,ṅ,ṇ,ṙ,ṛ,ṣ,ṭ
- mapping_name: bis-ben-Beng-Latn-13194-1991
  id: 1991
  authority_id: bis
  language: iso-639-2:ben
  source_script: Beng
  destination_script: Latn
  input_character_set: ़,्,।,ঁ,ং,ঃ,অ,আ,ই,ঈ,উ,ঊ,ঌ,এ,ঐ,ও,ঔ,ক,খ,গ,ঘ,ঙ,চ,ছ,জ,ঝ,ঞ,ট,ঠ,ড,ঢ,ণ,ত,থ,দ,ধ,ন,প,ফ,ব,ভ,ম,য,র,ল,শ,ষ,স,হ,়,া,ি,ী,ু,ূ,ৃ,ে,ৈ,ো,ৌ,্,ৎ,ড়,ঢ়,য়,ৠ,‍
  output_character_set: ".,a,b,c,d,g,h,i,j,k,l,m,n,p,r,s,t,u,y,ā,ē,ī,ŏ,ū,́,̂,̃,̇,̣,ḻ,ṛ,ṣ,ẏ"
- mapping_name: un-asm-Beng-Latn-1972
  id: 1972
  authority_id: un
  language: iso-639-2:ben
  source_script: Beng
  destination_script: Latn
  input_character_set: ঁ,ং,ঃ,অ,আ,ই,ঈ,উ,ঊ,ঋ,এ,ঐ,ও,ঔ,ক,খ,গ,ঘ,ঙ,চ,ছ,জ,ঝ,ঞ,ট,ঠ,ড,ঢ,ণ,ত,থ,দ,ধ,ন,প,ফ,ব,ভ,ম,য,ল,শ,ষ,স,হ,়,া,ি,ী,ু,ূ,ৃ,ে,ৈ,ো,ৌ,্,ৎ,ড়,ঢ়,য়,ৰ,ৱ
  output_character_set: a,b,c,d,e,g,h,i,j,k,l,m,n,o,p,r,s,t,u,v,y,ñ,ā,ī,ū,̇,̐,̱,ḍ,ḥ,ṅ,ṇ,ṙ,ṛ,ṣ,ṭ
- mapping_name: iso-pli-Beng-Latn-15919-2001
  id: 15919-2001
  authority_id: iso
  language: iso-639-2:pli
  source_script: Beng
  destination_script: Latn
  input_character_set: ঁ,ং,ঃ,অ,আ,ই,ঈ,উ,ঊ,ঋ,ঌ,এ,ঐ,ও,ঔ,ক,খ,গ,ঘ,ঙ,চ,ছ,জ,ঝ,ঞ,ট,ঠ,ড,ঢ,ণ,ত,থ,দ,ধ,ন,প,ফ,ব,ভ,ম,য,র,ল,শ,ষ,স,হ,া,ি,ী,ু,ূ,ৃ,ৄ,ে,ৈ,ো,ৌ,্,ৎ,ড়,ঢ়,য়,ৠ,ৡ,ৢ,ৣ,০,১,২,৩,৪,৫,৬,৭,৮,৯,ৰ,ৱ
  output_character_set: 0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,g,h,i,j,k,l,m,n,o,p,r,s,t,u,v,y,ñ,ā,ī,ū,́,̇,̐,ḍ,ḥ,ḹ,ḻ,ṅ,ṇ,ṙ,ṛ,ṝ,ṣ,ṭ,ẏ
- mapping_name: iso-ben-Beng-Latn-15919-2001
  id: 15919-2001
  authority_id: iso
  language: iso-639-2:ben
  source_script: Beng
  destination_script: Latn
  input_character_set: ঁ,ং,ঃ,অ,আ,ই,ঈ,উ,ঊ,ঋ,ঌ,এ,ঐ,ও,ঔ,ক,খ,গ,ঘ,ঙ,চ,ছ,জ,ঝ,ঞ,ট,ঠ,ড,ঢ,ণ,ত,থ,দ,ধ,ন,প,ফ,ব,ভ,ম,য,র,ল,শ,ষ,স,হ,া,ি,ী,ু,ূ,ৃ,ৄ,ে,ৈ,ো,ৌ,্,ৎ,ড়,ঢ়,য়,ৠ,ৡ,ৢ,ৣ,০,১,২,৩,৪,৫,৬,৭,৮,৯
  output_character_set: 0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,g,h,i,j,k,l,m,n,o,p,r,s,t,u,v,y,ñ,ā,ī,ū,́,̇,̐,ḍ,ḥ,ḹ,ḻ,ṅ,ṇ,ṙ,ṛ,ṝ,ṣ,ṭ,ẏ
- mapping_name: alalc-ben-Beng-Latn-1997
  id: 1997
  authority_id: alalc
  language: iso-639-2:ben
  source_script: Beng
  destination_script: Latn
  input_character_set: ऽ,ঁ,ং,ঃ,অ,আ,ই,ঈ,উ,ঊ,ঋ,ঌ,এ,ঐ,ও,ঔ,ক,খ,গ,ঘ,ঙ,চ,ছ,জ,ঝ,ঞ,ট,ঠ,ড,ঢ,ণ,ত,থ,দ,ধ,ন,প,ফ,ব,ভ,ম,য,র,ল,শ,ষ,স,হ,়,া,ি,ী,ু,ূ,ৃ,ে,ৈ,ো,ৌ,্,ৎ,ড়,য়,ৠ,০,১,২,৩,৪,৫,৬,৭,৮,৯
  output_character_set: 0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,g,h,i,j,k,l,m,n,o,p,r,s,t,u,y,ā,ī,ū,́,̃,̄,̇,̐,̣,̱,ḹ,ṛ,ṝ,’
- mapping_name: iso-asm-Beng-Latn-15919-2001
  id: 15919-2001
  authority_id: iso
  language: iso-639-2:asm
  source_script: Beng
  destination_script: Latn
  input_character_set: ঁ,ং,ঃ,অ,আ,ই,ঈ,উ,ঊ,ঋ,ঌ,এ,ঐ,ও,ঔ,ক,খ,গ,ঘ,ঙ,চ,ছ,জ,ঝ,ঞ,ট,ঠ,ড,ঢ,ণ,ত,থ,দ,ধ,ন,প,ফ,ব,ভ,ম,য,র,ল,শ,ষ,স,হ,া,ি,ী,ু,ূ,ৃ,ৄ,ে,ৈ,ো,ৌ,্,ৎ,ড়,ঢ়,য়,ৠ,ৡ,ৢ,ৣ,০,১,২,৩,৪,৫,৬,৭,৮,৯,ৰ,ৱ
  output_character_set: 0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,g,h,i,j,k,l,m,n,o,p,r,s,t,u,v,y,ñ,ā,ī,ū,́,̇,̐,ḍ,ḥ,ḹ,ḻ,ṅ,ṇ,ṙ,ṛ,ṝ,ṣ,ṭ,ẏ
- mapping_name: icao-ukr-Cyrl-Latn-9303
  id: 9303
  authority_id: icao
  language: iso-639-2:ukr
  source_script: Cyrl
  destination_script: Latn
  input_character_set: "',Ё,Ђ,Ѓ,Є,Ѕ,І,Ї,Ј,Љ,Њ,Ў,Џ,А,Б,В,Г,Д,Е,Ж,З,И,Й,К,Л,М,Н,О,П,Р,С,Т,У,Ф,Х,Ц,Ч,Ш,Щ,Ы,Э,Ю,Я,а,б,в,г,д,е,ж,з,и,й,к,л,м,н,о,п,р,с,т,у,ф,х,ц,ч,ш,щ,ы,э,ю,я,ё,ђ,ѓ,є,ѕ,ї,ј,љ,њ,ў,џ,Ѫ,ѫ,Ґ,ґ,Һ,һ"
  output_character_set: A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,R,S,T,U,V,Y,Z,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,r,s,t,v,y,z
- mapping_name: bgnpcgn-tat-Cyrl-Latn-2007
  id: 2007
  authority_id: bgnpcgn
  language: iso-639-2:tat
  source_script: Cyrl
  destination_script: Latn
  input_character_set: А,Б,В,Г,Д,Е,Ж,З,И,Й,К,Л,М,Н,О,П,Р,С,Т,У,Ф,Х,Ц,Ч,Ш,Щ,Ъ,Ы,Ь,Э,Ю,Я,а,б,в,г,д,е,ж,з,и,й,к,л,м,н,о,п,р,с,т,у,ф,х,ц,ч,ш,щ,ъ,ы,ь,э,ю,я,Җ,җ,Ң,ң,Ү,ү,Һ,һ,Ә,ә,Ө,ө
  output_character_set: A,B,C,D,E,F,H,I,J,L,M,N,O,P,Q,R,S,T,U,V,W,Y,Z,a,b,c,d,e,f,h,i,j,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,Ä,Ç,Ñ,Ö,Ü,ä,ç,ñ,ö,ü,Ğ,ğ,İ,ı,Ş,ş,Ə,ə,Х,’,Ꞑ,ꞑ
- mapping_name: bgnpcgn-bul-Cyrl-Latn-2013
  id: 2013
  authority_id: bgnpcgn
  language: iso-639-2:bul
  source_script: Cyrl
  destination_script: Latn
  input_character_set: А,Б,В,Г,Д,Е,Ж,З,И,Й,К,Л,М,Н,О,П,Р,С,Т,У,Ф,Х,Ц,Ч,Ш,Щ,Ъ,Ь,Ю,Я,а,б,в,г,д,е,ж,з,и,й,к,л,м,н,о,п,р,с,т,у,ф,х,ц,ч,ш,щ,ъ,ь,ю,я,Ѣ,ѣ,Ѫ,ѫ
  output_character_set: "',A,B,C,D,E,F,G,I,K,L,M,N,O,P,R,S,T,U,V,Y,Z,\\,a,b,c,d,e,f,g,h,i,k,l,m,n,o,p,r,s,t,u,v,y,z,̆"
- mapping_name: odni-srp-Cyrl-Latn-2015
  id: 2015
  authority_id: odni
  language: iso-639-2:srp
  source_script: Cyrl
  destination_script: Latn
  input_character_set: Ђ,Ј,Љ,Њ,Ћ,Џ,А,Б,В,Г,Д,Е,Ж,З,И,К,Л,М,Н,О,П,Р,С,Т,У,Ф,Х,Ц,Ч,Ш,а,б,в,г,д,е,ж,з,и,к,л,м,н,о,п,р,с,т,у,ф,х,ц,ч,ш,ђ,ј,љ,њ,ћ,џ
  output_character_set: A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,R,S,T,U,V,Z,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,r,s,t,u,v,z
- mapping_name: odni-uig-Cyrl-Latn-2015
  id: 2015
  authority_id: odni
  language: iso-639-2:uig
  source_script: Cyrl
  destination_script: Latn
  input_character_set: Ё,А,Б,В,Г,Д,Е,Ж,З,И,Й,К,Л,М,Н,О,П,Р,С,Т,У,Ф,Х,Ч,Ш,Ю,Я,а,б,в,г,д,е,ж,з,и,й,к,л,м,н,о,п,р,с,т,у,ф,х,ч,ш,ю,я,ё,Ғ,ғ,Җ,җ,Қ,қ,Ң,ң,Ү,ү,Һ,һ,Ә,ә,Ө,ө
  output_character_set: A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,W,X,Y,Z,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,w,x,y,z

ronaldtse commented 3 years ago

@bilashsaha the output looks good, but this row seems a bit strange?

  output_character_set: "',A,B,C,D,E,F,G,I,K,L,M,N,O,P,R,S,T,U,V,Y,Z,\\,a,b,c,d,e,f,g,h,i,k,l,m,n,o,p,r,s,t,u,v,y,z,̆"

It contains \\?

bilashsaha commented 3 years ago

@ronaldtse Yes, I have investigated and found that bgnpcgn-bul-Cyrl-Latn-2013 inherits from bgnpcgn-bul-Cyrl-Latn-1952, and in bgnpcgn-bul-Cyrl-Latn-1952 there is a mapping '\u042c': "\\'" at this line: https://github.com/interscript/interscript/blob/master/maps/bgnpcgn-bul-Cyrl-Latn-1952.yaml#L79

ronaldtse commented 3 years ago

@bilashsaha in this particular map bgnpcgn-bul-Cyrl-Latn-1952, the rule is supposed to be:

\u042c => '

Can you help fix this?
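For reference, a small sketch of why the stray backslash shows up: in a double-quoted YAML scalar, `\\` is an escape for a single backslash, so `"\\'"` loads as two characters. The keys `current` and `intended` below are placeholders chosen for this example, not keys from the actual map file.

```ruby
require 'yaml'

# Single-quoted heredoc so Ruby itself does not interpret the backslashes.
doc = <<~'YAML'
  current: "\\'"
  intended: "'"
YAML

rules = YAML.safe_load(doc)
p rules["current"].chars    # two characters: a backslash and an apostrophe
p rules["intended"].chars   # one character: the apostrophe
```

Writing the value as `"'"` should therefore yield the intended single apostrophe.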

CAMOBAP commented 3 years ago

Regarding the top message:

We now know there are systems that span multiple languages,

We have proof-of-concept for multi-language systems https://github.com/interscript/interscript/pull/570

or support different character sets (same script code, e.g. Latn, but different characters, e.g. diacritics).

Proposed approach

I propose a similar approach for "output characters":

Benefits:

Open questions

@ronaldtse @bilashsaha please review the concept

ronaldtse commented 3 years ago

Ping @ribose-jeffreylau to follow up on this.