Open ronaldtse opened 3 years ago
@ronaldtse I have written a script to extract the character set of all the mappings we have. The script is as follows:
#!/usr/bin/env ruby
require 'yaml'

# Recursively collect the rule keys and values of every inherited (parent) map.
def process_inheritance(inheritance, input_character_set = [], output_character_set = [])
  parent_map_files = inheritance.kind_of?(Array) ? inheritance : [inheritance]
  parent_map_files.each do |map_file|
    parent_map_file = YAML.load_file("maps/#{map_file}.yaml")
    input_character_set << parent_map_file["map"]["characters"]&.keys
    output_character_set << parent_map_file["map"]["characters"]&.values
    unless parent_map_file["map"]["inherit"].nil?
      process_inheritance(parent_map_file["map"]["inherit"], input_character_set, output_character_set)
    end
  end
  return input_character_set, output_character_set
end

maps = Dir["maps/*.yaml"]
output = []
maps.each do |system_file|
  map_file = YAML.load_file(system_file)
  output_hash = {}
  output_hash["mapping_name"] = File.basename(system_file, ".yaml")
  output_hash["id"] = map_file["id"]
  output_hash["authority_id"] = map_file["authority_id"]
  output_hash["language"] = map_file["language"]
  output_hash["source_script"] = map_file["source_script"]
  output_hash["destination_script"] = map_file["destination_script"]
  # Rule keys are the input characters, rule values are the output characters.
  output_hash["input_character_set"] = map_file["map"]["characters"]&.keys || []
  output_hash["output_character_set"] = map_file["map"]["characters"]&.values || []
  unless map_file["map"]["inherit"].nil?
    input_character_set, output_character_set = process_inheritance(map_file["map"]["inherit"])
    output_hash["input_character_set"] << input_character_set
    output_hash["output_character_set"] << output_character_set
  end
  output_hash["input_character_set"] = output_hash["input_character_set"].flatten.join(',')
  output_hash["output_character_set"] = output_hash["output_character_set"].flatten.join(',')
  output << output_hash
end

output.sort_by! { |map| map["source_script"] }
File.open('character-set/all_mapping.yaml', 'w') { |f| f.write output.to_yaml }
The output of this script is as follows:
- mapping_name: un-ben-Beng-Latn-2016
id: 2016
authority_id: un
language: iso-639-2:ben
source_script: Beng
destination_script: Latn
input_character_set: অ,আ,ই,ঈ,উ,ঊ,ঋ,এ,ঐ,ও,ঔ,\u09be,\u09bf,\u09c0,\u09c1,\u09c2,\u09c3,\u09c7,\u09c8,\u09cb,\u09cc,গু,রু,শু,হু,ন্তু,স্তু,রূ,হৃ,\u0982,\u0981,\u0983,\u09cd\u200c,ৎ,ক,খ,গ,ঘ,ঙ,চ,ছ,জ,ঝ,ঞ,ট,ঠ,ড,ঢ,ণ,ত,থ,দ,ধ,ন,প,ফ,ব,ভ,ম,য,র,ল,শ,ষ,স,হ,ড়,ঢ়,য়,ক্ক,ক্ট,ক্ত,ক্ন,ক্ম,ক্র,ক্ল,ক্ব,ক্ষ,ক্ষ্ন,ক্ষ্ম,ক্ষ্ব,ক্স,গ্গ,গ্দ,গ্ধ,গ্ন,গ্ম,গ্র,গ্ল,ঘ্র,ঙ্ক,ঙ্গ,চ্চ,চ্ছ,চ্ছ্ব,চ্ঞ,জ্জ,জ্জ্ব,জ্ঝ,জ্ঞ,জ্ব,ঞ্চ,ঞ্ছ,ঞ্জ,ঞ্ঝ,ট্ট,ড্ড,ণ্ট,ণ্ঠ,ণ্ড,ত্ত,ত্ত্ব,ত্থ,ত্ন,ত্ম,ত্র,ত্ল,ত্ব,দ্দ,দ্দ্ব,দ্ধ,দ্ধ্ব,দ্ন,দ্ব,দ্ভ,দ্ম,দ্র,দ্ল,ধ্র,ন্ঠ,ন্ড,ন্ক,ন্ত,ন্ত্র,ন্থ,ন্দ,ন্দ্র,ন্ধ,ন্ন,ন্ম,ন্ব,প্ন,প্ত,প্প,প্র,প্ল,ফ্র,ব্জ,ব্দ,ব্ধ,ব্ব,ব্র,ভ্র,ম্প,ম্ব,ম্ভ,ম্ভ্র,ম্ম,ম্র,ম্ল,ল্ক,ল্ট,ল্ড,ল্ম,ল্ল,শ্চ,শ্ছ,শ্ত,শ্ন,শ্ম,শ্র,শ্ল,শ্ব,ষ্ক,ষ্ট,ষ্ট্র,ষ্ঠ,ষ্ঞ,ষ্প,ষ্ফ,স্ক,স্ক্র,স্খ,স্ত,স্ন,স্ম,স্র,স্ব,হ্ন,হ্ম,হ্র,হ্ল,ক্ট্র,ক্ত্র,ক্য,ক্ষ্ণ,খ্য,খ্র,গ্ণ,গ্ধ্য,গ্ধ্র,গ্ন্য,গ্ব,গ্য,গ্র্য,ঘ্ন,ঘ্য,ঙ্ক্ত,ঙ্ক্য,ঙ্ক্ষ,ঙ্খ,ঙ্গ্য,ঙ্ঘ,ঙ্ঘ্য,ঙ্ঘ্র,ঙ্ম,চ্ছ্র,চ্ব,চ্য,জ্য,জ্র,ট্ব,ট্ম,ট্য,ট্র,ড্ব,ড্য,ড্র,ড়্গ,ঢ্য,ঢ্র,ণ্ঠ্য,ণ্ড্য,ণ্ড্র,ণ্ঢ,ণ্ণ,ণ্ব,ণ্ম,ণ্য,ৎক,ত্ত্য,ত্ম্য,ত্য,ত্র্য,ৎল,ৎস,থ্ব,থ্য,থ্র,দ্গ,দ্ঘ,দ্ভ্র,দ্য,দ্র্য,ধ্ন,ধ্ব,ধ্ম,ধ্য,ন্ট,ন্ট্র,ন্ড্র,ন্ত্ব,ন্ত্য,ন্ত্র্য,ন্থ্র,ন্দ্য,ন্দ্ব,ন্ধ্য,ন্ধ্র,ন্য,প্ট,প্য,প্র্য,প্স,ফ্ল,ব্য,ব্ল,ভ্ব,ভ্য,ম্ন,ম্প্র,ম্ফ,ম্ব্র,ম্য,য্য,র্ক,র্ক্য,র্গ্য,র্ঘ্য,র্চ্য,র্জ্য,র্ণ্য,র্ত্য,র্থ্য,র্ব্য,র্ম্য,র্শ্য,র্ষ্য,র্হ্য,র্খ,র্গ,র্গ্র,র্ঘ,র্চ,র্ছ,র্জ,র্ঝ,র্ট,র্ড,র্ণ,র্ত,র্ত্র,র্থ,র্দ,র্দ্ব,র্দ্র,র্ধ,র্ধ্ব,র্ন,র্প,র্ফ,র্ভ,র্ম,র্য,র্ল,র্শ,র্শ্ব,র্ষ,র্স,র্হ,র্ঢ্য,ল্ক্য,ল্গ,ল্প,ল্ফ,ল্ফ,ল্ব,ল্ভ,ল্য,শ্য,ষ্ক্র,ষ্ট্য,ষ্ঠ্য,ষ্ণ,ষ্প্র,ষ্ব,ষ্ম,ষ্য,স্ট,স্ট্র,স্ত্ব,স্ত্য,স্ত্র,স্থ,স্থ্য,স্প,স্প্র,স্প্ল,স্ফ,স্য,স্ল,হ্ণ,হ্ব,হ্য
output_character_set: a,ā,i,ī,u,ū,ṛ,e,ai,o,au,ā,i,ī,u,ū,ṛ,e,ai,o,au,gu,ru,shu,hu,ntu,stu,rū,hṛ,ṁ,m̐,ḥ,,t,ka,kha,ga,gha,ṅa,cha,chha,ja,jha,ña,ṭa,ṭha,ḍa,ḍha,ṇa,ta,tha,da,dha,na,pa,pha,ba,bha,ma,j̱aA,ra,la,sha,ṣha,sa,ha,ṙa,ṙha,ya,kka,kṭa,kta,kna,kma,kra,kla,kva,kṣha,kṣhna,kṣma,kṣhva,ksa,gga,gda,gdha,gna,gma,gra,gla,ghra,ṅka,ṅga,chcha,chchha,chchhva,chña,jja,jjva,jjha,jña,jva,ñcha,ñchha,ñja,ñjha,ṭṭa,ḍḍa,ṇṭa,ṇṭha,ṇḍa,tta,ttva,ttha,tna,tma,tra,tla,tva,dda,ddva,ddha,ddhva,dna,dva,dbha,dma,dra,dla,dhra,nṭha,nḍa,nka,nta,ntra,ntha,nda,ndra,ndha,nna,nma,nva,pna,pta,ppa,pra,pla,phra,bja,bda,bdha,bba,bra,bhra,mpa,mba,mbha,mbhra,mma,mra,mla,lka,lṭa,lḍa,lma,lla,shcha,shchha,shta,shna,shma,shra,shla,shva,ṣhka,ṣhṭa,ṣhṭra,ṣhṭha,ṣhña,ṣhpa,ṣhpha,ska,skra,skha,sta,sna,sma,sra,sva,hna,hma,hra,hla,kṭra,ktra,kya,kṣṇa,khaj̱a,khra,gṇa,gdhya,gdhra,gnya,gva,gya,grya,ghna,ghya,ṅkata,ṅkaya,ṅkṣa,ṅkha,ṅgaya,ṅgha,ṅghya,ṅghra,ṅma,cchra,cva,cya,jya,jra,ṭva,ṭma,ṭya,ṭra,ḍva,ḍya,ḍra,ḍga,ḍhya,ḍhra,ṇṭhya,ṇḍya,ṇḍra,ṇḍha,ṇṇa,ṇva,ṇma,ṇya,tka,ttya,tmya,tya,trya,tla,tsa,thva,thya,thra,dga,dgha,dbhra,dya,draya,dhna,dhva,dhma,dya,nṭa,nṭra,nḍra,ntva,ntaya,ntraya,nthra,ndya,ndva,ndhya,ndhra,nya,pṭa,pya,praya,psa,phla,bya,bla,bhva,bhya,mna,mpra,mpha,mvra,mya,j̱aya,rka,rkya,rgya,rghya,rchya,rjya,rṇya,rtya,rthya,rvya,rmya,rshya,rṣhya,rhya,rkha,rga,rgra,rgha,rcha,rchha,rja,rjha,rṭa,rḍa,rṇa,rta,rtra,rtha,rda,rdva,rdra,rdha,rdhba,rna,rpa,rpha,rbha,rma,rya,rla,rsha,rshba,rṣha,rsa,rha,rḍhya,lkaya,lga,lpa,lpha,lpha,lba,lbha,lya,sya,ṣkra,ṣṭya,ṣṭhya,ṣṇa,ṣpra,ṣva,ṣma,ṣya,sṭa,sṭra,stva,stṣya,stra,stha,sthya,spa,spra,spala,spha,sya,sla,hṇa,hva,hya
- mapping_name: bis-ben-Beng-Latn-13194-1991
id: 1991
authority_id: bis
language: iso-639-2:ben
source_script: Beng
destination_script: Latn
input_character_set: অ,আ,ই,ঈ,উ,ঊ,ৠ,ঌ,এ,ঐ,ও,ঔ,ক,খ,গ,ঘ,ঙ,চ,ছ,জ,ঝ,ঞ,ট,ঠ,ড,ড়,ঢ,ঢ়,ণ,ত,ৎ,থ,দ,ধ,ন,প,ফ,ব,ভ,ম,য,য়,য়,র,ল,শ,ষ,স,হ,ঁ,ঃ
,ং,\u09be,\u09bf,\u09c0,\u09c1,\u09c2,\u09c3,\u09c7,\u09c8,\u09cb,\u09cc,\u09CD,्,़,।,
output_character_set: a,ā,i,ī,u,ū,ṛ,ḻ,ē,ai,ŏ,au,k,kh,g,gh,ṅ,c,ch,j,jh,ñ,ṭ,ṭh,ḍ,d̂,ḍh,d̂h,ṇ,t,t,th,d,dh,n,p,ph,b,bh,m,y,ẏ,ẏ,r,l,ś,ṣ,s,h,m,ḥ,ṃ,ā,i,ī,u,ū,ṛ,ē,ai,ŏ,au,,,,.,
- mapping_name: un-asm-Beng-Latn-1972
id: 1972
authority_id: un
language: iso-639-2:ben
source_script: Beng
destination_script: Latn
input_character_set: অ,আ,ই,ঈ,উ,ঊ,ঋ,এ,ঐ,ও,ঔ,\u09be,\u09bf,\u09c0,\u09c1,\u09c2,\u09c3,\u09c7,\u09c8,\u09cb,\u09cc,\u0982,\u0981,\u0983,\u09cd,ক,খ,গ,ঘ,ঙ,চ,ছ,জ,ঝ,ঞ,ট,ঠ,ড,ঢ,ণ,ত,থ,দ,ধ,ন,প,ফ,ব,ভ,ম,য,ৰ,ল,ৱ,শ,ষ,স,হ,ৎ,ড়,ঢ়,য়,য়,ড়,ঢ়
output_character_set: a,ā,i,ī,u,ū,ṛ,e,ai,o,au,ā,i,ī,u,ū,ṛ,e,ai,o,au,ṁ,m̐,ḥ,,ka,kha,ga,gha,ṅa,cha,chha,ja,jha,ña,ṭa,ṭha,ḍa,ḍha,ṇa,ta,tha,da,dha,na,pa,pha,ba,bha,ma,j̱a,ra,la,va,sha,ṣha,sa,ha,t,ṙa,ṙha,ya,ya,ṙa,ya
- mapping_name: iso-pli-Beng-Latn-15919-2001
id: 15919-2001
authority_id: iso
language: iso-639-2:pli
source_script: Beng
destination_script: Latn
input_character_set: ৰ,ৱ,অ,আ,ই,ঈ,উ,ঊ,ঋ,ৠ,ঌ,ৡ,এ,ঐ,ও,ঔ,া,ি,ী,ু,ূ,ৃ,ৄ,ৢ,ৣ,ে,ৈ,ো,ৌ,ং,ঁ,ঃ,ক,খ,গ,ঘ,ঙ,চ,ছ,জ,ঝ,ঞ,ট,ঠ,ড,ঢ,ণ,ত,থ,দ,ধ,ন,প,ফ,ব,ভ,ম,য,র,ল,শ,ষ,স,হ,ড়,ঢ়,য়,যঁ,রঁ,লঁ,বঁ,ৎ,্,১,২,৩,৪,৫,৬,৭,৮,৯,০
output_character_set: va,ra,a,ā,i,ī,u,ū,ṛ,ṝ,ḻ,ḹ,e,ai,o,au,ā,i,ī,u,ū,ṛ,ṝ,ḻ,ḹ,e,ai,o,au,ṁ,m̐,ḥ,ka,kha,ga,gha,ṅa,ca,cha,ja,jha,ña,ṭa,ṭha,ḍa,ḍha,ṇa,ta,tha,da,dha,na,pa,pha,ba,bha,ma,ya,ra,la,śa,ṣa,sa,ha,ṙa,ṙha,ẏa,m̐ya,m̐ra,m̐la,m̐va,t,,1,2,3,4,5,6,7,8,9,0
- mapping_name: iso-ben-Beng-Latn-15919-2001
id: 15919-2001
authority_id: iso
language: iso-639-2:ben
source_script: Beng
destination_script: Latn
input_character_set: অ,আ,ই,ঈ,উ,ঊ,ঋ,ৠ,ঌ,ৡ,এ,ঐ,ও,ঔ,া,ি,ী,ু,ূ,ৃ,ৄ,ৢ,ৣ,ে,ৈ,ো,ৌ,ং,ঁ,ঃ,ক,খ,গ,ঘ,ঙ,চ,ছ,জ,ঝ,ঞ,ট,ঠ,ড,ঢ,ণ,ত,থ,দ,ধ,ন,প,ফ,ব,ভ,ম,য,র,ল,শ,ষ,স,হ,ড়,ঢ়,য়,যঁ,রঁ,লঁ,বঁ,ৎ,্,১,২,৩,৪,৫,৬,৭,৮,৯,০
output_character_set: a,ā,i,ī,u,ū,ṛ,ṝ,ḻ,ḹ,e,ai,o,au,ā,i,ī,u,ū,ṛ,ṝ,ḻ,ḹ,e,ai,o,au,ṁ,m̐,ḥ,ka,kha,ga,gha,ṅa,ca,cha,ja,jha,ña,ṭa,ṭha,ḍa,ḍha,ṇa,ta,tha,da,dha,na,pa,pha,ba,bha,ma,ya,ra,la,śa,ṣa,sa,ha,ṙa,ṙha,ẏa,m̐ya,m̐ra,m̐la,m̐va,t,,1,2,3,4,5,6,7,8,9,0
- mapping_name: alalc-ben-Beng-Latn-1997
id: 1997
authority_id: alalc
language: iso-639-2:ben
source_script: Beng
destination_script: Latn
input_character_set: অ,আ,ই,ঈ,উ,ঊ,এ,ঐ,ও,ঔ,ঋ,ৠ,ঌ,ক,খ,গ,ঘ,ঙ,চ,ছ,জ,ঝ,ঞ,ট,ঠ,ড,ড়,ড়,ঢ,ঢ়,ণ,ত,ৎ,থ,দ,ধ,ন,প,ফ,ব,ভ,ম,য,য়,য়,র,ল,শ,ষ,স,হ,ং,ঃ,\u0981,ऽ,\u09be,\u09bf,\u09c0,\u09c1,\u09c2,\u09c3,\u09c7,\u09c8,\u09cb,\u09cc,\u09cd,০,১,২,৩,৪,৫,৬,৭,৮,৯
output_character_set: a,ā,i,ī,u,ū,e,ai,o,au,ṛ,ṝ,ḹ,ka,kha,ga,gha,ṅa,ca,cha,ja,jha,ña,ṭa,ṭha,ḍa,ṛa,ṛa,ḍha,ṛha,ṇa,ta,ṯa,tha,da,dha,na,pa,pha,ba,bha,ma,ya,ẏa,ẏa,ra,la,śa,sha,sa,ha,ṃ,ḥ,n̐,’,ā,i,ī,u,ū,ṛ,e,ai,o,au,,0,1,2,3,4,5,6,7,8,9
- mapping_name: iso-asm-Beng-Latn-15919-2001
id: 15919-2001
authority_id: iso
language: iso-639-2:asm
source_script: Beng
destination_script: Latn
input_character_set: ৰ,ৱ,অ,আ,ই,ঈ,উ,ঊ,ঋ,ৠ,ঌ,ৡ,এ,ঐ,ও,ঔ,া,ি,ী,ু,ূ,ৃ,ৄ,ৢ,ৣ,ে,ৈ,ো,ৌ,ং,ঁ,ঃ,ক,খ,গ,ঘ,ঙ,চ,ছ,জ,ঝ,ঞ,ট,ঠ,ড,ঢ,ণ,ত,থ,দ,ধ,ন,প,ফ,ব,ভ,ম,য,র,ল,শ,ষ,স,হ,ড়,ঢ়,য়,যঁ,রঁ,লঁ,বঁ,ৎ,্,১,২,৩,৪,৫,৬,৭,৮,৯,০
output_character_set: va,ra,a,ā,i,ī,u,ū,ṛ,ṝ,ḻ,ḹ,e,ai,o,au,ā,i,ī,u,ū,ṛ,ṝ,ḻ,ḹ,e,ai,o,au,ṁ,m̐,ḥ,ka,kha,ga,gha,ṅa,ca,cha,ja,jha,ña,ṭa,ṭha,ḍa,ḍha,ṇa,ta,tha,da,dha,na,pa,pha,ba,bha,ma,ya,ra,la,śa,ṣa,sa,ha,ṙa,ṙha,ẏa,m̐ya,m̐ra,m̐la,m̐va,t,,1,2,3,4,5,6,7,8,9,0
- mapping_name: icao-ukr-Cyrl-Latn-9303
id: 9303
authority_id: icao
language: iso-639-2:ukr
source_script: Cyrl
destination_script: Latn
input_character_set: "',А,Б,Д,Ё,Е,Э,Ф,Г,И,Й,К,Л,М,Н,О,П,Р,С,Т,У,В,Ы,З,Ч,Я,Ю,Х,Ш,Щ,Ц,Ж,Ґ,Ў,Ѫ,Ђ,Ѕ,Ј,Љ,Њ,Һ,Џ,Є,Ї,Ѓ,І,а,б,д,ё,е,э,ф,г,и,й,к,л,м,н,о,п,р,с,т,у,в,ы,з,ч,я,ю,х,ш,щ,ц,ж,ґ,ў,ѫ,ђ,ѕ,ј,љ,њ,һ,џ,є,ї,ѓ"
output_character_set: ",A,B,D,E,E,E,F,G,Y,I,K,L,M,N,O,P,R,S,T,U,V,Y,Z,CH,IA,IU,KH,SH,SHCH,TS,ZH,G,U,U,D,DZ,J,LJ,NJ,C,DZ,IE,I,G,I,a,b,d,e,e,e,f,g,y,i,k,l,m,n,o,p,r,s,t,,v,y,z,ch,ia,i,kh,sh,shch,ts,zh,g,,,d,dz,j,lj,nj,c,dz,ie,i,g"
....................................
The file is big, so I have pasted only a chunk of it above. Let me know how we should process it further.
Thanks @bilashsaha !
For output_character_set: va,ra,a,ā,i,ī,u,ū,ṛ,ṝ,ḻ,ḹ...
can we have it broken down per char (but with diacritics): i.e. va,ra,a,ā,i,ī,u,ū,ṛ,ṝ,ḻ,ḹ
=> v,r,a,ā,i,ī,u,ū,ṝ,ḻ,ḹ...
@ronaldtse I have further normalized the output and input character sets and made them readable (i.e. converted the Unicode escapes to readable characters). The script now looks as follows:
#!/usr/bin/env ruby
require 'yaml'

# Recursively collect the rule keys and values of every inherited (parent) map.
def process_inheritance(inheritance, input_character_set = [], output_character_set = [])
  parent_map_files = inheritance.kind_of?(Array) ? inheritance : [inheritance]
  parent_map_files.each do |map_file|
    parent_map_file = YAML.load_file("maps/#{map_file}.yaml")
    input_character_set << parent_map_file["map"]["characters"]&.keys
    output_character_set << parent_map_file["map"]["characters"]&.values
    unless parent_map_file["map"]["inherit"].nil?
      process_inheritance(parent_map_file["map"]["inherit"], input_character_set, output_character_set)
    end
  end
  return input_character_set, output_character_set
end

# Turn literal "\uXXXX" escape sequences into the characters they denote.
def unescape_unicode(s)
  s.gsub(/\\u([\da-fA-F]{4})/) { |m| [$1].pack("H*").unpack("n*").pack("U*") }
end

maps = Dir["maps/*.yaml"]
output = []
maps.each do |system_file|
  map_file = YAML.load_file(system_file)
  output_hash = {}
  output_hash["mapping_name"] = File.basename(system_file, ".yaml")
  output_hash["id"] = map_file["id"]
  output_hash["authority_id"] = map_file["authority_id"]
  output_hash["language"] = map_file["language"]
  output_hash["source_script"] = map_file["source_script"]
  output_hash["destination_script"] = map_file["destination_script"]
  output_hash["input_character_set"] = map_file["map"]["characters"]&.keys || []
  output_hash["output_character_set"] = map_file["map"]["characters"]&.values || []
  unless map_file["map"]["inherit"].nil?
    input_character_set, output_character_set = process_inheritance(map_file["map"]["inherit"])
    output_hash["input_character_set"] << input_character_set
    output_hash["output_character_set"] << output_character_set
  end
  # Split every rule into single printable code points, then sort and de-duplicate.
  output_hash["input_character_set"] = output_hash["input_character_set"].flatten.compact.map { |e| unescape_unicode(e).scan(/[[:graph:]]/) }.flatten.sort.uniq.join(",")
  output_hash["output_character_set"] = output_hash["output_character_set"].flatten.compact.map { |e| e.scan(/[[:graph:]]/) }.flatten.sort.uniq.join(",")
  output << output_hash
end

output.sort_by! { |map| map["source_script"] }
File.open('character-set/all_mapping.yaml', 'w') { |f| f.write output.to_yaml }
The output of the script looks as follows:
- mapping_name: un-ben-Beng-Latn-2016
id: 2016
authority_id: un
language: iso-639-2:ben
source_script: Beng
destination_script: Latn
input_character_set: ঁ,ং,ঃ,অ,আ,ই,ঈ,উ,ঊ,ঋ,এ,ঐ,ও,ঔ,ক,খ,গ,ঘ,ঙ,চ,ছ,জ,ঝ,ঞ,ট,ঠ,ড,ঢ,ণ,ত,থ,দ,ধ,ন,প,ফ,ব,ভ,ম,য,র,ল,শ,ষ,স,হ,়,া,ি,ী,ু,ূ,ৃ,ে,ৈ,ো,ৌ,্,ৎ,ড়,ঢ়,য়,
output_character_set: A,a,b,c,d,e,g,h,i,j,k,l,m,n,o,p,r,s,t,u,v,y,ñ,ā,ī,ū,̇,̐,̣,̱,ḍ,ḥ,ṅ,ṇ,ṙ,ṛ,ṣ,ṭ
- mapping_name: bis-ben-Beng-Latn-13194-1991
id: 1991
authority_id: bis
language: iso-639-2:ben
source_script: Beng
destination_script: Latn
input_character_set: ़,्,।,ঁ,ং,ঃ,অ,আ,ই,ঈ,উ,ঊ,ঌ,এ,ঐ,ও,ঔ,ক,খ,গ,ঘ,ঙ,চ,ছ,জ,ঝ,ঞ,ট,ঠ,ড,ঢ,ণ,ত,থ,দ,ধ,ন,প,ফ,ব,ভ,ম,য,র,ল,শ,ষ,স,হ,়,া,ি,ী,ু,ূ,ৃ,ে,ৈ,ো,ৌ,্,ৎ,ড়,ঢ়,য়,ৠ,
output_character_set: ".,a,b,c,d,g,h,i,j,k,l,m,n,p,r,s,t,u,y,ā,ē,ī,ŏ,ū,́,̂,̃,̇,̣,ḻ,ṛ,ṣ,ẏ"
- mapping_name: un-asm-Beng-Latn-1972
id: 1972
authority_id: un
language: iso-639-2:ben
source_script: Beng
destination_script: Latn
input_character_set: ঁ,ং,ঃ,অ,আ,ই,ঈ,উ,ঊ,ঋ,এ,ঐ,ও,ঔ,ক,খ,গ,ঘ,ঙ,চ,ছ,জ,ঝ,ঞ,ট,ঠ,ড,ঢ,ণ,ত,থ,দ,ধ,ন,প,ফ,ব,ভ,ম,য,ল,শ,ষ,স,হ,়,া,ি,ী,ু,ূ,ৃ,ে,ৈ,ো,ৌ,্,ৎ,ড়,ঢ়,য়,ৰ,ৱ
output_character_set: a,b,c,d,e,g,h,i,j,k,l,m,n,o,p,r,s,t,u,v,y,ñ,ā,ī,ū,̇,̐,̱,ḍ,ḥ,ṅ,ṇ,ṙ,ṛ,ṣ,ṭ
- mapping_name: iso-pli-Beng-Latn-15919-2001
id: 15919-2001
authority_id: iso
language: iso-639-2:pli
source_script: Beng
destination_script: Latn
input_character_set: ঁ,ং,ঃ,অ,আ,ই,ঈ,উ,ঊ,ঋ,ঌ,এ,ঐ,ও,ঔ,ক,খ,গ,ঘ,ঙ,চ,ছ,জ,ঝ,ঞ,ট,ঠ,ড,ঢ,ণ,ত,থ,দ,ধ,ন,প,ফ,ব,ভ,ম,য,র,ল,শ,ষ,স,হ,া,ি,ী,ু,ূ,ৃ,ৄ,ে,ৈ,ো,ৌ,্,ৎ,ড়,ঢ়,য়,ৠ,ৡ,ৢ,ৣ,০,১,২,৩,৪,৫,৬,৭,৮,৯,ৰ,ৱ
output_character_set: 0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,g,h,i,j,k,l,m,n,o,p,r,s,t,u,v,y,ñ,ā,ī,ū,́,̇,̐,ḍ,ḥ,ḹ,ḻ,ṅ,ṇ,ṙ,ṛ,ṝ,ṣ,ṭ,ẏ
- mapping_name: iso-ben-Beng-Latn-15919-2001
id: 15919-2001
authority_id: iso
language: iso-639-2:ben
source_script: Beng
destination_script: Latn
input_character_set: ঁ,ং,ঃ,অ,আ,ই,ঈ,উ,ঊ,ঋ,ঌ,এ,ঐ,ও,ঔ,ক,খ,গ,ঘ,ঙ,চ,ছ,জ,ঝ,ঞ,ট,ঠ,ড,ঢ,ণ,ত,থ,দ,ধ,ন,প,ফ,ব,ভ,ম,য,র,ল,শ,ষ,স,হ,া,ি,ী,ু,ূ,ৃ,ৄ,ে,ৈ,ো,ৌ,্,ৎ,ড়,ঢ়,য়,ৠ,ৡ,ৢ,ৣ,০,১,২,৩,৪,৫,৬,৭,৮,৯
output_character_set: 0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,g,h,i,j,k,l,m,n,o,p,r,s,t,u,v,y,ñ,ā,ī,ū,́,̇,̐,ḍ,ḥ,ḹ,ḻ,ṅ,ṇ,ṙ,ṛ,ṝ,ṣ,ṭ,ẏ
- mapping_name: alalc-ben-Beng-Latn-1997
id: 1997
authority_id: alalc
language: iso-639-2:ben
source_script: Beng
destination_script: Latn
input_character_set: ऽ,ঁ,ং,ঃ,অ,আ,ই,ঈ,উ,ঊ,ঋ,ঌ,এ,ঐ,ও,ঔ,ক,খ,গ,ঘ,ঙ,চ,ছ,জ,ঝ,ঞ,ট,ঠ,ড,ঢ,ণ,ত,থ,দ,ধ,ন,প,ফ,ব,ভ,ম,য,র,ল,শ,ষ,স,হ,়,া,ি,ী,ু,ূ,ৃ,ে,ৈ,ো,ৌ,্,ৎ,ড়,য়,ৠ,০,১,২,৩,৪,৫,৬,৭,৮,৯
output_character_set: 0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,g,h,i,j,k,l,m,n,o,p,r,s,t,u,y,ā,ī,ū,́,̃,̄,̇,̐,̣,̱,ḹ,ṛ,ṝ,’
- mapping_name: iso-asm-Beng-Latn-15919-2001
id: 15919-2001
authority_id: iso
language: iso-639-2:asm
source_script: Beng
destination_script: Latn
input_character_set: ঁ,ং,ঃ,অ,আ,ই,ঈ,উ,ঊ,ঋ,ঌ,এ,ঐ,ও,ঔ,ক,খ,গ,ঘ,ঙ,চ,ছ,জ,ঝ,ঞ,ট,ঠ,ড,ঢ,ণ,ত,থ,দ,ধ,ন,প,ফ,ব,ভ,ম,য,র,ল,শ,ষ,স,হ,া,ি,ী,ু,ূ,ৃ,ৄ,ে,ৈ,ো,ৌ,্,ৎ,ড়,ঢ়,য়,ৠ,ৡ,ৢ,ৣ,০,১,২,৩,৪,৫,৬,৭,৮,৯,ৰ,ৱ
output_character_set: 0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,g,h,i,j,k,l,m,n,o,p,r,s,t,u,v,y,ñ,ā,ī,ū,́,̇,̐,ḍ,ḥ,ḹ,ḻ,ṅ,ṇ,ṙ,ṛ,ṝ,ṣ,ṭ,ẏ
- mapping_name: icao-ukr-Cyrl-Latn-9303
id: 9303
authority_id: icao
language: iso-639-2:ukr
source_script: Cyrl
destination_script: Latn
input_character_set: "',Ё,Ђ,Ѓ,Є,Ѕ,І,Ї,Ј,Љ,Њ,Ў,Џ,А,Б,В,Г,Д,Е,Ж,З,И,Й,К,Л,М,Н,О,П,Р,С,Т,У,Ф,Х,Ц,Ч,Ш,Щ,Ы,Э,Ю,Я,а,б,в,г,д,е,ж,з,и,й,к,л,м,н,о,п,р,с,т,у,ф,х,ц,ч,ш,щ,ы,э,ю,я,ё,ђ,ѓ,є,ѕ,ї,ј,љ,њ,ў,џ,Ѫ,ѫ,Ґ,ґ,Һ,һ"
output_character_set: A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,R,S,T,U,V,Y,Z,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,r,s,t,v,y,z
- mapping_name: bgnpcgn-tat-Cyrl-Latn-2007
id: 2007
authority_id: bgnpcgn
language: iso-639-2:tat
source_script: Cyrl
destination_script: Latn
input_character_set: А,Б,В,Г,Д,Е,Ж,З,И,Й,К,Л,М,Н,О,П,Р,С,Т,У,Ф,Х,Ц,Ч,Ш,Щ,Ъ,Ы,Ь,Э,Ю,Я,а,б,в,г,д,е,ж,з,и,й,к,л,м,н,о,п,р,с,т,у,ф,х,ц,ч,ш,щ,ъ,ы,ь,э,ю,я,Җ,җ,Ң,ң,Ү,ү,Һ,һ,Ә,ә,Ө,ө
output_character_set: A,B,C,D,E,F,H,I,J,L,M,N,O,P,Q,R,S,T,U,V,W,Y,Z,a,b,c,d,e,f,h,i,j,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,Ä,Ç,Ñ,Ö,Ü,ä,ç,ñ,ö,ü,Ğ,ğ,İ,ı,Ş,ş,Ə,ə,Х,’,Ꞑ,ꞑ
- mapping_name: bgnpcgn-bul-Cyrl-Latn-2013
id: 2013
authority_id: bgnpcgn
language: iso-639-2:bul
source_script: Cyrl
destination_script: Latn
input_character_set: А,Б,В,Г,Д,Е,Ж,З,И,Й,К,Л,М,Н,О,П,Р,С,Т,У,Ф,Х,Ц,Ч,Ш,Щ,Ъ,Ь,Ю,Я,а,б,в,г,д,е,ж,з,и,й,к,л,м,н,о,п,р,с,т,у,ф,х,ц,ч,ш,щ,ъ,ь,ю,я,Ѣ,ѣ,Ѫ,ѫ
output_character_set: "',A,B,C,D,E,F,G,I,K,L,M,N,O,P,R,S,T,U,V,Y,Z,\\,a,b,c,d,e,f,g,h,i,k,l,m,n,o,p,r,s,t,u,v,y,z,̆"
- mapping_name: odni-srp-Cyrl-Latn-2015
id: 2015
authority_id: odni
language: iso-639-2:srp
source_script: Cyrl
destination_script: Latn
input_character_set: Ђ,Ј,Љ,Њ,Ћ,Џ,А,Б,В,Г,Д,Е,Ж,З,И,К,Л,М,Н,О,П,Р,С,Т,У,Ф,Х,Ц,Ч,Ш,а,б,в,г,д,е,ж,з,и,к,л,м,н,о,п,р,с,т,у,ф,х,ц,ч,ш,ђ,ј,љ,њ,ћ,џ
output_character_set: A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,R,S,T,U,V,Z,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,r,s,t,u,v,z
- mapping_name: odni-uig-Cyrl-Latn-2015
id: 2015
authority_id: odni
language: iso-639-2:uig
source_script: Cyrl
destination_script: Latn
input_character_set: Ё,А,Б,В,Г,Д,Е,Ж,З,И,Й,К,Л,М,Н,О,П,Р,С,Т,У,Ф,Х,Ч,Ш,Ю,Я,а,б,в,г,д,е,ж,з,и,й,к,л,м,н,о,п,р,с,т,у,ф,х,ч,ш,ю,я,ё,Ғ,ғ,Җ,җ,Қ,қ,Ң,ң,Ү,ү,Һ,һ,Ә,ә,Ө,ө
output_character_set: A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,W,X,Y,Z,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,w,x,y,z
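A note on the per-character split: scanning with /[[:graph:]]/ works on single code points, so combining marks end up as separate entries in the output sets above (for example a standalone ̐ or ̱). If the goal is "per char, but with diacritics", one possible approach is to normalize to NFC and split on grapheme clusters instead. This is only a minimal sketch; `character_set` is a hypothetical helper name, not something the script above defines.

```ruby
# Hypothetical helper: split strings into user-perceived characters so that
# combining marks stay attached to their base letter.
def character_set(strings)
  strings
    .compact                                                    # skip maps without rules
    .flat_map { |s| s.unicode_normalize(:nfc).grapheme_clusters }
    .select { |c| c.match?(/[[:graph:]]/) }                     # drop whitespace-only clusters
    .uniq
    .sort
end

# "m" + U+0310 has no precomposed form and stays one cluster;
# "s" + U+0323 is composed to U+1E63 by NFC.
p character_set(["m\u0310ya", "s\u0323ha", nil])
```

String#unicode_normalize and String#grapheme_clusters are part of core Ruby (the latter since 2.5), so no extra gem is assumed.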
@bilashsaha the output looks good, but this row seems a bit strange?
output_character_set: "',A,B,C,D,E,F,G,I,K,L,M,N,O,P,R,S,T,U,V,Y,Z,\\,a,b,c,d,e,f,g,h,i,k,l,m,n,o,p,r,s,t,u,v,y,z,̆"
It contains \\?
@ronaldtse Yes, I have investigated and found that bgnpcgn-bul-Cyrl-Latn-2013 inherits bgnpcgn-bul-Cyrl-Latn-1952, and in bgnpcgn-bul-Cyrl-Latn-1952 there is a mapping '\u042c': "\\'" at this line: https://github.com/interscript/interscript/blob/master/maps/bgnpcgn-bul-Cyrl-Latn-1952.yaml#L79
@bilashsaha in this particular map bgnpcgn-bul-Cyrl-Latn-1952, the rule is supposed to be:
\u042c => '
Can you help fix this?
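To catch similar cases in other maps, a check along the following lines could list every rule whose output contains a literal backslash. This is a rough sketch: `rules_with_backslash` is a hypothetical helper, and it assumes the rules have already been loaded into a hash of input => output (output being a string or an array of strings).

```ruby
# Hypothetical check: return the rules whose output value contains a backslash.
def rules_with_backslash(characters)
  (characters || {}).select do |_input, output|
    # Array() lets the same code handle a single string or a list of strings.
    Array(output).any? { |o| o.to_s.include?("\\") }
  end
end

# Sample data modelled on the rule quoted above; "\u0411" => "B" is illustrative only.
rules = { "\u042c" => "\\'", "\u0411" => "B" }
p rules_with_backslash(rules)
```

Whether a backslash is ever intended in an output value is not something this thread settles, so the result is a list of candidates to review rather than a list of errors.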
Regarding the top message:
"We now know there are systems that span multiple languages, or support different character sets (same script code, e.g. Latn, but different characters, e.g. diacritics)."
We have a proof-of-concept for multi-language systems: https://github.com/interscript/interscript/pull/570
I propose a similar approach for "output characters":
- The destination_script property accepts a hash of values.
- Each value of destination_script can be an alias (Latin, Cyrl) or an array of accepted characters (as shown in the implementation above). The examples below are valid:
- mapping_name: bgnpcgn-tat-Cyrl-Latn-2007
id: 2007
authority_id: bgnpcgn
language: iso-639-2:tat
source_script: Cyrl
destination_script: Latn
... or ....
- mapping_name: bgnpcgn-tat-Cyrl-Latn-2007
id: 2007
authority_id: bgnpcgn
language: iso-639-2:tat
source_script: Cyrl
destination_script:
Latn:
ASCII:
MY_CUSTOM_CHARSET: [A,B,C,D,E,F,H,I,J,L,M,N,O,P,Q,R,S,T,U,V,W,Y,Z,a,b,c,d,e,f,h,i,j,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,Ä,Ç,Ñ,Ö,Ü,ä,ç,ñ,ö,ü,Ğ,ğ,İ,ı,Ş,ş,Ə,ə,Х,’,Ꞑ,ꞑ]
- destination_script and language
- ASCII script as a more strict version of Latn
  - ASCII = Latin - (characters with diacritics)
Benefits:
@ronaldtse @bilashsaha please review the concept
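To make the proposal above a little more concrete, here is a minimal sketch of how a destination_script value (an alias or an explicit array of characters) might be checked against transliterated text. Everything here is an assumption for illustration: `ALIASES`, `allowed?`, and the way each alias is defined are hypothetical and not part of Interscript.

```ruby
# Hypothetical alias table. The real alias definitions are not specified in this thread.
ALIASES = {
  # ASCII as the strict subset: no characters with diacritics.
  "ASCII" => ->(text) { text.ascii_only? },
  # Latn: Latin-script letters plus combining marks and script-neutral characters.
  "Latn"  => ->(text) { text.match?(/\A[\p{Latin}\p{M}\p{Common}]*\z/) }
}.freeze

# spec is either an alias name (String) or an Array of accepted characters.
def allowed?(text, spec)
  case spec
  when String
    checker = ALIASES.fetch(spec) { raise ArgumentError, "unknown alias: #{spec}" }
    checker.call(text)
  when Array
    # Compare by grapheme cluster so a base letter and its diacritics count as one character.
    text.grapheme_clusters.all? { |c| spec.include?(c) }
  else
    raise ArgumentError, "unsupported spec: #{spec.inspect}"
  end
end

p allowed?("Kazan", "ASCII")      # plain ASCII text
p allowed?("Qazan \u00e4", "Latn") # Latin text with a diacritic
p allowed?("\u0425", "Latn")       # Cyrillic Kha
p allowed?("ab", %w[a b])          # explicit character list
```

A check like this would also flag the Х in the bgnpcgn-tat-Cyrl-Latn-2007 output set above, which appears to be a Cyrillic letter rather than a Latin one.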
Ping @ribose-jeffreylau to follow up on this.
ISO 24229 was the basis for Interscript system codes.
With the progress of Interscript we are now aware that the original "system" code doesn't quite work, i.e. the pattern "{authority code}-{lang}-{source script}-{target-script}-{id}".
We now know there are systems that span multiple languages, or support different character sets (same script code, e.g. Latn, but different characters, e.g. diacritics).
So we need to introduce a new "spelling system code" for all languages and transliteration systems.
We need to:
The input and output spelling systems can be identified by their defined character sets (in an Interscript map, the rule keys form the input character set and the rule values form the output character set).
This task is to define those character sets.
First, we need to write a script that extracts the input and output character sets of the Interscript systems.