tibetan-nlp / old-tibetan-corpus

Linguistically analyzed Old Tibetan documents and some tools for processing Old Tibetan text
MIT License

Restore missing merged syllables to the OT Chronicle CONLLU file #11

Closed heacu closed 3 years ago

heacu commented 3 years ago

The following merged Old Tibetan syllables can be found in the Unicode version of the OT Chronicle (which I assume is a more or less accurate conversion of OTDO's Wylie), but are missing from BRAT and our CoNLL-U file.

The reference preceding the syllable gives an approximate location of where the syllable should occur.

otchronicle:001:T80 - བགྱིསྣ
otchronicle:002:T5 - ཚངསུ་
otchronicle:002:T632 - ལྟསུ་
otchronicle:002:T985 - རླགི་
otchronicle:003:T1135 - རླགྀ་
otchronicle:003:T1317 - myI la 'phrog (47) [-]om
otchronicle:003:T40 - བརྩིགསོ
otchronicle:003:T1419 - གཤེགསོ
otchronicle:003:T42 - ལགསོ
otchronicle:003:T1437 - གཤེགསོ
otchronicle:003:T1448 - གཤེགསོ
otchronicle:003:T52 - མཆིསྣ
otchronicle:003:T49 - འདའསོ
otchronicle:004:T10 - ལྕེབསའོ
otchronicle:004:T1567 - མནངསུ་
otchronicle:004:T17 - གཤེགསོ -- basically... every occurrence of གཤེགསོ... find all of them
otchronicle:004:T1580 - གསུངསོ -- search for more གསུངསོ, you'll find missing ones
otchronicle:004:T36 - གཤེགསའོ
otchronicle:004:T1802 - འཛངསོ
otchronicle:004:T52 - ལགསོ
otchronicle:006:T2 - བྱསོ (occurs three times close together)
otchronicle:006:T3201 - གསགྀ་
otchronicle:007:T3769 - བསྒོ་ (occurs twice in vicinity)
otchronicle:008:T4344 - སྐུགསོ
otchronicle:009:T4522 - མོལ་
otchronicle:009:T4534 - བཏམསོ
otchronicle:009:T4551 - འཐབསོ
otchronicle:009:T28 - གཐོགསོ
otchronicle:009:T4940 - བཀྱེའོ
otchronicle:009:T43 - ཕགྀ་
otchronicle:010:T14 - བཏགསོ
otchronicle:011:T30 - འབངསུ་
otchronicle:011:T6340 - བླངསོ
otchronicle:012:T33 - བདགྀ་
otchronicle:012:T6960 - as above
otchronicle:014:T3 - བརྩིགསོ
otchronicle:015:T8019 - བདགྀ་
otchronicle:015:T8276 - འབངསུ་
otchronicle:016:T8645 - རིངོ
otchronicle:016:T8753 - བདགྀ་
otchronicle:016:T8759 - བོམསུ
otchronicle:016:T8879 - བདགྀ་
otchronicle:016:T8983 - བརླགོ
otchronicle:016:T9015 - སྟགྀ་
otchronicle:017:T9185 - འབངསུ་
otchronicle:017:T9254 - མོལ་ -- check for other cases of missing མོལ་ - there should be a few
otchronicle:018:T20 - བཏགསའོ
otchronicle:018:T22 - ཡུངསུ་
otchronicle:018:T37 - སྟགྀ་
otchronicle:020:T11075 - བདགྀ་
otchronicle:020:T11322 - སྟགི་
otchronicle:021:T11577 - ཐབསུ
otchronicle:021:T11607 - མཆིསོ
otchronicle:022:T4 - གདངསུ་
otchronicle:022:T12204 - མོལ་ (2 cases)
otchronicle:025:T8 - see ཡར་ཀྱྀ་*ni་ལྷོ་རྔེགས་ཟའ in Tibetan Unicode
otchronicle:026:T14134 - ཐུངེ་
otchronicle:026:T14207 - གཆྀགི་
otchronicle:027:T20 - བཏགསོ
otchronicle:027:T14844 - འབངསུ་ (2 cases)

Corrections should be made directly to the CoNLL-U file @FChrispz
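
For the "find all of them" cases, a rough way to spot remaining gaps is to count each merged syllable in the Unicode text and in the surface text rebuilt from the CoNLL-U FORM column, then report any shortfall. The sketch below is only an illustration, not a tool from this repository: `surface_text` and `missing_counts` are hypothetical names, and it assumes a merged syllable shows up as contiguous text in FORM, either as a single token or as a multiword-token range line. Plain substring counting can also over-count where one syllable is contained in a longer one.

```python
import unicodedata


def surface_text(conllu: str) -> str:
    """Rebuild approximate surface text from the FORM column (column 2)."""
    pieces = []
    skip_until = 0  # highest word ID covered by the current multiword token
    for line in conllu.splitlines():
        if not line.strip():
            # Blank line = sentence boundary: reset the range, break the text.
            skip_until = 0
            pieces.append("\n")
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        if len(cols) < 2:
            continue
        tok_id, form = cols[0], cols[1]
        if "-" in tok_id:
            # Multiword token range such as "1-2": its FORM is the surface
            # form, so the words it covers are skipped below.
            pieces.append(form)
            skip_until = int(tok_id.split("-")[1])
            continue
        if not tok_id.isdigit():
            continue  # empty nodes such as "3.1", or malformed IDs
        if int(tok_id) <= skip_until:
            continue
        pieces.append(form)
    return unicodedata.normalize("NFC", "".join(pieces))


def missing_counts(reference: str, conllu: str, syllables):
    """Map each syllable to how many more times it occurs in `reference`."""
    ref = unicodedata.normalize("NFC", reference)
    surf = surface_text(conllu)
    report = {}
    for syl in syllables:
        syl = unicodedata.normalize("NFC", syl)
        deficit = ref.count(syl) - surf.count(syl)
        if deficit > 0:
            report[syl] = deficit
    return report
```

Calling `missing_counts(unicode_text, conllu_text, ["གཤེགསོ", "གསུངསོ", "མོལ་"])` would return only the syllables that are still under-represented in the CoNLL-U file, with the size of the gap. It does not say where the gap is, so the references in the list above are still needed to locate each one.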

FChrispz commented 3 years ago

@heacu All of the above syllables have been restored in the CoNLL-U file.