hindawiai / hindawi2021

Bootstrapping Hindawi Programming System

Where do I find acii2cf file? #1

Open djinn opened 3 years ago

djinn commented 3 years ago

The Guru Makefile requires acii2cf, but I can't find this file referenced anywhere except in the hindawi repo. Where does it live?

obonac commented 3 years ago

@djinn Thank you for reaching out :)

https://github.com/hindawiai/hindawi2021/blob/master/Romenagri/acii2cf.lex This is the Lex source for the Romenagri CF in this repo.

We are currently developing in Chintamani (Telugu target), which also serves as a surrogate for other languages. The Telugu branch is getting all the commits right now: https://github.com/hindawiai/chintamani/tree/telugu

There have been core changes to Romenagri in Chintamani to bring it as close to IPA (International Phonetic Alphabet) as possible. The acii2cf sources and many front-end filters for Perso-Arabic scripts are available there.

The script covering all Indic scripts supported by the ISCII standard is there too. Round trip should work, but I have yet to test it.

A lot of housekeeping is needed to merge all of this back! I will open a reference issue in Chintamani.

Here's my current target workflow (on a Chintamani clone, Telugu branch, ./Romenagri dir):

printf "یہ ہائی اسکول کے طلبا کو تربیت دیتا ہے" | . ./fltr_ar_pra | . ./fltr_ar_prb | ./fltr_urhi | iconv -tutf16 | uni2acii | acii2cf | tr '^' '' | rmn2acii | acii2uni | iconv -futf16
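Since the workflow above chains several tools built from the Romenagri directory, it can help to check which ones are present before running it. This is only a minimal sketch, not something from the repo: the helper name `missing_tools` is hypothetical, and it assumes each tool is either a file in the current directory or a command on PATH. It checks presence only, not whether the tool works.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Chintamani repo): print every name
# in the arguments that is neither a regular file in the current directory
# nor a command found on PATH. Returns non-zero if anything is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        # A file in ./ counts even if it is only sourced (". ./name").
        if [ -f "./$tool" ] || command -v "$tool" >/dev/null 2>&1; then
            continue
        fi
        printf '%s\n' "$tool"
        rc=1
    done
    return "$rc"
}

# Usage, with the tool names taken from the workflow above:
# missing_tools fltr_ar_pra fltr_ar_prb fltr_urhi iconv uni2acii acii2cf rmn2acii acii2uni
```

Any name it prints is a tool that still has to be built, or put on PATH, before the pipeline can run end to end.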

obonac commented 3 years ago

Tracking at https://github.com/hindawiai/chintamani/issues/1

obonac commented 3 years ago

Link to the installation cell in the notebook (opens in Colab): https://colab.research.google.com/github/hindawiai/chintamani/blob/master/Notebooks/%E0%A4%B9%E0%A4%BF%E0%A4%82%E0%A4%A6%E0%A4%B5%E0%A5%80_2021_7_%E0%A4%B8%E0%A5%8D%E0%A4%AE%E0%A4%B0%E0%A4%A3_%E0%A4%AA%E0%A5%81%E0%A4%B8%E0%A5%8D%E0%A4%A4%E0%A4%BF%E0%A4%95%E0%A4%BE.ipynb#scrollTo=fEf5G-CzMb3M

Notebook source: https://github.com/hindawiai/chintamani/blob/master/Notebooks/%E0%A4%B9%E0%A4%BF%E0%A4%82%E0%A4%A6%E0%A4%B5%E0%A5%80_2021_7_%E0%A4%B8%E0%A5%8D%E0%A4%AE%E0%A4%B0%E0%A4%A3_%E0%A4%AA%E0%A5%81%E0%A4%B8%E0%A5%8D%E0%A4%A4%E0%A4%BF%E0%A4%95%E0%A4%BE.ipynb

obonac commented 3 years ago

Chintamani has binaries checked in; please recompile. (The bin files will be removed in the next commit.)

https://github.com/hindawiai/chintamani/blob/telugu/Romenagri/acii2cf.lex

If you are compiling by hand, then the Romenagri lib will need to be built first.

There are test corpus files for other scripts, e.g. cat corp_pa.txt | ./flatten_uni_dev

Our first target is to round-trip through Devanagari. That works just fine, except that there are phonetic shifts between languages. For example, Abhishek in Bangla is pronounced more like Obhishek. Our objective is to stay as close to phonetic fidelity as feasible. That will help other components for TTS and speech as we get to the AI layers.
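A round-trip target like this can be checked mechanically by comparing the input with the result of a forward conversion followed by its reverse. The sketch below is an assumption about how such a check might look, not the project's test harness: `roundtrip_ok` is a hypothetical helper, and the iconv commands are stand-ins for the real Romenagri forward and reverse pipelines.

```shell
#!/bin/sh
# Hypothetical helper: pipe TEXT through a forward command and then a
# reverse command, and succeed only if the result equals the input.
# Trailing newlines are dropped by the command substitution, so TEXT
# should not end in a newline.
roundtrip_ok() {
    text=$1
    forward=$2
    reverse=$3
    result=$(printf '%s' "$text" | sh -c "$forward" | sh -c "$reverse")
    [ "$result" = "$text" ]
}

# Stand-in example using only iconv: UTF-8 -> UTF-16 -> UTF-8.
# With the real tools, forward and reverse would be the conversion
# pipelines into and out of Romenagri.
if roundtrip_ok 'अभिषेक' 'iconv -f UTF-8 -t UTF-16' 'iconv -f UTF-16 -t UTF-8'; then
    echo "round trip OK"
else
    echo "round trip changed the text"
fi
```

A mapping that loses information in either direction fails this check. It only compares text, so a phonetic shift such as Abhishek versus Obhishek would still need a separate, language-aware comparison.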