Open djinn opened 3 years ago
@djinn Thank you for reaching out :)
https://github.com/hindawiai/hindawi2021/blob/master/Romenagri/acii2cf.lex This is the Lex source for the Romenagri CF in this repo.
We are currently developing in Chintamani (Telugu target), which is also serving as a surrogate for other languages. The Telugu branch is getting all the commits right now: https://github.com/hindawiai/chintamani/tree/telugu
There have been core changes to Romenagri in Chintamani to bring it as close to IPA (International Phonetic Alphabet) as possible. The acii2cf sources and many front-end filters for Perso-Arabic scripts are available there.
The script for all Indic scripts supported in the ISCII standard is there. Round trip should work, but it is yet to be tested.
A lot of housekeeping is needed to merge all of this back! I will mark a reference issue in Chintamani.
Here's my current target workflow (run from the ./Romenagri dir of a Chintamani clone on the Telugu branch):
printf "یہ ہائی اسکول کے طلبا کو تربیت دیتا ہے" | . ./fltr_ar_pra | . ./fltr_ar_prb | ./fltr_urhi | iconv -tutf16 | uni2acii | acii2cf | tr '^' '' | rmn2acii | acii2uni | iconv -futf16
Tracking at https://github.com/hindawiai/chintamani/issues/1
Link to the installation cell in the Notebook (opens in Colab): https://colab.research.google.com/github/hindawiai/chintamani/blob/master/Notebooks/%E0%A4%B9%E0%A4%BF%E0%A4%82%E0%A4%A6%E0%A4%B5%E0%A5%80_2021_7_%E0%A4%B8%E0%A5%8D%E0%A4%AE%E0%A4%B0%E0%A4%A3_%E0%A4%AA%E0%A5%81%E0%A4%B8%E0%A5%8D%E0%A4%A4%E0%A4%BF%E0%A4%95%E0%A4%BE.ipynb#scrollTo=fEf5G-CzMb3M
Notebook source: https://github.com/hindawiai/chintamani/blob/master/Notebooks/%E0%A4%B9%E0%A4%BF%E0%A4%82%E0%A4%A6%E0%A4%B5%E0%A5%80_2021_7_%E0%A4%B8%E0%A5%8D%E0%A4%AE%E0%A4%B0%E0%A4%A3_%E0%A4%AA%E0%A5%81%E0%A4%B8%E0%A5%8D%E0%A4%A4%E0%A4%BF%E0%A4%95%E0%A4%BE.ipynb
Chintamani has binaries checked in; please recompile. (The bin files will be removed in the next commit.)
https://github.com/hindawiai/chintamani/blob/telugu/Romenagri/acii2cf.lex
If you are compiling by hand, then the Romenagri lib will need to be built first.
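After a hand build, it may help to confirm that the shell can actually find the converters before running a pipeline. The following is only a minimal sketch: `missing_tools` is a hypothetical helper name, not something shipped in the repo, and it assumes the tools are expected on PATH.

```shell
#!/bin/sh
# missing_tools TOOL...: print the name of every listed tool that the shell
# cannot find. Returns 0 only when nothing is missing.
# (Hypothetical helper for illustration; not part of Chintamani.)
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            status=1
        fi
    done
    return "$status"
}
```

For example, `missing_tools uni2acii acii2cf rmn2acii acii2uni iconv` would list whichever of the tools used in this thread still need to be built or installed.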
There are test corpus files for other scripts, e.g. cat corp_pa.txt | ./flatten_uni_dev
Our first target is to round-trip through Devanagari. That works just fine, except that phonetically there are shifts between languages: for example, Abhishek in Bangla is pronounced more like Obhishek. Our objective is to stay as close to phonetic fidelity as feasible. That will help the other components for TTS and speech as we get to the AI layers.
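A round-trip check like this could be scripted roughly as follows. This is a sketch under assumptions: `round_trip_ok` is a hypothetical helper, and it treats the round trip as successful only when the output is byte-identical to the input (ignoring trailing newlines), which may be stricter than what phonetic fidelity requires.

```shell
#!/bin/sh
# round_trip_ok FORWARD REVERSE: read text on stdin, pipe it through the
# FORWARD command string and then the REVERSE command string, and succeed
# only if the final text equals the original input.
# (Hypothetical helper for illustration; not part of Chintamani.)
round_trip_ok() {
    forward=$1
    reverse=$2
    # Command substitution drops trailing newlines on both sides,
    # so the comparison ignores them.
    input=$(cat)
    output=$(printf '%s' "$input" | sh -c "$forward" | sh -c "$reverse")
    [ "$input" = "$output" ]
}

# Stand-in demo with ordinary tools; with the converters built, the two
# arguments could instead be something like
#   'iconv -tutf16 | uni2acii | acii2cf'  and  'rmn2acii | acii2uni | iconv -futf16'
printf 'abc\n' | round_trip_ok 'tr a-z A-Z' 'tr A-Z a-z' && echo "round trip ok"
```

The intermediate data only travels through pipes, so the UTF-16 stage never has to be stored in a shell variable; only the UTF-8 input and final output are compared.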
If you look at the Guru Makefile, acii2cf is required. I am not able to find this file referenced anywhere except the hindawi repo. Where does this file exist?