indic-transliteration / indic_transliteration_py

Python package for indic script transliteration
MIT License

Add python native font conversion support #38

Open codito opened 4 years ago

codito commented 4 years ago

Here's a script that attempts to convert text in Surekh font to Unicode: https://gist.github.com/codito/cb31ba37b0a4e5a77dc03c84a3ebc50d

Underlying library is here: https://github.com/sushant354/indic2unicode

Tasks

Sanskrit programmers email thread for reference: https://groups.google.com/forum/#!msg/sanskrit-programmers/erYjhaqAciQ/Yha8ho6QAQAJ

codito commented 4 years ago

This code can be imported into the https://github.com/sanskrit-coders/indic_transliteration/tree/master/indic_transliteration/font_converter module.

vvasuki commented 4 years ago

Looks like indic2unicode is now available for Python 3. I've raised an issue in sudhAnt's repo.

Regarding the accent recognition issue:
The string ÊSÉSÉjÉÉè VÉMÉMÉiÉÉä should be converted to चि॒त्रौ जग॑तो, rather than to चिचत्रौ जगगतो. So, essentially, ÉS and ÉM should be replaced with the svaras ॒ and ॑ respectively, before the conversion is run.

vvasuki commented 3 years ago

This naive technique did not work:

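  def convert_handling_svaras(self, text):
    # Swap the font's accent sequences for Unicode svaras before converting.
    text = regex.sub("ÉS", "॒", text)
    text = regex.sub("ÉM", "॑", text)
    out_text = self.convert(text=text)
    # Move a svara that precedes a vowel sign, virama or similar mark to after it.
    out_text = regex.sub("([॒॑])([ा-ॏऀ-ः])", "\\2\\1", out_text)
    return out_text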

Yielded स्ा॑मैगक्षिष्योचध्ध्वम्ा॑हस स्ा॑मैगक्षिष्योचध्ध्वम्ा॑हसआदिचत्येन्ा॑
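
One possible reason, offered as a guess rather than a diagnosis: the two substitutions work on raw substrings, so they can match across what the converter seems to read as consonant-plus-É pairs (the unaccented output चिचत्रौ जगगतो suggests SÉ and MÉ are read as units). The sketch below repeats only the two pre-conversion substitutions on the sample string from this thread, using the standard `re` module. The helper names `nfc`, `naive_presubstitute` and `match_spans` are made up for illustration, and nothing here calls the actual converter.

```python
import re
import unicodedata


def nfc(s):
    """Normalize to NFC so accented Latin letters are single code points."""
    return unicodedata.normalize("NFC", s)


# U+0952 DEVANAGARI STRESS SIGN ANUDATTA, U+0951 DEVANAGARI STRESS SIGN UDATTA.
ANUDATTA = "\u0952"
UDATTA = "\u0951"

# Sample font-encoded string quoted earlier in this thread.
SAMPLE = nfc("ÊSÉSÉjÉÉè VÉMÉMÉiÉÉä")


def naive_presubstitute(text):
    """Apply only the two pre-conversion substitutions from the attempt above."""
    text = re.sub(nfc("ÉS"), ANUDATTA, text)
    text = re.sub(nfc("ÉM"), UDATTA, text)
    return text


def match_spans(pattern, text):
    """Return (start, end) index pairs for every match of pattern in text."""
    return [m.span() for m in re.finditer(nfc(pattern), text)]


if __name__ == "__main__":
    print(match_spans("ÉS", SAMPLE))
    print(match_spans("ÉM", SAMPLE))
    print(naive_presubstitute(SAMPLE))
```

On the sample, ÉS matches once, starting at the É that directly follows the first S, and ÉM matches twice in the second word, so both M occurrences are replaced. Each time, an É ends up separated from the consonant it followed. That may be where the stray ा in the output comes from, but confirming it would need a look at the converter's glyph table.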