microsoft / bistring

Bidirectionally transformed strings
MIT License

Transliterate #19

Open christian-storm opened 5 years ago

christian-storm commented 5 years ago

I was hoping you might advise me on how to incorporate transliteration into a text transformation pipeline.

Let's say I want to use a 3rd party library like from unidecode import unidecode. I could create a bistring with new_bistr = bistr(text.modified, unidecode(text.modified)) but I would lose all the previous operations.

Is there a way to fold in a modified string that is calculated outside bistring's capabilities?

tavianator commented 5 years ago

In general no. You could use something like bistr.infer(text, unidecode(text)) to have it guess.

In your case, you could do a little better since the transliteration process probably operates character-by-character. Something like

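from bistring import BistrBuilder, CharacterTokenizer
from unidecode import unidecode

tokenizer = CharacterTokenizer('und')  # or 'en-US', etc.
builder = BistrBuilder(text)
for token in tokenizer.tokenize(text):
    # Replace each character with its transliteration, preserving alignment
    builder.replace(token.end - token.start, unidecode(token.modified))
text = builder.build()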
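
To make the character-by-character idea concrete without depending on bistring or unidecode, here is a minimal standard-library sketch. It is not bistring's API: `fold_char` is a hypothetical stand-in for unidecode (it only applies NFKD and drops combining marks), and the alignment is a plain list of `(orig_start, orig_end, mod_start, mod_end)` tuples, one per original character.

```python
import unicodedata
from typing import List, Tuple

Span = Tuple[int, int, int, int]  # (orig_start, orig_end, mod_start, mod_end)


def fold_char(ch: str) -> str:
    """Hypothetical stand-in for unidecode: NFKD-decompose one character
    and drop any combining marks. Far less complete than unidecode."""
    decomposed = unicodedata.normalize("NFKD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def transliterate_with_alignment(text: str) -> Tuple[str, List[Span]]:
    """Transliterate one character at a time and record, for each original
    character, which span of the output it produced."""
    pieces: List[str] = []
    alignment: List[Span] = []
    mod_pos = 0
    for i, ch in enumerate(text):
        repl = fold_char(ch)
        pieces.append(repl)
        alignment.append((i, i + 1, mod_pos, mod_pos + len(repl)))
        mod_pos += len(repl)
    return "".join(pieces), alignment


# U+FB01 is the "fi" ligature, U+00E9 is a precomposed e-acute.
modified, alignment = transliterate_with_alignment("\ufb01anc\u00e9")
```

Because each replacement is recorded next to the character it came from, the mapping stays exact even when a single character expands to several, which is the property the builder loop above relies on.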

By the way, it's on my backlog to implement support for ICU's Transliterator API which is more powerful than unidecode and similar things.

tavianator commented 5 years ago

So since https://github.com/ovalhub/pyicu/issues/107 was implemented, I've tested out an implementation that wraps a bistr in a Replaceable for ICU. It works well for simple transliterations like Latin-ASCII, but for complicated ones like Greek-Latin ICU does some strange things that I'm not sure how to cope with nicely:

('Ὀδυσσεύς' ⇋ 'Ὀδυσσεύς')
('Ὀδυσσεύς' ⇋ 'Ὀδυσσεύς')
('Ὀδυσσεύς' ⇋ 'Ὀδυσσεύς\uffff')
('Ὀδυσσεύς' ⇋ 'Ὀδυσσεύς\uffffO')
('Ὀδυσσεύς' ⇋ 'OὈδυσσεύς\uffffO')
('Ὀδυσσεύς' ⇋ 'OὈδυσσεύς')
('Ὀδυσσεύς' ⇋ 'O̓δυσσεύς')
('Ὀδυσσεύς' ⇋ 'O̓δυσσεύςO')
('Ὀδυσσεύς' ⇋ 'O̓δυσσεύςO')
('Ὀδυσσεύς' ⇋ 'O̓δυσσεύς')
('Ὀδυσσεύς' ⇋ 'Oδυσσεύς')
('Ὀδυσσεύς' ⇋ 'OδυσσεύςO')
('Ὀδυσσεύς' ⇋ 'OδυσσεύςOd')
('Ὀδυσσεύς' ⇋ 'OdδυσσεύςOd')
('Ὀδυσσεύς' ⇋ 'Odδυσσεύς')
('Ὀδυσσεύς' ⇋ 'Odυσσεύς')
('Ὀδυσσεύς' ⇋ 'Odυσσεύςd')
('Ὀδυσσεύς' ⇋ 'Odυσσεύςdy')
('Ὀδυσσεύς' ⇋ 'Odyυσσεύςdy')
('Ὀδυσσεύς' ⇋ 'Odyυσσεύς')
('Ὀδυσσεύς' ⇋ 'Odyσσεύς')
('Ὀδυσσεύς' ⇋ 'Odyσσεύςy')
('Ὀδυσσεύς' ⇋ 'Odyσσεύςys')
('Ὀδυσσεύς' ⇋ 'Odysσσεύςys')
('Ὀδυσσεύς' ⇋ 'Odysσσεύς')
('Ὀδυσσεύς' ⇋ 'Odysσεύς')
('Ὀδυσσεύς' ⇋ 'Odyssεύς')
('Ὀδυσσεύς' ⇋ 'Odyssεύςs')
('Ὀδυσσεύς' ⇋ 'Odyssεύςse')
('Ὀδυσσεύς' ⇋ 'Odysseεύςse')
('Ὀδυσσεύς' ⇋ 'Odysseεύς')
('Ὀδυσσεύς' ⇋ 'Odysseύς')
('Ὀδυσσεύς' ⇋ 'Odysseύςe')
('Ὀδυσσεύς' ⇋ 'Odysseύςeu')
('Ὀδυσσεύς' ⇋ 'Odysseuύςeu')
('Ὀδυσσεύς' ⇋ 'Odysseuύς')
('Ὀδυσσεύς' ⇋ 'Odysseúς')
('Ὀδυσσεύς' ⇋ 'Odysseúς́')
('Ὀδυσσεύς' ⇋ 'Odysseúς́s')
('Ὀδυσσεύς' ⇋ 'Odysseúsς́s')
('Ὀδυσσεύς' ⇋ 'Odysseúsς')
('Ὀδυσσεύς' ⇋ 'Odysseús')
('Ὀδυσσεύς' ⇋ 'Odysseús')
('Ὀδυσσεύς' ⇋ 'Odysseús')

christian-storm commented 5 years ago

Thank you for the great info and tips. Agreed that transliteration doesn't always make sense to do, e.g., your example.

I realize now why I didn't think to do it the way you mentioned. I had it in my mind that bistr keeps track of each operation's output instead of always overriding modified, i.e., modified is a list so one could roll back to a certain state. I had built this into my own version of this. The use case being that I could see which operation caused the string transformation train to derail.

tavianator commented 5 years ago

Ah I see, but that would be polystring, not bistring :). More seriously, I am considering adding a data type that would retain an entire history of transformations, rather than just the initial and final states. The Emacs region-specific undo buffer stuff seems to have that, for example, but I'm not sure what encoding they use. I imagine it's a persistent stack of ropes or something.
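
For illustration, the history-keeping idea discussed in this exchange could look roughly like the sketch below. Everything here is hypothetical: `TransformHistory` is not part of bistring, it stores a plain list of full string snapshots rather than anything like persistent ropes, and it does not track character alignments at all.

```python
from typing import Callable, List, Tuple


class TransformHistory:
    """Hypothetical helper that keeps every intermediate string,
    not just the initial and final states."""

    def __init__(self, original: str) -> None:
        # Each entry is (operation name, resulting string).
        self.states: List[Tuple[str, str]] = [("original", original)]

    @property
    def current(self) -> str:
        return self.states[-1][1]

    def apply(self, name: str, func: Callable[[str], str]) -> "TransformHistory":
        """Run one named operation on the current string and record the result."""
        self.states.append((name, func(self.current)))
        return self

    def rollback(self, steps: int = 1) -> str:
        """Discard the last `steps` operations; the original is never dropped."""
        del self.states[max(1, len(self.states) - steps):]
        return self.current


history = TransformHistory("  Hello, World!  ")
history.apply("strip", str.strip)
history.apply("lower", str.lower)
history.apply("drop_commas", lambda s: s.replace(",", ""))

# Inspect each step to find where a pipeline went wrong.
for name, value in history.states:
    print(f"{name:12} {value!r}")
```

Storing whole snapshots is simple but costs memory proportional to the number of steps times the string length, which is presumably why a more compact encoding would be needed for a real data type.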