kriskowal / tengwarjs

A Tengwar (J.R.R. Tolkien’s Elvish alphabet) transcriber for ES5 and HTML5
http://tengwar.3rin.gs
MIT License

Philosophy for mode for general use #31

Closed · kriskowal closed this issue 4 years ago

kriskowal commented 4 years ago

In the current incarnation of the mode for general use, it is relatively unlikely that an untrained user will produce a proper result.

There are two general approaches for transcribing English, phonemic and orthographic.

Given that the input is proper English orthography, it would be reasonable to expect a computer to faithfully produce a proper orthographic output. A phonemic output, by contrast, is out of reach short of providing an X-SAMPA input mode or grafting a complete phonemic dictionary of English onto the editor. That avenue is possible, but I don’t have the resources.

However, I’ve chosen in general to prefer phonetic input. Since a program does not have a prayer of guessing the phonemic interpretation from English orthography, I’ve generally trained the mode to use Tolkien’s own syllabary to express phonemes, so the mode is 100% reliable for Sindarin and apt to fall short for English. Consequently, the Sindarin output is the default, and various numbers of ticks force it to choose alternatives.

There are of course exceptions, where some well-known words receive better treatment and, for English, a final E is assumed to be silent.

So, I’d like to open a discussion. If we strive to get every English word right, dancing as Tolkien did between orthographic and phonemic to suit his taste, we will need a rather large dictionary or a very large number of heuristics. Neither will entirely free the author from the obligation of being able to read and verify the soundness of the output.

On the other hand, we can elect to continue using a simplified, Sindarin-like syllabary, document it rigorously, bring it to the foreground in the editor, and try to guide the user with an explanation of the output and of the alternatives available at various points.

Either way, there’s a great deal of work between what we have and what we would all likely want.

dreamingfifi commented 4 years ago

You have said quite eloquently what the problem is with trying to have a computer transliterate a language with a writing system as old as English's, one that hasn't been reformed in 800 years and has been vacuuming up vocabulary, taking the donor languages' spelling systems along with it. English spelling is so irregular that the computer will always make mistakes.

We could make up a dictionary with the millions of English words and their proper transliterations, but it would take a decade and hundreds of people helping. It's just not feasible, not without hefty amounts of money, at least.

We can make it more sophisticated by adding things like -sure, -sual, and -sion detection. Maybe add words ending in -the (which most likely indicates a voiced dental fricative) and have it look for soft Cs by checking for a following E or I. The point is that there are things we can do to improve this, to make it a touch more sophisticated than it is, but that's not much more than the approach we already have.
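Roughly what I have in mind, as a pre-pass before the parser proper. A sketch only; the phoneme labels here are illustrative stand-ins, not whatever names the mode uses internally:

```js
// Sketch of a suffix pre-pass. The phoneme labels ("zh", "sh", "dh",
// "s") are illustrative stand-ins, not the mode's internal names.
var suffixHeuristics = [
    {pattern: /sure$/, phoneme: "zh"}, // measure, pleasure: S is /ʒ/
    {pattern: /sual$/, phoneme: "zh"}, // visual, sensual: S is /ʒ/
    {pattern: /sion$/, phoneme: "sh"}, // mission /ʃ/, though vision is /ʒ/
    {pattern: /the$/, phoneme: "dh"}   // bathe, soothe: TH is voiced /ð/
];

function guessSuffixPhoneme(word) {
    for (var i = 0; i < suffixHeuristics.length; i++) {
        if (suffixHeuristics[i].pattern.test(word)) {
            return suffixHeuristics[i].phoneme;
        }
    }
    // Soft C: a C before E or I is usually /s/, as in "cell" and "city"
    if (/c[ei]/.test(word)) {
        return "s";
    }
    return null;
}
```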

So... to make it completely accurate: not possible without mountains of cash and many years of work. We'll just have to accept that there are 4 different CHs distinguished in Tolkien's orthographic tengwar that the computer can't tell apart, unless the word starts with CHR, which I think is always pronounced KR (and would be written "quesse-extended;romen").
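The CHR case at least is mechanical. A minimal sketch; the tengwa names are the ones from my comment above, and wiring the result into the mode's parser is assumed, not shown:

```js
// Word-initial CHR is reliably /kr/ (chrome, chronic), so it can be
// special-cased without a dictionary.
function transcribeInitialChr(word) {
    if (/^chr/i.test(word)) {
        return ["quesse-extended", "romen"];
    }
    return null;
}
```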

Anyways, I've gotten side-tracked.

We'll just have to be comfortable with the fact that what we make will be far from perfect. But I think there are ways we could at least make it better.

kriskowal commented 4 years ago

Okay. I propose, as a matter of principle, that we attempt to write tests for both the rule and the exception, so we can see side by side the most common cases where back-ticks will be necessary to adjust the behavior. The output will be correct more often, but we should also make it clearer that this is a tool to assist Elvish experts, unlikely to produce a proper result casually.
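A sketch of the shape I have in mind, pairing each rule with its exception so the back-tick cases sit side by side. The transcribe entry point, the back-tick placement, and the expected values are all assumptions and placeholders here, not the mode’s real behavior:

```js
var cases = [
    // rule: a final E is assumed silent in English
    {input: "cape", expected: "<tengwa names for silent final E>"},
    // exception: a back-tick forces the alternative reading
    {input: "cape`", expected: "<tengwa names for sounded final E>"}
];

cases.forEach(function (testCase) {
    var actual = transcribe(testCase.input); // assumed entry point
    if (actual !== testCase.expected) {
        console.error("FAIL", JSON.stringify(testCase.input), actual);
    }
});
```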

Adding pattern matching for suffixes is possible. As a linguist, I’m sure you’re familiar with the Chomsky hierarchy. The kind of parser I’ve written is equipped for arbitrary backtracking and look-ahead, but it is not pretty and will perform quite poorly. I hope you will forgive me if I don’t spend my weekends on that particular issue; hopefully we can lure another engineer onto the project. I don’t expect you or Halfdan would be able to carry that forward, but the patterns are all there in the code. It involves walking another indentation level for every character of look-ahead, and it involves detecting end-of-word, as is done to distinguish the medial S hook from the final one.
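For reference, a condensed illustration of the style, not the actual mode code: a parser here is a function of one character that returns the next parser, so each character of look-ahead costs one more level of nesting, and end-of-word arrives as an empty string:

```js
// Illustrative only. Each character of look-ahead defers the decision
// into the function that handles the next character; end of word is
// signaled with "", which is how the medial S hook is told apart from
// the final one.
function makeParser(output) {
    return function parse(character) {
        if (character === "s") {
            // defer: we need one character of look-ahead
            return function (next) {
                output.push(next === "" ? "final-s" : "medial-s");
                return next === "" ? parse : parse(next);
            };
        }
        if (character !== "") {
            output.push(character);
        }
        return parse;
    };
}

var output = [];
var parse = makeParser(output);
"as".split("").concat([""]).forEach(function (character) {
    parse = parse(character);
});
// output is now ["a", "final-s"]
```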

The transcriber is also able to take a hint about the input language. We should be careful not to apply any of these English rules to Sindarin in the same mode.
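Concretely, that could be as simple as gating the heuristics on the hint, something like the sketch below; the "language" option name is an assumption about the options object:

```js
// Gate the English-only heuristics on the language hint so they can
// never fire on Sindarin input in the same mode. guessSuffixPhoneme
// and transcribeInitialChr are the sketches from earlier in the thread.
function chooseHeuristics(options) {
    if (options && options.language === "english") {
        return [guessSuffixPhoneme, transcribeInitialChr];
    }
    return []; // Sindarin, the default, takes the unmodified path
}
```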