interscript / interscript-ruby

Interoperable script conversion systems (ISCS) with the `interscript` gem

Convert systems into new language #691

Closed ronaldtse closed 3 years ago

ronaldtse commented 3 years ago

We have created a new transliteration system definition language at https://github.com/interscript/lcs.

It is meant to be an implementation-independent language for defining a transliteration system. It is based on an explicit information model, so the systems defined using it are reproducible.

This language is now ready for feedback and migration from the interscript YAML definitions.

I would like to ping all the map authors for a review to have issues raised before migration. Thanks!

antonsviridenko commented 3 years ago

I looked through the existing documentation but was unable to find instructions on how to make test runs, i.e. what to run, and how, to convert source text into the destination script. Is it a Ruby program?

AhMohsen46 commented 3 years ago

I want to try running this for an Arabic map. I can't say I understood it 100%, so I will re-examine it tomorrow. I am wondering whether this will work with the transliteration we do using lookbehind/lookahead regexes, and whether it will also handle collisions between letters: different non-Latin letters mapping to the exact same Latin letter, and the same non-Latin letter mapping to different Latin letters in different circumstances.

webdev778 commented 3 years ago

please check this documentation

webdev778 commented 3 years ago

> I looked through the existing documentation but was unable to find instructions on how to make test runs, i.e. what to run, and how, to convert source text into the destination script. Is it a Ruby program?

Right, this is missing.

webdev778 commented 3 years ago

> I want to try running this for an Arabic map. I can't say I understood it 100%, so I will re-examine it tomorrow. I am wondering whether this will work with the transliteration we do using lookbehind/lookahead regexes, and whether it will also handle collisions between letters: different non-Latin letters mapping to the exact same Latin letter, and the same non-Latin letter mapping to different Latin letters in different circumstances.

Yes. The syntax is sub "from", "to", before: "lookbehind", after: "lookahead", which would compile to .gsub(/(?<=lookbehind)from(?=lookahead)/, "to").
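To illustrate how a context-sensitive rule of this kind can map onto a plain Ruby regexp, here is a minimal sketch. The helper name `contextual_sub` is hypothetical and is not part of the interscript API; the code the real compiler emits may differ.

```ruby
# Hypothetical helper (not the interscript API): replaces `from` with `to`
# only when preceded by `before` and followed by `after`, when those are given.
def contextual_sub(text, from, to, before: nil, after: nil)
  pattern = Regexp.escape(from)
  # (?<=...) is a lookbehind: the context must precede the match.
  pattern = "(?<=#{Regexp.escape(before)})#{pattern}" if before
  # (?=...) is a lookahead: the context must follow the match.
  pattern = "#{pattern}(?=#{Regexp.escape(after)})" if after
  text.gsub(Regexp.new(pattern), to)
end

# Only the "a" between "x" and "y" is replaced.
puts contextual_sub("xay aay", "a", "A", before: "x", after: "y") # prints "xAy aay"
```

Because the lookbehind and lookahead are zero-width, the context characters themselves are never consumed or replaced.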

ronaldtse commented 3 years ago

> I looked through the existing documentation but was unable to find instructions on how to make test runs, i.e. what to run, and how, to convert source text into the destination script. Is it a Ruby program?

We have a Ruby interpreter and @webdev778 is working on a JavaScript interpreter. @webdev778 can you provide documentation? Thanks.

webdev778 commented 3 years ago

To be clear, we have a Ruby interpreter and a Ruby compiler, which differ in one small way: the interpreter executes the map directly, while the compiler compiles the map to Ruby code. This means the interpreter executes maps more slowly but "builds" them faster, whereas the compiler executes maps about 2x faster but takes longer to build them. The difference is not big; only those who use this library in their own programs need to understand it, and they can pick the more suitable implementation.
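The trade-off can be sketched with a toy example. This is not the interscript implementation: `TOY_RULES`, `interpret`, and `compile` are hypothetical names, and the rule format is invented for illustration.

```ruby
# Toy rule list (hypothetical format): pairs of [from, to].
TOY_RULES = [["a", "1"], ["b", "2"]].freeze

# "Interpreter": walks the rule list on every call. Nothing to build up front.
def interpret(rules, text)
  rules.reduce(text) { |acc, (from, to)| acc.gsub(from, to) }
end

# "Compiler": generates Ruby source once and evaluates it into a lambda.
# The build step costs time, but each later call runs straight-line code.
def compile(rules)
  body = rules.map { |from, to| ".gsub(#{from.inspect}, #{to.inspect})" }.join
  eval("->(text) { text#{body} }")
end

compiled = compile(TOY_RULES)
puts interpret(TOY_RULES, "abc") # prints "12c"
puts compiled.call("abc")        # prints "12c"
```

Both paths produce the same output; they differ only in where the work happens.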

About documentation: I will try to write it later today, but the old API is mostly retained. The same commands and APIs that work with the current interscript version should work here as well.

webdev778 commented 3 years ago

we have documentation ready. please consult the readme:

https://github.com/interscript/lcs/blob/main/README.adoc

The maps referenced in this readme may not be ready yet. Only these maps have been ported so far:

https://github.com/interscript/lcs/tree/main/maps/maps

chaaklau commented 3 years ago

I am looking at the Korean maps. I managed to run these maps on my machine: var-kor-Hang-Hang-jamo.imp, var-kor-Kore-Hang-2013.imp

These maps return an error due to a missing dependency var-kor: moct-kor-Hang-Latn-2000.imp, iso-kor-Hang-Latn-1996-method1.imp, bgnpcgn-kor-Kore-Latn-rok-2011.imp


The new sub syntax and parallel blocks look fine, but there is one potential issue that I need to bring to your attention. There are three sections in the old maps: rule (regex substitutions which are run before map), map (simple "character" string-to-string mapping), postrule (regex substitutions which are run after map). It may seem that substitutions are run in parallel within each of these sections, but in fact they are run in a specific order. The rules are sorted by length (longest first), i.e. long strings will run first.

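    def build_hashes
      # Sort both tables longest key first, so multi-character sources
      # are matched before their shorter prefixes.
      @characters_hash = characters&.sort_by { |k, _v| k.size }&.reverse&.to_h
      @dictionary_hash = dictionary&.sort_by { |k, _v| k.size }&.reverse&.to_h
    end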

Japanese and Korean-Hanja maps rely on this logic to work. In terms of the latest syntax, sub "ちゃ", "cha" needs to run before sub "ち", "chi", yet the old maps do not specify this order explicitly. The new syntax can handle this with parallel, but one needs to be aware of the issue and should not put everything into the same parallel block.
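The ordering problem can be reproduced in plain Ruby. This is a minimal sketch with hypothetical names (`KANA_RULES`, `apply_in_order`, `longest_first`); it mimics sequential substitution and is not the interscript implementation.

```ruby
# Hypothetical rules, deliberately listed shortest first.
KANA_RULES = { "ち" => "chi", "ちゃ" => "cha" }.freeze

# Applies each rule in turn, in the order given.
def apply_in_order(text, rules)
  rules.reduce(text) { |acc, (from, to)| acc.gsub(from, to) }
end

# Reorders rules so that longer source strings run first.
def longest_first(rules)
  rules.sort_by { |from, _to| -from.size }.to_h
end

puts apply_in_order("ちゃち", KANA_RULES)                # prints "chiゃchi"
puts apply_in_order("ちゃち", longest_first(KANA_RULES)) # prints "chachi"
```

In the first call, "ち" is consumed before "ちゃ" gets a chance to match, which leaves a stray "ゃ" behind.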

webdev778 commented 3 years ago

var-kor is a library; it's located in maps/libs/. Your code probably doesn't load it because the "interscript-maps" gem is somehow not being loaded (do you run bundle exec or Bundler.setup?). This gem is loaded in the maps/ directory of the repository, and setting it up is essential for setting the correct paths to maps and libs. Alternatively, you can add the necessary directories to the map load path (which contains "." by default): just do Interscript.load_path << "absolute/path/to/other/map/or/library/directory" (note to self: it may be good to document that). The current structure will also allow other add-on gems with custom maps to be created in the future.

Yes, you are right about parallel: it generally bypasses the regexp engine, so it doesn't have all the features (as of now, it only supports any). You are also right that this block implicitly sorts by size (the sequential one, stage, does not). If someone were to use advanced regexp features, for example before: or after:, inside parallel, it would throw an error stating precisely that the feature is not supported in parallel{}. As many of those rules/postrules need these features, the error should make it clear to the map writer that they are doing something wrong.

ronaldtse commented 3 years ago

> If someone were to use advanced regexp features, for example before: or after:, inside parallel, it would throw an error stating precisely that the feature is not supported in parallel{}.

@webdev778 can we make sure that this is described in documentation for system writers? Thanks.

webdev778 commented 3 years ago

I have changed the behavior slightly. Because regular expressions could appear in character sections, the best way was to actually allow advanced expressions in parallel{} sections. For the basic ones we can use a fast-track mode, but for the advanced ones we construct a "mega regexp". Today saw a bigger improvement, which allowed us to port a lot of Arabic maps. I will fill in the documentation shortly.
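One way such a "mega regexp" could be built is sketched below. This is an assumption about the general technique, not the actual interscript code: `Rule`, `build_mega_regexp`, and `parallel_sub` are hypothetical names. Every rule becomes one capture group in a single alternation, so all rules are tried at each position and the output of one rule is never fed into another.

```ruby
# Hypothetical rule record: source, target, optional lookbehind/lookahead context.
Rule = Struct.new(:from, :to, :before, :after)

# Joins all rules into one alternation, one capture group per rule.
def build_mega_regexp(rules)
  parts = rules.map do |rule|
    source = Regexp.escape(rule.from)
    source = "(?<=#{Regexp.escape(rule.before)})#{source}" if rule.before
    source = "#{source}(?=#{Regexp.escape(rule.after)})" if rule.after
    "(#{source})"
  end
  Regexp.new(parts.join("|"))
end

# Applies all rules in a single pass, longest source first;
# rules of equal length keep their original relative order.
def parallel_sub(text, rules)
  ordered = rules.each_with_index.sort_by { |rule, i| [-rule.from.size, i] }.map(&:first)
  regexp = build_mega_regexp(ordered)
  text.gsub(regexp) do
    # The index of the capture group that matched identifies the rule.
    index = Regexp.last_match.captures.index { |capture| !capture.nil? }
    ordered[index].to
  end
end

# Swapping works because substitutions do not chain.
puts parallel_sub("ab", [Rule.new("a", "b"), Rule.new("b", "a")]) # prints "ba"
# The same source letter maps differently depending on context.
puts parallel_sub("cec", [Rule.new("c", "s", nil, "e"), Rule.new("c", "k")]) # prints "sek"
```

The second call shows the "same letter, different output in different circumstances" case raised earlier in the thread: the context-specific rule is listed first, so it wins where its lookahead holds.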

opoudjis commented 3 years ago

https://github.com/interscript/lcs/blob/main/docs/Interscript_Map_Format.adoc

> An ascii word character (a-z, A-Z, 0-9, _)

That is... an odd constraint, given that you're doing transliteration from non-ASCII scripts; we cannot presuppose that these expressions will always work on ASCII. It would be preferable to use the Unicode properties related to word characters, such as [[:alpha:]] or [[:word:]] or \p{Alnum} or /\p{L}/. The catch of course is that you would need to do lookahead to implement boundary.
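The difference between the two definitions of a word character, and the lookaround needed to implement a boundary, can be sketched as follows. The constant and method names are hypothetical, and the explicit character classes are used here instead of `\b` and `\w` so that the behavior does not depend on engine defaults.

```ruby
# Two candidate definitions of a "word character" (names are illustrative).
ASCII_WORD = "[a-zA-Z0-9_]"
UNICODE_WORD = "[\\p{L}\\p{N}_]"

# A zero-width "start of word" boundary built from lookarounds:
# not preceded by a word character, but followed by one.
def word_start(word_class)
  Regexp.new("(?<!#{word_class})(?=#{word_class})")
end

# Mark every word start with "|".
puts "ab cd".gsub(word_start(ASCII_WORD), "|")   # prints "|ab |cd"
puts "αβ γδ".gsub(word_start(ASCII_WORD), "|")   # prints "αβ γδ" (no boundary found)
puts "αβ γδ".gsub(word_start(UNICODE_WORD), "|") # prints "|αβ |γδ"
```

With the ASCII definition, Greek text has no word boundaries at all, which is the problem for maps whose source script is not Latin.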

opoudjis commented 3 years ago

https://github.com/interscript/lcs/blob/bfffb20f25a40b3d36eb0908afe275e0759f290a/maps/maps/alalc-ell-Grek-Latn-2010.imp

I think it's a shame that the multi-line test at the start is rendered as a continuous string, and not with human-readable line wrapping.

But so long as my tests are working, I'm happy with where this has gotten to.

webdev778 commented 3 years ago

> https://github.com/interscript/lcs/blob/main/docs/Interscript_Map_Format.adoc
>
> An ascii word character (a-z, A-Z, 0-9, _)
>
> That is... an odd constraint, given that you're doing transliteration from non-ASCII scripts; we cannot presuppose that these expressions will always work on ASCII. It would be preferable to use the Unicode properties related to word characters, such as [[:alpha:]] or [[:word:]] or \p{Alnum} or /\p{L}/. The catch of course is that you would need to do lookahead to implement boundary.

I agree. We found out the hard way that boundary was implemented differently between JS (where it had \w semantics) and Ruby (where it had \p{L} semantics). We polyfilled it with an ugly and inefficient regexp which tanked performance twofold (still better than the Opal implementation).

I think that the new approach is good, because we can gradually deprecate some old behaviors and fix them all in the maps.

> https://github.com/interscript/lcs/blob/bfffb20f25a40b3d36eb0908afe275e0759f290a/maps/maps/alalc-ell-Grek-Latn-2010.imp
>
> I think it's a shame that the multi-line test at the start is rendered as a continuous string, and not with human-readable line wrapping.
>
> But so long as my tests are working, I'm happy with where this has gotten to.

I agree. We can introduce a new syntax for multiline maps in the future.

ronaldtse commented 3 years ago

This is all done now.