indic-transliteration / sanscript.js

Transliteration package for Indian scripts
MIT License
99 stars 39 forks source link

Move schemes into separate JSON files #13

Closed psvenk closed 4 years ago

psvenk commented 4 years ago

This has the benefits of separating data from code and of having a language-agnostic specification of schemes (for implementations of Sanscript in other programming languages).

This also adds a build step integrating the scheme data into the JS file — the file that now contains only the code has been moved to a subdirectory and the build step (npm run dist) outputs to sanscript.js (so this causes no breaking changes). There is also an additional development dependency of @iarna/toml.

I chose TOML over JSON and other formats because it supports comments and it can represent the scheme data in a simple and readable format without any indentation. I generated the TOML file using a script taking the initial declaration of Sanscript.schemes and serializing it using the TOML library, followed by adjusting the formatting and adding in the comments from the JS file. Having a separate TOML file actually improves JSON compatibility because the following Node.js script converts the TOML file to valid JSON (automatically stripping comments):

Code ```javascript const TOML = require("@iarna/toml"); const fs = require("fs"); const path = require("path"); const schemes = TOML.parse(fs.readFileSync( path.join(__dirname, "src", "schemes.toml") )); fs.writeFileSync( path.join(__dirname, "src", "schemes.json"), JSON.stringify(schemes) ); ```

I have marked this pull request as "draft" because there is some refactoring to be done before the TOML file is truly portable between programming languages (e.g. some encodings, such as Kolkata, are defined in code), but I wanted to get your thoughts on the idea before proceeding with that.

All tests (35 tests; 780 assertions) are passing when I open tests/index.html in my browser.

rramphal commented 4 years ago

Hey @psvenk, this is an interesting idea!

Personally, I don't think we gain much by using the TOML format over the JSON format in this specific case, but of course, it's basically the same. We do avoid adding a dependency by using JSON, but, if we're already adding a build step, adding such a well-tested dependency is arguably a non-issue.

If you're going to introduce a build step, why not go all the way and split each language out into its own file? Regardless of whether we stick with JSON or TOML, we could define a Sanscript Scheme schema using something like ajv, which is something that even the current implementation could benefit from. Given the fact that the schema is unlikely to change often, it would just need to really be done once while maintaining consistency between the different language files. If we do use TOML, it would obviously be better to write the schema for TOML, however TOLS is still in the works.

So basically:

assamese.toml ... zanbazar_square.toml > concatenate into schemes.toml > transcode into schemes.json > validate in toto > inject into sanscript.js (npm run dist) > publish (npm run publish)

-- OR --

assamese.toml ... zanbazar_square.toml > transcode into assamese.json ... zanbazar_square.json > validate separately > concatenate into schemes.json > inject into sanscript.js (npm run dist) > publish (npm run publish)

We could even have a GitHub action handle this for us.

psvenk commented 4 years ago

@rramphal Thanks for the speedy response!

Personally, I don't think we gain much by using the TOML format over the JSON format in this specific case, but of course, it's basically the same. We do avoid adding a dependency by using JSON, but, if we're already adding a build step, adding such a well-tested dependency is arguably a non-issue.

I agree that there is not much to gain in this case, especially with splitting each scheme into its own file (in which case the files consist of only keys mapping to arrays). JSON does have the benefit of being more widely used in the JS world too; maybe we could introduce a description field containing what is currently in comments (possibly along with a way for client code to display the description of a scheme).

If you're going to introduce a build step, why not go all the way and split each language out into its own file?

Thanks for pointing this out; I'll do this in a commit soon (in JSON).

Regardless of whether we stick with JSON or TOML, we could define a Sanscript Scheme schema using something like ajv, which is something that even the current implementation could benefit from. Given the fact that the schema is unlikely to change often, it would just need to really be done once while maintaining consistency between the different language files. If we do use TOML, it would obviously be better to write the schema for TOML, however TOLS is still in the works.

This is a great idea. We could also add to the schema some properties which are currently implemented as special cases in the code. The fact that a somewhat mature schema representation language is available for JSON but not for TOML leads me to believe that we should use JSON for storing data on the scripts/romanizations for consistency. I also like the idea of verifying this with CI.

vvasuki commented 4 years ago

Good - I like the general idea as well as the idea of splitting into separate json/toml files. If we're ditching toml because schema specification and validation is simpler elsewhere, might as well pick https://json5.org/ .

Action item for later:

psvenk commented 4 years ago

I have switched to JSON, split schemes.json into one file per scheme, and squashed and rebased on master. As expected, two tests are failing due to #4 (non-Sanskrit letters from Devanagari to ITRANS, and Devanagari to Gurmukhi due to a change in tests.js splitting off the combining nuqta character from dotted Gurmukhi letters).

@vvasuki Thank you very much for your response.

If we're ditching toml because schema specification and validation is simpler elsewhere, might as well pick https://json5.org/ .

I did a little bit of looking around and it seems that JSON5 is quite JavaScript-specific and that implementations of JSON Schema in other languages generally do not support JSON5, so this would mean introducing a Node.js dependency to other implementations of Sanscript in order to convert the JSON5 to JSON. So, if I understand correctly, JSON5 would be simpler to use than TOML in the case of JavaScript (because ajv supports JSON5 natively) but more difficult in other languages (because JSON5 needs to be transcoded to JSON using Node.js before validation or the validation needs to happen in Node.js).

Action item for later:

  • We could separate the schema definitions out into a separate repository which can then ben included as a submodule here and in other language specific projects.

Sounds good!

vvasuki commented 4 years ago

Ok - sent you an invite to be a maintainer. Once ready and @rramphal takes a look, merge into master and let me know so that I can update npm.

rramphal commented 4 years ago

Solid work, @psvenk!

I'd like to suggest some improvements, but they are all pretty much outside the scope of this PR. I can get to them in a couple of hours. After @psvenk merges this into master, @vvasuki, if you can hold off on publishing to npm, I'd be glad to submit a PR with some ideas for improvements if that sounds good.

rramphal commented 4 years ago

Also, I went ahead and captured some of the ideas mentioned here in #15 and #16.

psvenk commented 4 years ago

I'd like to suggest some improvements, but they are all pretty much outside the scope of this PR. I can get to them in a couple of hours.

Great. I already have a schema done, so I can open a separate PR after merging this to gather feedback about the schema. The schema was written so that the current schemes pass validation, but some changes to the scheme format and the logic would be necessary to make the schema simpler (e.g. replace singleton arrays with simple strings). Additionally, some hard-coded rules (e.g. Brahmic vs. roman and the definition of the Kolkata scheme) should be moved out of the logic and made into parameters in the schema or moved into dist.js, but that too can be done in a separate PR.

rramphal commented 4 years ago

@psvenk Awesome. I think getting a validator in to represent the current schema for now, and then following up with a PR later to move some of the logic over sounds like a good plan.

psvenk commented 4 years ago

@psvenk Awesome. I think getting a validator in to represent the current schema for now, and then following up with a PR later to move some of the logic over sounds like a good plan.

@rramphal In that case, I'll open that PR up for review and open a separate issue for moving the logic over.