open-i18n / rust-unic

UNIC: Unicode and Internationalization Crates for Rust
https://crates.io/crates/unic

Forked library; and some thoughts about whether it's worth it to keep all modules at same Unicode version #279

Open ctrlcctrlv opened 3 years ago

ctrlcctrlv commented 3 years ago

I'm working on a font editor, MFEK. I also contribute to Unicode when I can. One of my fonts requires characters in Unicode 14.0.

For those reasons, I had to fork the project. I only need blocks, categories, and names, so I called my version QD-UNIC—“quick and dirty UNIC”. https://github.com/MFEK/qd-unic.rlib

I think this project may have been too ambitious in requiring every module to be on the same Unicode version. That requirement is why a single PR, #226, has stalled development of everything over issues with unic-ucd-segment.

Obviously some of these modules are very easy to keep updated, and unic-gen works phenomenally well. The modules implementing things like text segmentation and BIDI are going to be more difficult, and certainly subject to the needs of the community…which more often match mine than not. In short, users who only care about getting character names shouldn't suffer because no one has yet contributed a fix to a text segmentation problem.

Anyway, I doubt y'all will agree, which is why I forked, but I thought I'd let you know why I forked.

eyeplum commented 3 years ago

I'm in a similar position to you: I have a Unicode tool in production and need to keep the app's data up-to-date (currently Unicode 13.0). So far I've only been updating rust-unic in my own fork, https://github.com/eyeplum/rust-unic. Since my app also has a feature to perform grapheme segmentation, I also attempted a fix for unic-ucd-segment in Unicode 11.0 (which I presume works, since all tests are currently passing in my fork).

I would love to eventually merge my fork back so that we could keep rust-unic up-to-date (iirc updates after 11.0 are pretty straightforward).

I will try to find some time to break the changes on my fork into small PRs and see.


As for decoupling the modules so that they can have different Unicode versions, it sounds pretty tempting to me as well to have something like a separate UCD module which is always kept up-to-date (probably trivially), since that's my main use case too. Maybe breaking rust-unic into separate repos and using versioning would make it possible?

E.g. in the unified project, rust-unic depends on rust-unic-ucd (a separate project), and the Unicode version is kept the same between rust-unic and rust-unic-ucd:

rust-unic (Unicode 13.0)
`-- rust-unic-ucd (Unicode 13.0)
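
A minimal sketch of how the unified project could enforce that the two versions stay the same, by failing the build when they drift apart. The `UnicodeVersion` struct, both constants, and `same_version` are hypothetical stand-ins defined here so the example is self-contained; they are not taken from the actual UNIC crates.

```rust
/// Hypothetical stand-in for a Unicode version triple.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct UnicodeVersion {
    pub major: u16,
    pub minor: u16,
    pub micro: u16,
}

/// Assumed: the version the separate rust-unic-ucd project was generated from.
pub const UCD_UNICODE_VERSION: UnicodeVersion =
    UnicodeVersion { major: 13, minor: 0, micro: 0 };

/// Assumed: the version the unified rust-unic project targets.
pub const UNIC_UNICODE_VERSION: UnicodeVersion =
    UnicodeVersion { major: 13, minor: 0, micro: 0 };

/// Field-by-field comparison that can run in a const context
/// (the derived `PartialEq` impl cannot be called from a `const fn`).
pub const fn same_version(a: UnicodeVersion, b: UnicodeVersion) -> bool {
    a.major == b.major && a.minor == b.minor && a.micro == b.micro
}

// Evaluated at compile time: the build fails if the two versions differ.
const _: () = assert!(same_version(UNIC_UNICODE_VERSION, UCD_UNICODE_VERSION));
```

With a check like this, bumping only one of the two projects would show up as a compile error rather than as silently mismatched tables.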

In its own project, rust-unic-ucd can have any Unicode version it supports (as git tags or branches):

rust-unic-ucd
- branch: master => Unicode 13.0 (the latest released version of the Unicode Standard)
- branch: next => Unicode 14.0 (the next release of the Unicode Standard)
- tag: unicode-12.1 => Unicode 12.1
- tag: unicode-12.0 => Unicode 12.0
- tag: unicode-11.0 => Unicode 11.0
- ...
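
As a rough sketch, the branch/tag layout above amounts to a small mapping from a requested Unicode version to a git ref. The `Version` and `GitRef` types and the `git_ref_for` function are hypothetical names made up for illustration, and the sketch assumes every earlier release has been tagged as `unicode-<major>.<minor>`.

```rust
use std::cmp::Ordering;

/// Hypothetical (major, minor) Unicode version; derived `Ord` compares
/// `major` first, then `minor`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Version {
    pub major: u16,
    pub minor: u16,
}

/// A git ref in the proposed layout.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum GitRef {
    Branch(&'static str),
    Tag(String),
}

/// Maps a requested Unicode version to the ref that would carry it.
/// `latest` is the latest released version, `next` the upcoming one.
pub fn git_ref_for(wanted: Version, latest: Version, next: Version) -> Option<GitRef> {
    if wanted == next {
        return Some(GitRef::Branch("next"));
    }
    match wanted.cmp(&latest) {
        Ordering::Equal => Some(GitRef::Branch("master")),
        // Assumption: every older release was tagged when it was superseded.
        Ordering::Less => Some(GitRef::Tag(format!(
            "unicode-{}.{}",
            wanted.major, wanted.minor
        ))),
        // Newer than anything the project knows about.
        Ordering::Greater => None,
    }
}
```

A downstream user who only needs character names could then pin whichever ref matches the Unicode version they want, independently of the unified project.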

I'm actually quite excited playing around with this idea in my head, but I haven't thought through all the ramifications.

zbraniecki commented 3 years ago

Hi all! You may want to consider helping us with the ICU4x project! One of the power features we're working on is a robust data provider which works with Unicode properties and should address your needs. We're currently focused on supplying the needs of regular expression and segmentation APIs, but would be open to collaborating on other targets!

CAD97 commented 3 years ago

Yeah, for the time being, ICU4x is the way to go. At least until @behnam is active again, this project is effectively on hiatus.

If you ping me, I think I can still merge PRs, but I wouldn't personally suggest using unic as a Unicode table provider at the moment.