unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

Document intent for regular expressions #37

Closed hsivonen closed 4 years ago

hsivonen commented 4 years ago

ecosystem.md mentions icu::Regex. The Rust regex crate already exists and is very performant (in part due to not supporting some Perl-popularized features that aren't actually regular and hinder performance).

It might be useful to signal intent in this area at some point.

Does the project seek to provide regular expressions that operate on UTF-8 for Rust apps? If so, what would be the elevator pitch relative to the regex crate?

Does the project seek to provide regular expressions that operate on UTF-16 and Latin1 and conform to ECMAScript regular expressions for use in JavaScript engines? If so, what would be the elevator pitch relative to what SpiderMonkey and V8 already have?

Does the project seek to provide regular expressions that Dart or Go programs would use? If so, what would be the elevator pitch relative to what the standard libraries of these languages provide?

Does the project seek to provide regular expressions that C or C++ apps would use via FFI? If so, would this just FFI around the regex crate (i.e. UTF-8), something new, or for UTF-16?

zbraniecki commented 4 years ago

Is there any performance/memory benchmark that compares the regex crate against irregexp or other popular regexp engines?

nciric commented 4 years ago

I think there are two potential reasons for developing icu4x regex:

  1. Full Unicode support (script ranges for example)
  2. Potential for easy Wasm compilation

Now, I don't know whether the Rust regex crate already offers this. If so, we could just fall back to it without developing our own, if licencing is not a problem.

zbraniecki commented 4 years ago

> if licencing is not a problem.

According to the latest I saw, it shouldn't be!

https://github.com/rust-lang/regex#license

nciric commented 4 years ago

It seems it supports some level of the Unicode algorithms: https://github.com/rust-lang/regex/blob/master/UNICODE.md

Which brings another question - what to do with Unicode properties. Can they be shared across crates?

macchiati commented 4 years ago

I just looked over the description on https://github.com/rust-lang/regex/blob/master/UNICODE.md. The support is, in general, quite good.

The main area where it falls short is support for additional Unicode properties. A second question I have is how good Rust is about updating to the newest version of Unicode, and whether there is an API in Rust to detect the version of Unicode supported.

Mark
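
On the question of detecting the supported Unicode version: the Rust standard library exposes a constant, `std::char::UNICODE_VERSION`, giving the Unicode version that the `char` and `str` methods are based on. The sketch below only reads that constant; the helper name `std_unicode_version` is made up for illustration. Note the assumption involved: this reports the standard library's version, which is not necessarily the version of the Unicode tables a separate crate such as regex ships with, so that would need to be checked separately.

```rust
/// Formats the Unicode version that the standard library's `char` and `str`
/// methods are based on, as "major.minor.update".
/// (`std_unicode_version` is a hypothetical helper name, not an existing API.)
fn std_unicode_version() -> String {
    // `UNICODE_VERSION` is a `(u8, u8, u8)` tuple provided by the standard library.
    let (major, minor, update) = std::char::UNICODE_VERSION;
    format!("{}.{}.{}", major, minor, update)
}

fn main() {
    println!("std Unicode version: {}", std_unicode_version());
}
```

The printed value depends on the Rust toolchain in use, so it also gives a rough sense of how closely a given Rust release tracks new Unicode versions.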

sffc commented 4 years ago

I'll take this issue and add a note about this to ecosystem.md. I plan to send a PR to that doc with a new column saying to what degree we want to pull in existing code from each crate.

sffc commented 4 years ago

I added this to #41 and am documenting that we don't intend to take action on regex support in ICU4X at this time.