unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org
Other
1.37k stars 176 forks source link

Consider using grapheme cluster breaker in segmenter_lstm #1531

Closed aethanyc closed 2 years ago

aethanyc commented 2 years ago

segmenter_lstm/src/lstm.rs uses the external unicode-segmentation crate to iterate over the grapheme clusters, but now we implemented grapheme cluster breaker in segmenter.

https://github.com/unicode-org/icu4x/blob/5a015efd970e2a008d24f67f7d13598ae5901223/experimental/segmenter_lstm/src/lstm.rs#L106-L108

It doesn't seem possible for two creates to depend on each other. If it makes sense to remove unicode-segmentation to deduce the dependency, maybe merging segmenter and segmenter_lstm is a feasible solution.

Thought? @sffc @Manishearth @SahandFarhoodi @makotokato

sffc commented 2 years ago

Yes, the dependency on unicode-segmentation was always intended to be temporary. I think we should merge the LSTM into the main crate. The "lstm" feature can disable any extra dependencies if desired.

dminor commented 2 years ago

To be solved at the same time as https://github.com/unicode-org/icu4x/issues/1654

makotokato commented 2 years ago

To be solved at the same time as https://github.com/unicode-org/icu4x/issues/1654

lstm data doesn't move to testdata yet, so this issue will be fixed after data provider support of lstm is finished (https://github.com/unicode-org/icu4x/issues/905).

sffc commented 2 years ago

I see three options:

  1. Add a feature "lstm-grapheme" that enables the dependency
  2. Remove support for grapheme-based LSTM
  3. Update the call site at https://github.com/unicode-org/icu4x/blob/68107ef8a4142be0f2172d23a4703a7b40ca419f/experimental/segmenter/src/lstm_bies.rs#L110

I think even if we do (3), we may still want a feature flag so that we don't carry the grapheme data whenever anyone uses LSTM.

sffc commented 2 years ago

For 1.0, do option 1 above.