aethanyc closed this issue 2 years ago
Yes, the dependency on unicode-segmentation was always intended to be temporary. I think we should merge the LSTM into the main crate. The "lstm" feature can disable any extra dependencies if desired.
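A rough sketch of what gating the LSTM path behind an "lstm" Cargo feature could look like. This is illustration only: `rule_based_breaks`, `lstm_breaks`, and `word_breaks` are hypothetical names rather than functions from the icu4x crates, and the bodies are stand-ins. It assumes the feature is declared under `[features]` in Cargo.toml.

```rust
// Sketch only: all function names here are hypothetical, not icu4x APIs.

/// Stand-in for the always-available segmenter: reports a break at every
/// char boundary, plus the end of the string.
pub fn rule_based_breaks(text: &str) -> Vec<usize> {
    text.char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(text.len()))
        .collect()
}

/// Compiled only when the crate is built with `--features lstm`, so the
/// model code and its extra dependencies stay out of default builds.
#[cfg(feature = "lstm")]
pub fn lstm_breaks(text: &str) -> Vec<usize> {
    // A real implementation would run the LSTM model here.
    vec![0, text.len()]
}

/// Entry point when the "lstm" feature is enabled.
#[cfg(feature = "lstm")]
pub fn word_breaks(text: &str) -> Vec<usize> {
    lstm_breaks(text)
}

/// Entry point when the "lstm" feature is disabled.
#[cfg(not(feature = "lstm"))]
pub fn word_breaks(text: &str) -> Vec<usize> {
    rule_based_breaks(text)
}
```

Callers only ever see `word_breaks`, so turning the feature on or off changes which code is compiled without changing the public signature.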
To be solved at the same time as https://github.com/unicode-org/icu4x/issues/1654
The LSTM data hasn't been moved to testdata yet, so this issue will be fixed after data provider support for LSTM is finished (https://github.com/unicode-org/icu4x/issues/905).
I see three options:
I think even if we do (3), we may still want a feature flag so that we don't carry the grapheme data whenever anyone uses LSTM.
For 1.0, do option 1 above.
segmenter_lstm/src/lstm.rs uses the external unicode-segmentation crate to iterate over the grapheme clusters, but we have now implemented a grapheme cluster breaker in segmenter.
https://github.com/unicode-org/icu4x/blob/5a015efd970e2a008d24f67f7d13598ae5901223/experimental/segmenter_lstm/src/lstm.rs#L106-L108
It doesn't seem possible for two crates to depend on each other. If it makes sense to remove unicode-segmentation to reduce dependencies, maybe merging segmenter and segmenter_lstm is a feasible solution.
Thoughts? @sffc @Manishearth @SahandFarhoodi @makotokato
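For reference, a minimal sketch of the adaptation such a switch might need. It assumes the in-tree grapheme cluster breaker reports break positions as byte offsets, whereas unicode-segmentation's `graphemes` iterator yields string slices. `clusters_from_breaks` is a hypothetical helper, and the breakpoints in the usage note are written by hand rather than produced by a segmenter.

```rust
// Sketch only: `clusters_from_breaks` is a hypothetical helper, not part of
// icu4x or unicode-segmentation.

/// Turns byte-offset breakpoints into the substrings between consecutive
/// breakpoints. Expects `breaks` to be ascending, to start at 0, to end at
/// `text.len()`, and to fall on char boundaries; slicing panics otherwise.
pub fn clusters_from_breaks<'a>(text: &'a str, breaks: &[usize]) -> Vec<&'a str> {
    breaks
        .windows(2) // each pair of neighbouring breakpoints bounds one cluster
        .map(|w| &text[w[0]..w[1]])
        .collect()
}
```

For example, with the text "e\u{301}a" (an "e" followed by a combining acute accent, then "a") and hand-written breakpoints `[0, 3, 4]`, the helper returns the two slices "e\u{301}" and "a".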