Hard-coded dictionary? - Githubissues

messense / mmseg-rs

Chinese word segmentation algorithm MMSEG in Rust

MIT License

7 stars 0 forks

Closed LuoZijun closed 6 years ago

LuoZijun commented 6 years ago

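Hi, would hard-coding the dictionary as a slice be a good approach?

I once wrote a hidden Markov model that used a hard-coded dictionary, and the binary size did not seem to grow very much.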
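For illustration, here is a minimal sketch of what a slice-based hard-coded dictionary might look like in Rust. The entries, the frequencies, and the names DICT and lookup are all made up for this example; none of them come from mmseg-rs or jieba-rs. The slice is kept sorted by word so that a lookup can use binary search without building any structure at startup.

```rust
// Hypothetical hard-coded dictionary: (word, frequency) pairs compiled into
// the binary. Entries must stay sorted by word (byte order) for binary search.
// The words and frequencies below are invented for illustration.
static DICT: &[(&str, u32)] = &[
    ("中国", 500),
    ("分词", 120),
    ("词典", 80),
];

/// Return the frequency of `word`, or None if it is not in the dictionary.
fn lookup(word: &str) -> Option<u32> {
    DICT.binary_search_by_key(&word, |&(w, _)| w)
        .ok()
        .map(|i| DICT[i].1)
}

fn main() {
    println!("{:?}", lookup("分词"));
    println!("{:?}", lookup("没有"));
}
```

Because the data lives in the read-only section of the executable, nothing has to be parsed or allocated at load time. The trade-off is that the dictionary can only be changed by recompiling.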

messense commented 6 years ago

You can use no-default-features to disable the embedded dict and call load_dict yourself to load a custom dictionary.

LuoZijun commented 6 years ago

OK, I'll look into it when I have time.

messense commented 6 years ago

LuoZijun commented 6 years ago

@messense Wow, impressive 👍

LuoZijun commented 6 years ago

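@messense I just ran your jieba-rs/examples and found that memory usage is still fairly high, close to 76MB (CPP jieba is close to 120MB).

Also, running cd examples/weicheng; cargo run seems to have a problem: it does not finish even after a long wait. Running cd examples/weicheng; cargo run --release works fine and returns the number 10.

The elapsed time looks far from ideal. It is probably a problem with how the dictionary is processed.

I suggest using a slice directly. That way, processing 《围城》 (Fortress Besieged) would take on the order of milliseconds, and memory usage would be around 15.7MB.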

messense commented 6 years ago

@LuoZijun It is pretty normal for a Rust debug build to be slow, since very few optimizations are enabled. Also, that example segments 《围城》 line by line 50 times.
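As a rough sketch of the kind of loop described here (segmenting a text line by line for a fixed number of rounds and timing the whole run), the following uses a whitespace splitter as a stand-in for a real segmenter. It is not the code of the actual weicheng example; segment and bench are hypothetical names introduced for illustration.

```rust
use std::time::{Duration, Instant};

// Stand-in for a real word segmenter (hypothetical): splits on whitespace.
fn segment(line: &str) -> Vec<&str> {
    line.split_whitespace().collect()
}

/// Segment every line of `text` `rounds` times.
/// Returns the total number of tokens produced and the elapsed wall time.
fn bench(text: &str, rounds: usize) -> (usize, Duration) {
    let start = Instant::now();
    let mut tokens = 0;
    for _ in 0..rounds {
        for line in text.lines() {
            tokens += segment(line).len();
        }
    }
    (tokens, start.elapsed())
}

fn main() {
    let text = "first line of text\nsecond line";
    let (tokens, elapsed) = bench(text, 50);
    println!("{} tokens in {:?}", tokens, elapsed);
}
```

Running a loop like this with cargo run and then with cargo run --release is a simple way to see how much of the measured time comes from the unoptimized debug build rather than from the algorithm itself.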

LuoZijun commented 6 years ago

Just noticed that it segments 50 times :))

LuoZijun commented 6 years ago

The debug build just finished running. On my hardware it took about 576 seconds.

messense commented 6 years ago

LuoZijun commented 6 years ago

Does cargo run --release also take about 10 seconds on your machine? I am on a MacBook Pro (i7). It looks like there is still room for optimization.

messense commented 6 years ago

A MacBook Pro (Retina, 15-inch, Mid 2015) also takes about 10 seconds. There is indeed room for optimization: