修改一行代码可以让DAT构建速度提升N倍

qiangwang commented 1 year ago

Describe the feature and the current behavior/state.

int pos = Math.max(siblings.get(0).code + 1, nextCheckPos) - 1; 改成 int pos = Math.max(siblings.get(0).code, nextCheckPos);

构建速度可以提升很多而且表现稳定，缺点是最终构建出的DAT大小微增。下面是我的测试数据：

其中保留nextCheckPos为原版代码，去掉nextCheckPos用作对比，核心循环计数是在循环体内做了一个count++计数：

Will this change the current api? How?

No

Who will benefit with this feature?

Anyone who uses DAT

Are you willing to contribute it (Yes/No):

Yes

System information

Any other info

hankcs commented 1 year ago

感谢共享你的发现。我认为，去掉减1后，nextCheckPos在很多时候不会被-1。又由于循环体开始时会对pos+1： https://github.com/hankcs/HanLP/blob/d34dab3aa5a3f6eaf25c198ecadaed4aa0fb9270/src/main/java/com/hankcs/hanlp/collection/trie/DoubleArrayTrie.java#L178 这就导致每次调用insert其实都是从nextCheckPos+1开始搜索，每次都空出来了1个位置，于是导致了DAT体积微增。空出来的位置越多，下次insert越容易搜索到空闲空间，所以搜索时间就降下来了。

不过DAT的构建一般是离线进行，时间开销不敏感。另外，体积增大意味着内存占用也增大。时间与空间，不同场景有不同的偏好吧。如果你有兴趣给所有build方法加一个boolean fast参数，我很乐意merge。

qiangwang commented 1 year ago

好的，我们的应用场景是会频繁变更和构建，构建时间需要稳定在1s以内。建议加个enableFastBuild成员变量，不用层层透传，也不影响现存接口，我找时间提个mr。

hankcs / HanLP