hankcs / AhoCorasickDoubleArrayTrie

An extremely fast implementation of the Aho-Corasick algorithm based on a Double Array Trie.
http://www.hankcs.com/program/algorithm/aho-corasick-double-array-trie.html

Building the trie always fails with "OutOfMemoryError: GC overhead limit exceeded" once the number of entries exceeds 1000000 #38

Open gaohang opened 4 years ago

gaohang commented 4 years ago

Is there a limit on dictionary capacity? The machine has 64 GB of memory, which should be enough.

hankcs commented 4 years ago

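This structure uses UTF-16 code units as its alphabet, so it is not well suited to storing large dictionaries. Chinese characters occupy the Unicode range 0x4E00--0x9FA5, which is fairly spread out. You could try using bytes as the alphabet instead.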
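
A minimal sketch of what a byte alphabet could look like, assuming the keys are re-encoded before they are handed to a char-based trie. The class `ByteAlphabet` and its methods are hypothetical names for illustration and are not part of the library. Each key is encoded as UTF-8, and each byte is then stored in its own `char` (via ISO-8859-1), so every code the trie sees falls in 0..255 instead of being scattered across 0x4E00--0x9FA5.

```java
import java.nio.charset.StandardCharsets;

public class ByteAlphabet {
    // Hypothetical helper: re-encode a key so that each char carries one UTF-8 byte (0..255).
    static String toByteKey(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        // ISO-8859-1 maps every byte value 1:1 onto chars U+0000..U+00FF.
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Hypothetical helper: recover the original string from a byte key.
    static String fromByteKey(String key) {
        byte[] utf8 = key.getBytes(StandardCharsets.ISO_8859_1);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Largest code value in a string, i.e. how wide an alphabet a trie over it must cover.
    static int maxCode(String s) {
        int max = 0;
        for (int i = 0; i < s.length(); i++) {
            max = Math.max(max, s.charAt(i));
        }
        return max;
    }

    public static void main(String[] args) {
        String word = "字典";
        String key = toByteKey(word);
        System.out.println("max code (UTF-16 key): " + maxCode(word));
        System.out.println("max code (byte key):   " + maxCode(key));
        System.out.println("byte key length:       " + key.length());
        System.out.println("round trip ok:         " + fromByteKey(key).equals(word));
    }
}
```

The trade-off is that keys get longer (three bytes per common Chinese character), and the text being searched would need the same re-encoding, so any match offsets come back as byte positions rather than character positions.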

gaohang commented 4 years ago

Compared with a HashMap, a DAT consumes less memory. Yet a HashMap of 100000000 docs can be built in memory, while a DAT with 10000000 docs leads to OOM?