jannson opened this issue 11 years ago
Implementing this isn't hard at all; it's roughly what this guy describes at http://stackoverflow.com/questions/524342/how-to-store-a-hash-table-in-a-file: ditch the pointers for indices.
This is a bit similar to constructing an on-disk DAWG, which I did a while back. What made that so very sweet was that it could be loaded directly with mmap instead of reading the file. If the hash-space is manageable, say 2^16 or 2^24 entries, then I think I would do something like this:
- Keep a list of free indices. (If the table is empty, each chain-index would point at the next index.)
- When chaining is needed, use the free space in the table.
- If you need to put something in an index that's occupied by a squatter (overflow from elsewhere):
  - record the index (let's call it N)
  - swap the new element and the squatter
  - put the squatter in a new free index (F)
  - follow the chain on the squatter's hash index, to replace N with F.
If you completely run out of free indices, you probably need a bigger table, but you can cope a little longer by using mremap to create extra room after the table.
This should allow you to mmap and use the table directly, without modification (scary fast if it's in the OS cache!), but you have to work with indices instead of pointers. It's pretty spooky to have megabytes available in syscall-round-trip time, and still have it take up less than that in physical memory, because of paging.
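To make the indices-instead-of-pointers idea concrete, here is a minimal C++ sketch (the `Slot` layout, the hash function, and all names are my own assumptions, not code from jieba or cppjieba). Chains are stored as array indices, so the whole slot array can be dumped to a file once and later mapped back with mmap; the "squatter" relocation from the list above is the last branch of `insert()`.

```cpp
// Minimal sketch of an index-based hash table that can be written to a file
// and later mmap'ed read-only. Slot layout, hash, and names are assumptions.
#include <cstdint>
#include <vector>
#include <stdexcept>

struct Slot {
    uint64_t key   = 0;   // the key itself (0 is reserved to mean "empty")
    uint32_t value = 0;   // payload, e.g. an offset into a string pool
    int32_t  next  = -1;  // index of the next slot in this bucket's chain, -1 = end
};

class FlatHash {
public:
    explicit FlatHash(uint32_t capacity) : slots_(capacity) {}

    // Home bucket of a key (simple multiplicative hash for the sketch).
    uint32_t home(uint64_t key) const {
        return static_cast<uint32_t>((key * 11400714819323198485ull) % slots_.size());
    }

    void insert(uint64_t key, uint32_t value) {
        uint32_t h = home(key);
        Slot& s = slots_[h];
        if (s.key == 0) {                       // home slot free: just take it
            s = {key, value, -1};
            return;
        }
        if (home(s.key) == h) {                 // home slot owned by its own bucket:
            int32_t f = find_free();            // chain the new entry in a free slot
            slots_[f] = {key, value, s.next};
            s.next = f;
            return;
        }
        // Home slot holds a "squatter" (overflow from another bucket): move it
        // to a free slot F, patch its chain to point at F instead of N (= h),
        // then put the new entry in its rightful home slot.
        int32_t f = find_free();
        slots_[f] = s;                          // relocate the squatter
        int32_t p = home(s.key);                // walk the squatter's own chain
        while (slots_[p].next != static_cast<int32_t>(h)) p = slots_[p].next;
        slots_[p].next = f;                     // replace N with F
        s = {key, value, -1};
    }

    // Lookup only follows indices, so it works the same whether the slots live
    // in this vector or in a read-only mmap'ed file.
    bool find(uint64_t key, uint32_t* out) const {
        int32_t i = home(key);
        if (slots_[i].key == 0) return false;
        if (home(slots_[i].key) != static_cast<uint32_t>(i)) return false; // only a squatter here
        for (; i != -1; i = slots_[i].next)
            if (slots_[i].key == key) { *out = slots_[i].value; return true; }
        return false;
    }

    const void* data() const { return slots_.data(); }            // write this block to a file,
    size_t bytes() const { return slots_.size() * sizeof(Slot); } // then mmap it later

private:
    int32_t find_free() {                       // linear scan; a real free list is nicer
        for (size_t i = 0; i < slots_.size(); ++i)
            if (slots_[i].key == 0) return static_cast<int32_t>(i);
        throw std::runtime_error("table full (grow with mremap or rebuild larger)");
    }
    std::vector<Slot> slots_;
};
```

Since `find()` never touches a pointer, the same lookup logic works over a read-only mapping of the saved file; and, as the answer notes, on Linux a full table could be grown in place with mremap.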
However, the accepted answer to that question uses Boost serialization, which rules out sharing memory via mmap. So the right approach is still the one above.
What about a shared third-party store, e.g. memcached? There are also other third-party stores that can stay resident in memory long-term. Serialize the dictionary and put it into such an in-memory store; that should be much faster than loading it from disk. I'm not sure how threads would perform for this kind of compute-intensive work, though... multithreading is also a decent option... along the lines of how an HTTP server works.
Whatever you pull out of memcache or redis still has to be turned back into a dict via JSON, and that step is very time-consuming, especially when the dictionary is large. What I really want is a way to share memory between multiple independent processes.
Every process takes several seconds to load the dictionary, and every process needs a large amount of memory to hold it. Letting multiple processes mmap and share a single large read-only dictionary could be a good solution. I plan to start by modifying cppjieba, written by a very capable author, to implement this idea; anyone interested, please follow https://github.com/jannson/cppjieba
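As a rough sketch of that setup (POSIX only, error handling trimmed; `dict.flat` and the `Slot` layout are just assumptions carried over from the sketch above, not cppjieba's actual format), each worker process would simply map the prebuilt dictionary read-only:

```cpp
// Each worker maps the same prebuilt dictionary file read-only. With
// PROT_READ + MAP_SHARED the processes share the same physical pages through
// the page cache, so "loading" is one mmap call instead of seconds of parsing.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const char* path = "dict.flat";            // file produced offline by the builder
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }
    close(fd);                                  // the mapping stays valid after close

    // Interpret the mapping as the flat slot array and query it directly;
    // no per-process copy, no deserialization step, e.g.:
    // const Slot* slots = static_cast<const Slot*>(base);
    // ... lookups as in FlatHash::find(), but over `slots` ...

    munmap(base, st.st_size);
    return 0;
}
```

Because the mapping is read-only and backed by the same file, the kernel keeps a single copy of the pages in the page cache, so the dictionary's memory cost is paid roughly once no matter how many worker processes are running.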
The reason I'm proposing this is that two of my own projects really do need multiple processes doing segmentation, and loading the dictionary every time wastes time; the waiting is annoying!
Of course, cppjieba already handles segmentation centrally over HTTP, which is another direction worth looking at.