Closed dfish3r closed 7 years ago
See #47
Some statistics for a 672MB file, measured in a Java process with a 1G max heap:
WordList | Cache Percent | Init Time | Search Time | Heap Size |
---|---|---|---|---|
FileWordList | 0% | 18s | 72s | 224MB |
FileWordList | 1% | 19s | 2.9ms | 367MB |
FileWordList | 5% | 20s | 1.3ms | 519MB |
FileWordList | 10% | 21s | 0.9ms | 625MB |
FileWordList | 15% | 34s | 1.1ms | 822MB |
MemoryMappedFileWordList | 0% | 5s | 43s | 224MB |
MemoryMappedFileWordList | 1% | 6s | 2.0ms | 367MB |
MemoryMappedFileWordList | 5% | 7s | 0.7ms | 514MB |
MemoryMappedFileWordList | 10% | 8s | 0.5ms | 619MB |
MemoryMappedFileWordList | 15% | 21s | 0.4ms | 819MB |
@serac may have an outstanding issue with Unicode characters. A different PR can be used if any further changes are needed.
Improve FileWordList by using BufferedReader in #readFile. Change the meaning of cachePercent to apply to file size rather than number of lines. (Its meaning was never well defined.) Because the file size is known before reading starts, the cache can be built inline while reading the file, removing the need to read the file twice. This will generally result in larger caches, but users can tune the cache size down if that is an issue.
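
For illustration only, here is a minimal sketch of the single-pass idea, not the code in this PR. The class `InlineCacheSketch`, its `Index` holder, and `buildIndex` are hypothetical names. The sketch assumes 8 bytes per cached offset, single-character `\n` line terminators, and single-byte characters. It also assumes a sampling rule (cache every `fileSize / budget`-th line until the budget is used up), which may differ from what the PR does.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: not the FileWordList implementation.
final class InlineCacheSketch {

  /** Sampled line offsets plus the total number of lines read. */
  static final class Index {
    final List<Long> cachedOffsets = new ArrayList<>();
    long lineCount;
  }

  /**
   * Reads the source once and fills the cache while reading.
   * The cache budget comes from the file size, which is known up front,
   * so no counting pass over the lines is needed.
   */
  static Index buildIndex(Reader source, long fileSize, int cachePercent) {
    if (cachePercent < 0 || cachePercent > 100) {
      throw new IllegalArgumentException("cachePercent must be between 0 and 100");
    }
    // Budget in bytes, derived from file size rather than line count.
    final long budgetBytes = fileSize * cachePercent / 100;
    // Assumption: each cache entry is one long offset.
    final long maxEntries = budgetBytes / Long.BYTES;
    // Assumption: sample every modulus-th line; 0 disables caching.
    final long modulus = maxEntries == 0 ? 0 : Math.max(1, fileSize / budgetBytes);

    final Index index = new Index();
    try (BufferedReader reader = new BufferedReader(source)) {
      long offset = 0;
      String line;
      while ((line = reader.readLine()) != null) {
        if (modulus > 0
            && index.lineCount % modulus == 0
            && index.cachedOffsets.size() < maxEntries) {
          index.cachedOffsets.add(offset);
        }
        // Assumes a one-character '\n' terminator and single-byte characters.
        offset += line.length() + 1;
        index.lineCount++;
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return index;
  }
}
```

With `cachePercent` set to 0 the budget is zero and nothing is cached, so every lookup would have to go back to the file. That is consistent with the slow 0% rows in the table above.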