Benchmarking various solutions for counting word and phrase frequency in corpora.
It also provides ready-to-use win32/win64 binaries of grep, ag (aka The Silver Searcher), pt (aka The Platinum Searcher) and sift, for those who are too lazy to compile their own and just want the best tool for the job.
There are at least half a dozen popular utilities that search for strings inside text files, and most of them claim to be the fastest. We put those claims to the test.
The `corpus.txt` file is a 792 MB fragment of the OpenSubtitles2016 corpus, freely available here / direct link to the English version.

Benchmarks are always hotly contested. Your mileage may vary. However, some conclusions come to mind:

Case-insensitive search, average of 7 runs:

Utility | Average Time (s) | Characters per second |
---|---|---|
ag 0.29.1 | 1.691857143 | 487,376,204 |
ag 0.31.0 x64 | 2.035142857 | 405,166,109 |
GNU grep 2.5.1 unxutils | 3.366142857 | 244,960,166 |
GNU grep 2.5.4 gnuwin32 | 3.109571429 | 265,171,883 |
GNU grep 2.5.4 msys | 4.200428571 | 196,306,376 |
GNU grep 2.0d tcharron | 4.490285714 | 183,634,398 |
GNU grep 2.24 cygwin x64 | 2.710285714 | 304,237,634 |
GNU grep 2.3 fender | 1.225714286 | 672,726,851 |
GNU grep 2.4.2 wbin | 1.246571429 | 661,471,050 |
GNU grep 2.4.2 msys | 1.276714286 | 645,853,909 |
pt 2.1.2 | 236.8115714 | 3,481,971 |
pt 2.1.2 x64 | 224.435 | 3,673,985 |
sift 0.8.0 | 15.03757143 | 54,834,048 |
sift 0.8.0 x64 | 4.597142857 | 179,365,954 |
find MS windows 8.1 | 196.0131429 | 4,206,712 |

Case-insensitive search, individual run times in seconds:

Utility | Command | Run #1 | Run #2 | Run #3 | Run #4 | Run #5 | Run #6 | Run #7 |
---|---|---|---|---|---|---|---|---|
ag 0.29.1 | binaries\ag\ag -ciF "fair game" corpus.txt | 1.749 | 1.687 | 1.681 | 1.682 | 1.679 | 1.678 | 1.687 |
ag 0.31.0 x64 | binaries\ag64\ag -ciF "fair game" corpus.txt | 2.344 | 2.018 | 1.928 | 1.931 | 2.008 | 2.008 | 2.009 |
GNU grep 2.5.1 unxutils | binaries\grep1\grep -ciF "fair game" corpus.txt | 3.44 | 3.348 | 3.35 | 3.36 | 3.363 | 3.354 | 3.348 |
GNU grep 2.5.4 gnuwin32 | binaries\grep2\grep -ciF "fair game" corpus.txt | 3.181 | 3.092 | 3.093 | 3.102 | 3.09 | 3.103 | 3.106 |
GNU grep 2.5.4 msys | binaries\grep3\grep -ciF "fair game" corpus.txt | 4.292 | 4.159 | 4.199 | 4.189 | 4.187 | 4.189 | 4.188 |
GNU grep 2.0d tcharron | binaries\grep4\grep -ciF "fair game" corpus.txt | 4.566 | 4.471 | 4.468 | 4.491 | 4.472 | 4.468 | 4.496 |
GNU grep 2.24 cygwin x64 | binaries\grep5\grep -ciF "fair game" corpus.txt | 2.791 | 2.697 | 2.694 | 2.699 | 2.697 | 2.691 | 2.703 |
GNU grep 2.3 fender | binaries\grep6\grep -ciF "fair game" corpus.txt | 1.294 | 1.219 | 1.227 | 1.208 | 1.21 | 1.21 | 1.212 |
GNU grep 2.4.2 wbin | binaries\grep7\grep -ciF "fair game" corpus.txt | 1.335 | 1.215 | 1.318 | 1.211 | 1.212 | 1.214 | 1.221 |
GNU grep 2.4.2 msys | binaries\grep8\grep -ciF "fair game" corpus.txt | 1.401 | 1.256 | 1.266 | 1.257 | 1.245 | 1.26 | 1.252 |
pt 2.1.2 | binaries\pt\pt /c /i "fair game" corpus.txt | 248.294 | 234.95 | 234.85 | 234.943 | 234.922 | 235.018 | 234.704 |
pt 2.1.2 x64 | binaries\pt64\pt /c /i "fair game" corpus.txt | 227.625 | 223.743 | 223.763 | 223.813 | 224.1 | 223.907 | 224.094 |
sift 0.8.0 | binaries\sift\sift -cQi "fair game" corpus.txt | 16.298 | 14.87 | 14.835 | 14.805 | 14.806 | 14.816 | 14.833 |
sift 0.8.0 x64 | binaries\sift64\sift -cQi "fair game" corpus.txt | 5.062 | 4.524 | 4.508 | 4.498 | 4.522 | 4.51 | 4.556 |
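
As a sanity check, the average time and characters-per-second figures can be recomputed from the individual runs. The sketch below does this for the `ag 0.29.1` case-insensitive row. The corpus size of 824,570,912 characters is an assumption: it is back-calculated from the published figures (average time × characters per second), not copied from a `wc -m` output. The names `CORPUS_CHARS` and `summarize` are illustrative only.

```python
from statistics import mean

# Assumption: total character count of corpus.txt, back-calculated from the
# published tables (average time x characters per second).
CORPUS_CHARS = 824_570_912

# Run times in seconds for "ag 0.29.1", case-insensitive search (7 runs).
AG_0_29_1_RUNS = [1.749, 1.687, 1.681, 1.682, 1.679, 1.678, 1.687]


def summarize(runs, total_chars=CORPUS_CHARS):
    """Return (average seconds, characters per second) for a list of run times."""
    avg = mean(runs)
    return avg, total_chars / avg


if __name__ == "__main__":
    avg, cps = summarize(AG_0_29_1_RUNS)
    print(f"average: {avg:.9f} s, throughput: {cps:,.0f} chars/s")
```

Feeding in the runs of any other row should reproduce that row's published figures to within rounding, provided the assumed character count is right.
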

Case-sensitive search, average of 10 runs:

Utility | Average Time (s) | Characters per second |
---|---|---|
ag 0.29.1 | 1.7064 | 483,222,522 |
ag 0.31.0 x64 | 1.9827 | 415,882,843 |
GNU grep 2.5.1 unxutils | 3.1709 | 260,043,178 |
GNU grep 2.5.4 gnuwin32 | 2.9238 | 282,020,286 |
GNU grep 2.5.4 msys | 4.0008 | 206,101,508 |
GNU grep 2.0d tcharron | 4.2958 | 191,948,161 |
GNU grep 2.24 cygwin x64 | 2.7343 | 301,565,634 |
GNU grep 2.3 fender | 1.0047 | 820,713,558 |
GNU grep 2.4.2 wbin | 1.02 | 808,402,855 |
GNU grep 2.4.2 msys | 1.0334 | 797,920,372 |
pt 2.1.2 | 5.9739 | 138,028,911 |
pt 2.1.2 x64 | 4.8024 | 171,699,757 |
sift 0.8.0 | 2.2718 | 362,959,289 |
sift 0.8.0 x64 | 1.5032 | 548,543,715 |
find MS windows 8.1 | too long | too long |

Case-sensitive search, individual run times in seconds:

Utility | Command | Run #1 | Run #2 | Run #3 | Run #4 | Run #5 | Run #6 | Run #7 | Run #8 | Run #9 | Run #10 |
---|---|---|---|---|---|---|---|---|---|---|---|
ag 0.29.1 | binaries\ag\ag -cF "fair game" corpus.txt | 1.707 | 1.708 | 1.704 | 1.708 | 1.711 | 1.701 | 1.708 | 1.704 | 1.71 | 1.703 |
ag 0.31.0 x64 | binaries\ag64\ag -cF "fair game" corpus.txt | 1.987 | 2.051 | 1.967 | 1.967 | 2.045 | 1.96 | 1.956 | 1.969 | 1.967 | 1.958 |
GNU grep 2.5.1 unxutils | binaries\grep1\grep -cF "fair game" corpus.txt | 3.174 | 3.178 | 3.176 | 3.169 | 3.172 | 3.166 | 3.163 | 3.171 | 3.164 | 3.176 |
GNU grep 2.5.4 gnuwin32 | binaries\grep2\grep -cF "fair game" corpus.txt | 2.925 | 2.935 | 2.916 | 2.933 | 2.924 | 2.92 | 2.929 | 2.916 | 2.917 | 2.923 |
GNU grep 2.5.4 msys | binaries\grep3\grep -cF "fair game" corpus.txt | 4.042 | 4.04 | 3.98 | 4.02 | 3.988 | 3.978 | 3.987 | 3.988 | 3.985 | 4 |
GNU grep 2.0d tcharron | binaries\grep4\grep -cF "fair game" corpus.txt | 4.301 | 4.292 | 4.302 | 4.297 | 4.295 | 4.286 | 4.287 | 4.301 | 4.31 | 4.287 |
GNU grep 2.24 cygwin x64 | binaries\grep5\grep -cF "fair game" corpus.txt | 2.828 | 2.762 | 2.716 | 2.723 | 2.776 | 2.711 | 2.708 | 2.71 | 2.703 | 2.706 |
GNU grep 2.3 fender | binaries\grep6\grep -cF "fair game" corpus.txt | 1.05 | 0.997 | 0.996 | 1.008 | 0.997 | 1.003 | 0.995 | 0.998 | 0.995 | 1.008 |
GNU grep 2.4.2 wbin | binaries\grep7\grep -cF "fair game" corpus.txt | 1.167 | 1.004 | 1 | 1.001 | 1.003 | 1 | 1.002 | 1.001 | 1.01 | 1.012 |
GNU grep 2.4.2 msys | binaries\grep8\grep -cF "fair game" corpus.txt | 1.039 | 1.039 | 1.031 | 1.029 | 1.04 | 1.025 | 1.036 | 1.037 | 1.031 | 1.027 |
pt 2.1.2 | binaries\pt\pt /c "fair game" corpus.txt | 6.419 | 5.929 | 5.924 | 5.925 | 5.919 | 5.924 | 5.916 | 5.914 | 5.961 | 5.908 |
pt 2.1.2 x64 | binaries\pt64\pt /c "fair game" corpus.txt | 5.086 | 4.764 | 4.765 | 4.786 | 4.789 | 4.768 | 4.791 | 4.756 | 4.757 | 4.762 |
sift 0.8.0 | binaries\sift\sift -cQ "fair game" corpus.txt | 2.616 | 2.231 | 2.227 | 2.236 | 2.231 | 2.245 | 2.223 | 2.238 | 2.24 | 2.231 |
sift 0.8.0 x64 | binaries\sift64\sift -cQ "fair game" corpus.txt | 1.995 | 1.441 | 1.442 | 1.452 | 1.459 | 1.456 | 1.438 | 1.452 | 1.457 | 1.44 |
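
One way to read the two sets of results together is to divide each tool's case-insensitive average by its case-sensitive average. The sketch below does that for a handful of rows. The averages are copied from the tables in this document (the case-insensitive ones rounded to four decimals), the row names follow the case-sensitive table, and the function and variable names are illustrative.

```python
# Average times in seconds, copied from the tables in this document.
# Case-insensitive values are rounded to four decimals.
CASE_INSENSITIVE = {
    "ag 0.29.1": 1.6919,
    "GNU grep 2.24 cygwin x64": 2.7103,
    "GNU grep 2.3 fender": 1.2257,
    "pt 2.1.2 x64": 224.435,
    "sift 0.8.0 x64": 4.5971,
}
CASE_SENSITIVE = {
    "ag 0.29.1": 1.7064,
    "GNU grep 2.24 cygwin x64": 2.7343,
    "GNU grep 2.3 fender": 1.0047,
    "pt 2.1.2 x64": 4.8024,
    "sift 0.8.0 x64": 1.5032,
}


def slowdown(insensitive, sensitive):
    """Map each tool to (case-insensitive time / case-sensitive time)."""
    return {
        name: insensitive[name] / sensitive[name]
        for name in insensitive
        if name in sensitive
    }


if __name__ == "__main__":
    ratios = slowdown(CASE_INSENSITIVE, CASE_SENSITIVE)
    # Print the tools from most to least affected by case-insensitive matching.
    for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:28s} {ratio:6.2f}x")
```

On these figures, pt x64 is roughly 47 times slower with case-insensitive matching and sift x64 about 3 times slower, whereas ag 0.29.1 is essentially unchanged.
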

Scripts: `measure.cmd` (case insensitive), `measure2.cmd` (case sensitive) and the respective benchmark runners `run-measure.cmd` and `run-measure2.cmd`.

Character count: `cat corpus.txt | wc -m`.

`corpus.txt` was used for the case-sensitive tests (to give comparable and measurable results).

`time`: on Windows, the freeware utility `ptime` was used to measure running time; see here.

Binaries: see the `binaries` folder of this repository.

For a lengthy discussion, see: