adrianTNT opened this issue 4 years ago.
Interesting observation. I can corroborate this with my own set of tests. Parallelism should be fairly good when recursing a large directory tree, even when only a subset of the files is searched.
For example, a recursive search for #include "..." in the directory tree from the Qt 5.9.2 root, restricted to .h, .hpp, and .cpp files, shows that 2 threads is optimal on my Mac (2.9 GHz Intel Core i7, 16 GB 2133 MHz LPDDR3):
% /usr/bin/time ag -ro '#[[:space:]]*include[[:space:]]+"[^"]+"' -G '.*\.(h|hpp|cpp)' | wc -l
0.45 real 0.39 user 0.55 sys
4475
% /usr/bin/time ag --workers 1 -ro '#[[:space:]]*include[[:space:]]+"[^"]+"' -G '.*\.(h|hpp|cpp)' | wc -l
0.32 real 0.28 user 0.27 sys
4475
% /usr/bin/time ag --workers 2 -ro '#[[:space:]]*include[[:space:]]+"[^"]+"' -G '.*\.(h|hpp|cpp)' | wc -l
0.29 real 0.30 user 0.31 sys
4475
% /usr/bin/time ag --workers 4 -ro '#[[:space:]]*include[[:space:]]+"[^"]+"' -G '.*\.(h|hpp|cpp)' | wc -l
0.31 real 0.33 user 0.37 sys
4475
% /usr/bin/time ag --workers 6 -ro '#[[:space:]]*include[[:space:]]+"[^"]+"' -G '.*\.(h|hpp|cpp)' | wc -l
0.41 real 0.38 user 0.50 sys
4475
I picked the best times. The best overall is 0.29 s real, with two workers.
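Picking the best of several runs, as in the timings above, can be scripted so that every worker count is measured the same way. The sketch below assumes bash (zsh configures its time keyword differently, through TIMEFMT); best_of is a hypothetical helper name, not part of ag. It times each run with bash's built-in time keyword and prints the fastest real time in seconds.

```shell
#!/usr/bin/env bash
# best_of RUNS CMD [ARGS...]   (hypothetical helper, not an ag feature)
# Runs CMD RUNS times and prints the fastest wall-clock (real) time in
# seconds, measured with bash's built-in `time` keyword.
best_of() {
  local runs=$1
  shift
  # %3R = elapsed real time with millisecond precision, e.g. "0.290"
  local TIMEFORMAT=%3R
  local i
  for ((i = 0; i < runs; i++)); do
    # The command's own stdout/stderr go to /dev/null; `time` reports on the
    # group's stderr, which is redirected into the pipe so sort can see it.
    { time "$@" >/dev/null 2>&1; } 2>&1
  done | sort -n | head -n 1
}

# Example (run from the Qt source root; requires ag):
#   pat='#[[:space:]]*include[[:space:]]+"[^"]+"'
#   for w in 1 2 4 6; do
#     echo "workers=$w best=$(best_of 5 ag --workers "$w" -ro "$pat" -G '.*\.(h|hpp|cpp)')s"
#   done
```

Taking the minimum rather than the mean filters out one-off disturbances, such as a cold file cache on the first run, which suits a comparison between worker counts; the trade-off is that it hides how often the slow case occurs.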
This is surprising. We're searching 57% of the files (2446 C++ source files out of 4256 files in total), and each of those should be handed to a worker. Clearly, ag does a poor job of farming out work to its workers. Another likely cause is mutex locking, e.g. when allocating memory, or serialized I/O.
This performance test case is from ugrep test T8.
adrianTNT wrote:
I am not sure if this is an ag issue or a more general Linux issue, but I am noticing it with ag, so I hope someone has some tips. I have an Ubuntu-based Linux installed as a VM using VMware Player.
When I allocate 4 CPU threads in the VM settings, some commands take longer to run than when I allocate 2 CPU threads.
The files it searches are small: around 23 000 files in total, taking up only 110 MB.
Notice how the first run of the command always takes longer, and the 2nd and 3rd runs are much faster, but with fewer threads it is even faster.
Why is that?
Even worse, if I give the VM 12 of 16 CPU threads, the command never seems to get cached: it takes the same amount of time every time I repeat it. Is this a CPU cache thing or a memory cache thing?
The same command on a dedicated machine with 8 threads takes 0.300 s, which is what I would consider roughly normal.
Edit: in the VM I have a CPU monitor chart. Whenever more threads are allocated, the chart never reaches the top (~100% usage), and that is also when repeated commands still run slowly. But with 2 threads, running the command shows CPU usage at 100%, and repeating the command is then much faster. It is as if Linux would not cache that content or command unless the CPU was under significant load, as though it considered this just light processing work?!
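To tell whether the faster second and third runs come from the Linux page cache (file contents kept in RAM) rather than from the CPU cache, one option is to empty the page cache and then compare a cold run with a warm one while watching the Cached: figure in /proc/meminfo. A minimal sketch for a Linux guest, assuming sudo access; cached_kb and drop_page_cache are hypothetical helper names, and PATTERN is a placeholder for the actual search.

```shell
#!/usr/bin/env bash
# Linux-only helpers (hypothetical names) for separating a page-cache effect
# from a CPU-cache effect.

# Prints the page-cache size in kB, taken from the "Cached:" line of /proc/meminfo.
cached_kb() {
  awk '$1 == "Cached:" { print $2 }' /proc/meminfo
}

# Writes back dirty pages, then asks the kernel to drop the page cache plus
# dentries and inodes (needs root), so the next search reads from disk again.
drop_page_cache() {
  sync
  echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null
}

# Example (inside the VM, in the searched directory; PATTERN is a placeholder):
#   drop_page_cache
#   before=$(cached_kb)
#   time ag PATTERN >/dev/null    # cold run: files are read from disk
#   echo "page cache grew by $(( $(cached_kb) - before )) kB"
#   time ag PATTERN >/dev/null    # warm run: files should come from RAM
```

If the cold run is slow, Cached: grows noticeably, and the warm run is fast, the speed-up is the page cache. If repeated runs stay slow while Cached: no longer grows, the files are most likely already in RAM, and the extra time is probably being spent somewhere other than reading files from disk.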