ag command runs faster with fewer CPU threads

I am not sure whether this is an ag issue or a more general Linux issue, but I am noticing it with ag. I hope someone has some tips.

I have an Ubuntu-based Linux installed as a VM using VMware Player.

I am testing with 8 GB and 16 GB of memory in the VM; the host has 24 GB total. The host is an AMD Ryzen with 16 threads in total, and its disk is an NVMe SSD. The VM and host disks are defragmented.

When I allocate 4 CPU threads in the VM settings, some commands take longer to run than when I allocate 2 CPU threads.

The files it searches are small: around 23,000 files in total, but only 110 MB.

Notice how the first run of the command always takes longer, and the 2nd and 3rd runs are much faster. With fewer threads, though, it is faster still.

Why is that?

Even worse, if I give the VM 12 out of 16 CPU threads, the command never seems to get cached: it takes the same amount of time every time I repeat it. Is this a CPU cache thing or a memory cache thing?
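One way to tell a memory (page) cache effect apart from anything else might be to watch the kernel's `Cached:` counter in `/proc/meminfo` before and after a run. This is only a rough sketch: `cached_kb` is a hypothetical helper name, and the counter covers the whole system, so other activity in the VM will also move it.

```shell
# cached_kb [FILE]: print the page-cache size in kB from a meminfo-style file.
# Defaults to /proc/meminfo on Linux. Hypothetical helper, not part of ag.
cached_kb() {
    # Match the field name exactly so that "SwapCached:" is not picked up.
    awk '$1 == "Cached:" { print $2 }' "${1:-/proc/meminfo}"
}

# Possible use around a search (path taken from the runs in this post):
#   cached_kb; ag -li 'foo' /my_files >/dev/null; cached_kb
```

If the value grows by something on the order of the 110 MB data set after the first run and stays there, the files are probably being served from the page cache on later runs, which would point away from a CPU cache explanation.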

The same command on a dedicated machine with 8 threads takes 0.300 sec, which is what I consider roughly normal.

Edit: in the VM I have a CPU monitor chart. Whenever there are more threads, the chart never reaches the top (~100% usage), and that is also when repeated commands still run slowly. But with 2 threads, running the command shows CPU usage at 100%, and repeating the command is then much faster. It is as if Linux would not cache that content or command unless the CPU was under significant load. Does it consider it just some light processing work?!

SEARCH WITH 4 CPU THREADS
=========================

time ag -li 'foo' /my_files

real    0m8.045s
user    0m0.169s
sys     0m7.014s

time ag -li 'foo' /my_files

real    0m1.460s
user    0m0.330s
sys     0m1.907s

time ag -li 'foo' /my_files

real    0m1.466s
user    0m0.315s
sys     0m1.882s

SEARCH WITH 2 CPU THREADS
=========================

time ag -li 'foo' /my_files

real    0m4.438s
user    0m0.039s
sys     0m2.679s

time ag -li 'foo' /my_files

real    0m0.368s
user    0m0.069s
sys     0m0.491s

time ag -li 'foo' /my_files

real    0m0.345s
user    0m0.104s
sys     0m0.429s
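
The timings above can be broken down a little further. The sketch below takes the real/user/sys values of the second (warm) run from each configuration and prints the total CPU time, the CPU-to-wall-clock ratio, and the share spent in the kernel. `summarize` is a hypothetical helper name, and the arithmetic describes the numbers only; it does not explain their cause.

```shell
# summarize REAL USER SYS: print total CPU seconds, CPU/real ratio, and the
# percentage of CPU time spent in the kernel (sys). Hypothetical helper.
summarize() {
    awk -v r="$1" -v u="$2" -v s="$3" 'BEGIN {
        cpu = u + s
        printf "cpu=%.3f ratio=%.2f sys_share=%.0f%%\n", cpu, cpu / r, 100 * s / cpu
    }'
}

summarize 1.460 0.330 1.907   # second run, 4 CPU threads
summarize 0.368 0.069 0.491   # second run, 2 CPU threads
```

Both warm runs keep roughly 1.5 CPUs busy and spend most of their CPU time in sys, but the 4-thread VM burns about four times as much CPU time in total (about 2.24 s against 0.56 s) for the same search. That would be consistent with extra kernel-side or virtualization overhead rather than slower matching, although these numbers alone cannot confirm it.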

Interesting observation. I can corroborate this with my own set of tests. Parallelism should be fairly good when recursing a large directory tree, even when only some of the files are searched.

For example, a recursive search for #include "..." in the directory tree from the Qt 5.9.2 root, restricted to .h, .hpp, and .cpp files only, shows that 2 threads is optimal on my Mac (2.9 GHz Intel Core i7, 16 GB 2133 MHz LPDDR3):

% /usr/bin/time ag -ro '#[[:space:]]*include[[:space:]]+"[^"]+"' -G '.*\.(h|hpp|cpp)' | wc -l
        0.45 real         0.39 user         0.55 sys
    4475
% /usr/bin/time ag --workers 1 -ro '#[[:space:]]*include[[:space:]]+"[^"]+"' -G '.*\.(h|hpp|cpp)' | wc -l
        0.32 real         0.28 user         0.27 sys
    4475
% /usr/bin/time ag --workers 2 -ro '#[[:space:]]*include[[:space:]]+"[^"]+"' -G '.*\.(h|hpp|cpp)' | wc -l
        0.29 real         0.30 user         0.31 sys
    4475
% /usr/bin/time ag --workers 4 -ro '#[[:space:]]*include[[:space:]]+"[^"]+"' -G '.*\.(h|hpp|cpp)' | wc -l
        0.31 real         0.33 user         0.37 sys
    4475
% /usr/bin/time ag --workers 6 -ro '#[[:space:]]*include[[:space:]]+"[^"]+"' -G '.*\.(h|hpp|cpp)' | wc -l
        0.41 real         0.38 user         0.50 sys
    4475

I picked the best times. The best of these is 0.29 s, with two workers.
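
Picking the best time by hand can be scripted. The following is a minimal bash sketch, not the procedure actually used for the numbers above: `best_of` and `sweep_workers` are hypothetical helper names, the run count of 5 is arbitrary, and the pattern and directory are placeholders supplied by the caller. Only the `--workers` and `-l` flags are taken from the commands in this thread.

```shell
#!/usr/bin/env bash
# best_of RUNS CMD [ARGS...]: run CMD RUNS times and print the smallest
# wall-clock time in seconds. Hypothetical helper, not part of ag.
best_of() {
    local runs=$1; shift
    local TIMEFORMAT='%R'   # bash's `time` keyword then prints only elapsed seconds
    local i
    for ((i = 0; i < runs; i++)); do
        # The command's own output is discarded; only the timing line is kept.
        { time "$@" >/dev/null 2>&1; } 2>&1
    done | LC_ALL=C sort -n | head -n 1
}

# sweep_workers PATTERN DIR: report the best of 5 runs per worker count.
sweep_workers() {
    local pattern=$1 dir=$2 w
    for w in 1 2 4 6; do
        printf '%s workers: %s s\n' "$w" \
            "$(best_of 5 ag --workers "$w" -l "$pattern" "$dir")"
    done
}
```

Taking the minimum of several runs filters out cold-cache and scheduling noise, which matters here because the differences between worker counts are only a few hundredths of a second.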

This is surprising. We're searching 57% of the files (2446 C++ source files out of 4256 files in total), for which workers should be spawned. Clearly, ag does a poor job of farming work out to the workers. Another likely cause is mutex locking, e.g. when allocating memory, or serialized I/O.
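
To make the overhead visible, the user and sys columns from the runs above can be summed per worker count. This is a small sketch over the numbers already posted; `cpu_seconds` is a hypothetical helper name, and "default" stands for the run without `--workers`.

```shell
# cpu_seconds: read lines of "WORKERS REAL USER SYS" (seconds) and print
# "WORKERS REAL USER+SYS". Hypothetical helper for the timings posted above.
cpu_seconds() {
    awk '{ printf "%s %.2f %.2f\n", $1, $2, $3 + $4 }'
}

cpu_seconds <<'EOF'
default 0.45 0.39 0.55
1 0.32 0.28 0.27
2 0.29 0.30 0.31
4 0.31 0.33 0.37
6 0.41 0.38 0.50
EOF
```

Total CPU time rises steadily from 0.55 s with one worker to 0.88 s with six, while wall-clock time barely improves and then gets worse. That pattern fits the contention explanation, but it does not by itself identify which lock or I/O path is responsible.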

This performance test case is from ugrep test T8.

ggreer / the_silver_searcher

ag command runs faster with fewer CPU threads #1352