I can reproduce the issue on a P10 LPAR with:
~/src/power-gzip/samples$ gcc -g -O3 -o zpipe-repeat-test zpipe-repeat-test.c ../lib/libnxz.a -ldl
~/src/power-gzip/samples$ echo "not so useful text" > junk
~/src/power-gzip/samples$ tar -cf junk.tar junk
~/src/power-gzip/samples$ ./zpipe-repeat-test < junk.tar > junk.Z
Looks like the code is actually progressing, but at a really slow pace. perf indicates it spends ~82% of the time on the close call from nx_function_end.
Related info: the test runs 1M iterations; I reduced it to 1k to force it to end sooner. It is MUCH slower on the P10 LPAR.
P9 baremetal:
$ time ./zpipe-repeat-test < junk.tar > junk.Z
deflate 1000 times
real 0m0.071s
user 0m0.003s
sys 0m0.066s
P10 LPAR:
$ time ./zpipe-repeat-test < junk.tar > junk.Z
iteration 0
deflate 1000 times
real 0m33.233s
user 0m0.126s
sys 0m0.000s
Let me correct my last comment: the program seems to be spending >80% of the time in the kernel, both for deflateInit and deflateEnd. Most of it is in vas_allocate_window and vas_deallocate_window.
There should be a hack here, or thereabouts, so that nx_function_begin and nx_function_end are done only once... https://github.com/libnxz/power-gzip/blob/master/lib/nx_zlib.c#L492
Not claiming it is the best hack, or even correct, but it used to work for me.
The problem is, as you observed, those expensive system calls. The idea behind the so-called hack was to keep the vas window open once it had been opened, and therefore pay the price only once. Subsequent inflateInit/deflateInit calls would then return quickly, reusing the existing open window. I am guessing this is where both the problem and the solution are.
If so, this is probably the same as #30. I'll give the hack a try next week.
For each window open (vas_allocate_window()), 2 HCALLs (open and modify window) are issued and an IRQ is set up, whereas 1 HCALL is issued for window close (vas_deallocate_window()). So the main difference from PowerNV is these HCALLs to the hypervisor. I think these HCALLs are taking more time, and we need to measure their performance to find the bottleneck. It looks like the deallocate window HCALL is consuming more time.
Couple notes on this subject:
The main idea was to keep the vas window open within the same PID once it had been opened. If the application repeatedly does inflateInit/End, or the same with deflate, the hack keeps the vas window open to eliminate the overhead. You'd expect apps to do the Init/End calls once, but some don't, and even zlib itself doesn't; for example, see the compress() and uncompress() functions, which return no handle to keep around. Therefore, to eliminate the overhead there was a need to keep the vas window open under the covers for as long as the PID existed (see the sketch after these notes), e.g. https://github.com/libnxz/power-gzip/blob/master/lib/nx_zlib.c#L75
If the process forked, the kernel did not copy the open vas handle to the new process, as I recall. Therefore, there is a fork check here https://github.com/libnxz/power-gzip/blob/master/lib/nx_zlib.c#L492 which does a real open if the process has forked.
I tried to implement mutual exclusion for the case where multiple threads do the open and close here https://github.com/libnxz/power-gzip/blob/master/lib/nx_zlib.c#L490. I cannot say that the use of atomic_fetch_add is the right approach, or even correct. I was trying to minimize the inflateInit/deflateInit overheads as much as possible (important for small-block performance), and I was under the impression that pthread-based exclusion overhead would be higher than an atomic op, which maps to a single instruction, but perhaps that is a false impression.
Finally, we made this cached vas handle behavior configurable here https://github.com/libnxz/power-gzip/blob/master/lib/nx_zlib.c#L488. Trying to refresh my memory, I think the main reason was that when a process migrates from one chip to another, the reused vas window pins the process to the nx-gzip on the original chip where the open was made. Functionally there is no problem; it still works. But it causes imbalanced nx-gzip use, as well as degraded performance from reaching over the numa links. For example, a process runs on chip 1 and its memory is also allocated on chip 1, but since it is connected to the nx-gzip on chip 0, the input/output to nx-gzip must cross two chips.
The handle reuse optimization is for small data blocks. For large data blocks, if the user disables this optimization (which eliminates the numa problem), the performance may be OK, because the open/close overhead becomes a smaller part of the overall run time.
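For illustration, here is a minimal C sketch of the caching scheme described in these notes. It is not the actual nx_zlib.c code: the names nx_cached_open, nx_cached_close, nx_real_open, nx_real_close, cached_devp, cached_pid and opener_count are hypothetical stand-ins, and the real logic lives around the nx_zlib.c lines linked above.

```c
/*
 * Minimal sketch of the per-PID cached vas window idea described above.
 * All names here are hypothetical stand-ins for the real nx_zlib.c code.
 */
#include <stdatomic.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

static _Atomic(void *) cached_devp;   /* window kept open for the life of the PID */
static _Atomic pid_t   cached_pid;    /* PID that opened it, used to detect fork() */
static atomic_int      opener_count;  /* atomic guard instead of a pthread mutex */

/* Stand-ins for the expensive calls that end up in vas_allocate_window() /
 * vas_deallocate_window() in the kernel. */
extern void *nx_real_open(void);
extern void  nx_real_close(void *devp);

void *nx_cached_open(void)
{
    pid_t me = getpid();
    void *devp = atomic_load(&cached_devp);

    /* Fast path: this process already opened a window; reuse it. */
    if (devp != NULL && atomic_load(&cached_pid) == me)
        return devp;

    /* Slow path: first open in this process, or we are a forked child
     * (the open vas handle does not survive fork), so open for real.
     * atomic_fetch_add lets only the first arriving thread refresh the
     * cache; a racing thread just gets its own private window. */
    if (atomic_fetch_add(&opener_count, 1) == 0) {
        devp = nx_real_open();
        atomic_store(&cached_pid, me);
        atomic_store(&cached_devp, devp);
    } else {
        devp = nx_real_open();
    }
    atomic_fetch_sub(&opener_count, 1);
    return devp;
}

void nx_cached_close(void *devp)
{
    /* Keep the cached window open until the process exits;
     * only private (uncached) windows are really closed here. */
    if (devp != atomic_load(&cached_devp))
        nx_real_close(devp);
}
```

The trade-off is the one described in the notes: a thread that loses the race just pays for one private open, which is no worse than the behavior without the cache.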
By the way, I should have a load balancing patch in one of the branches. Basically, the library opens a vas window on all chips. In round-robin mode, it sends jobs to all nx-gzip units, round-robin as the name implies. The idea is to prevent hot spots, e.g. what if all processes are queued at a single nx-gzip? However, round-robin still pays the numa penalty. In feedback-directed mode, the library measures the delays and steers jobs to the unit that has given the least delay over the past few jobs. That usually means choosing the local unit.
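To make the two modes concrete, here is a rough sketch of how the selection could look, assuming the library has already opened one window per chip at startup. All names (chip_window, nchips, next_unit, avg_latency_us, nx_pick_window_*) are hypothetical; the actual patch in the branch may be structured differently.

```c
/*
 * Sketch of the two load balancing modes: round-robin vs. feedback-directed.
 * Assumes library init (not shown) already opened one vas window per chip.
 */
#include <stdatomic.h>

#define NX_MAX_CHIPS 8                            /* hypothetical upper bound */

static void       *chip_window[NX_MAX_CHIPS];     /* one open window per chip */
static int         nchips;                        /* how many were opened (>= 1) */
static atomic_uint next_unit;                     /* round-robin cursor */
static double      avg_latency_us[NX_MAX_CHIPS];  /* updated after each job */

/* Round-robin: spread jobs over all nx-gzip units to avoid hot spots,
 * at the cost of sometimes crossing numa links. */
void *nx_pick_window_rr(void)
{
    unsigned i = atomic_fetch_add(&next_unit, 1);
    return chip_window[i % nchips];
}

/* Feedback-directed: send the next job to the unit with the lowest
 * recent latency, which in practice tends to be the local unit. */
void *nx_pick_window_feedback(void)
{
    int best = 0;
    for (int i = 1; i < nchips; i++)
        if (avg_latency_us[i] < avg_latency_us[best])
            best = i;
    return chip_window[best];
}
```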
So after reading the code, looks like the hack is already in place, right? I ran the same test on PowerVM but now with VAS window reuse, and it's MUCH faster.
P10 LPAR
~/src/power-gzip/samples> time NX_GZIP_DIS_SAVDEVP=0 NX_GZIP_LOGFILE=$(pwd)/nx.log ./zpipe-repeat-test < junk.tar > junk.Z
deflate 1000000 times
real 0m0.446s
user 0m0.361s
sys 0m0.060s
@abalib do you remember why this optimization was disabled by default?
Ok, so this was actually answered above: the reuse was made configurable because of the nx-gzip imbalance and numa concerns. My bad 🤦🏼
On the GitHub web UI, I clicked on this line https://github.com/libnxz/power-gzip/blame/develop/lib/nx_zlib.c#L76 and then clicked blame. It gives me this commit: https://github.com/libnxz/power-gzip/commit/32dabcb5b25cf86f77ed701fcf11024b3073448f. For whatever reason, Yi Li disabled the optimization by default.
I would enable it unless it breaks some app or hurts performance (i.e. because of nx-gzip hot spots). A heuristic fix might be doing a real open/close every N times, where N is, say, 100 (pick your own number). That way you amortize the overhead of the open/close over 100 inflateInit calls, e.g. 1% overhead instead of 100%. It will also help with the numa problem: suppose the process migrated to another chip. After 99 inflateInit calls the real open/close will happen, and afterwards the library will start using the local nx-gzip.
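A minimal sketch of that amortization heuristic, assuming a cached window as in the earlier sketch; REOPEN_INTERVAL, init_count and the nx_* helpers are hypothetical names, and thread synchronization is omitted for brevity.

```c
/*
 * Sketch of the "real open/close every N times" heuristic.
 * Locking is omitted; all names are hypothetical.
 */
#include <stdatomic.h>
#include <stddef.h>

#define REOPEN_INTERVAL 100          /* amortize one open/close over 100 inits */

static atomic_uint init_count;       /* counts inflateInit/deflateInit calls */
static void       *cached_window;    /* window kept open between Init/End */

extern void *nx_real_open(void);     /* expensive: reaches vas_allocate_window() */
extern void  nx_real_close(void *devp);

void *nx_get_window_amortized(void)
{
    unsigned n = atomic_fetch_add(&init_count, 1);

    /* Every REOPEN_INTERVAL-th call, drop the cached window and reopen.
     * This caps the open/close cost at roughly 1% of Init calls and lets a
     * process that migrated to another chip pick up the now-local nx-gzip. */
    if (cached_window == NULL || (n % REOPEN_INTERVAL) == 0) {
        if (cached_window != NULL)
            nx_real_close(cached_window);
        cached_window = nx_real_open();
    }
    return cached_window;
}
```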
Running make check with NX_GZIP_DIS_SAVDEVP=0 works fine, so that's a good sign.
And this heuristic is a good idea; we could even make it modifiable through the config file to allow some tuning.
@kalshett This has been fixed by #107. The fix is available in the develop branch. samples/zpipe-repeat-test has been refactored and renamed to samples/bench_initend. Now on my P10 LPAR it stops after ~1 min:
$ make -C samples/ bench_initend
[...]
$ LD_PRELOAD=./lib/libnxz.so /usr/bin/time samples/bench_initend
Avg. deflateInit (us),Avg. deflateEnd (us),Avg. inflateInit (us),Avg. inflateEnd (us)
296.887,0.064,296.948,0.039
0.28user 0.00system 0:59.42elapsed 0%CPU (0avgtext+0avgdata 3328maxresident)k
0inputs+0outputs (0major+80minor)pagefaults 0swaps
It is working fine now... We can close this issue.
[root@d87p1 samples]# time ./bench_initend
Avg. deflateInit (us),Avg. deflateEnd (us),Avg. inflateInit (us),Avg. inflateEnd (us)
3.379,6.790,0.050,0.014
real 0m1.027s
user 0m0.150s
sys 0m0.878s
PowerNV:
create 5gb file
dd if=/dev/urandom of=5gb-file bs=1000000000 count=5
dd: warning: partial read (33554431 bytes); suggest iflag=fullblock
0+5 records in
0+5 records out
167772155 bytes (168 MB, 160 MiB) copied, 0.48477 s, 346 MB/s
[root@ltc-wspoon12 samples]# ./zpipe-repeat-test < 5gb-file > junk.Z
deflate 1000000 times
PowerVM:
create 5gb-file (same dd command as PowerNV)
[root@d87p1 samples]# ./zpipe-repeat-test < 5gb-file > junk.Z
It never ends even after 4/5 hrs...