I can reproduce the issue on a P10 LPAR with:
~/src/power-gzip/samples$ gcc -g -O3 -o zpipe-repeat-test zpipe-repeat-test.c ../lib/libnxz.a -ldl
~/src/power-gzip/samples$ echo "not so useful text" > junk
~/src/power-gzip/samples$ tar -cf junk.tar junk
~/src/power-gzip/samples$ ./zpipe-repeat-test < junk.tar > junk.Z
Looks like the code is actually progressing, but at a really slow pace. perf indicates it spends ~82% of the time on the close call from nx_function_end.
Related info: the test runs 1M iterations; I reduced it to 1k to force it to end sooner. It is MUCH slower on the P10 LPAR.
P9 baremetal:
$ time ./zpipe-repeat-test < junk.tar > junk.Z
deflate 1000 times
real 0m0.071s
user 0m0.003s
sys 0m0.066s
P10 LPAR:
$ time ./zpipe-repeat-test < junk.tar > junk.Z
iteration 0
deflate 1000 times
real 0m33.233s
user 0m0.126s
sys 0m0.000s
Let me correct my last comment: the program seems to be spending >80% of the time in the kernel, both for deflateInit and deflateEnd. Most of it is in vas_allocate_window and vas_deallocate_window.
There should be a hack here, or thereabouts, so that nx_function_begin and nx_function_end are done only once... https://github.com/libnxz/power-gzip/blob/master/lib/nx_zlib.c#L492
Not claiming it is the best hack, or even correct, but it used to work for me.
The problem is, as you observed, those expensive system calls. The idea behind the so-called hack was to keep the vas window open once it had been opened, and therefore pay the price only once. Subsequent inflateInit/deflateInit calls would then return quickly, reusing the existing open window. I am guessing this is where both the problem and the solution are.
If so, this is probably the same as #30. I'll give the hack a try next week.
For each window open (vas_allocate_window()), 2 HCALLs (open and modify window) are issued and an IRQ is set up, whereas 1 HCALL is issued for window close (vas_deallocate_window()). So the main difference from PowerNV is these HCALLs to the hypervisor. I think these HCALLs are taking more time, and we need to measure their performance to find the bottleneck. It looks like the deallocate window HCALL is consuming more time.
Couple notes on this subject:
The main idea was to keep the vas window open within the same PID once it had been opened. If the application repeatedly does inflateInit/End, or the same with deflate, the hack keeps the vas window open to eliminate the overhead. You'd expect apps to do the Init/End calls once, but some don't, and even zlib itself doesn't; for example, see the compress() and uncompress() functions, which return no handle to keep around. Therefore, to eliminate the overhead there was a need to keep the vas window open under the covers for as long as the PID existed (see the sketch after these notes), e.g. https://github.com/libnxz/power-gzip/blob/master/lib/nx_zlib.c#L75
If the process forked, the kernel did not copy the open vas handle to the new process, as I recall. Therefore, there is a fork check here https://github.com/libnxz/power-gzip/blob/master/lib/nx_zlib.c#L492 which does a real open if the process has forked.
I tried to implement mutual exclusion for the case where multiple threads do the open and close here https://github.com/libnxz/power-gzip/blob/master/lib/nx_zlib.c#L490. I cannot say that the use of atomic_fetch_add is the right approach, or even correct. I was trying to minimize the inflateInit/deflateInit overheads as much as possible (important for small-block performance), and I was under the impression that pthread-based exclusion overhead would be higher than an atomic op, which maps to a single instruction, but perhaps that is a false impression.
Finally, we made this cached vas handle behavior configurable here https://github.com/libnxz/power-gzip/blob/master/lib/nx_zlib.c#L488. Trying to refresh my memory, I think the main reason was that when a process migrates from one chip to another, the reused vas window pins the process to the nx-gzip on the original chip where the open was made. Functionally there is no problem; it still works. But it causes imbalanced nx-gzip use, as well as degraded performance from reaching over the numa links. For example, a process runs on chip 1 and its memory is also allocated on chip 1, but since it is connected to the nx-gzip on chip 0, the input/output to nx-gzip must cross two chips.
The handle reuse optimization is for small data blocks. For large data blocks, if the user disables this optimization (which eliminates the numa problem), the performance may be OK, because the open/close overhead becomes a smaller part of the overall run time.
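For illustration, here is a minimal C sketch of the caching scheme described in these notes. It is not the actual nx_zlib.c code: the names nx_cached_open, nx_cached_close, nx_real_open, nx_real_close, cached_devp, cached_pid and opener_count are hypothetical stand-ins, and the real logic lives around the nx_zlib.c lines linked above.

```c
/*
 * Minimal sketch of the per-PID cached vas window idea described above.
 * All names here are hypothetical stand-ins for the real nx_zlib.c code.
 */
#include <stdatomic.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

static _Atomic(void *) cached_devp;   /* window kept open for the life of the PID */
static _Atomic pid_t   cached_pid;    /* PID that opened it, used to detect fork() */
static atomic_int      opener_count;  /* atomic guard instead of a pthread mutex */

/* Stand-ins for the expensive calls that end up in vas_allocate_window() /
 * vas_deallocate_window() in the kernel. */
extern void *nx_real_open(void);
extern void  nx_real_close(void *devp);

void *nx_cached_open(void)
{
    pid_t me = getpid();
    void *devp = atomic_load(&cached_devp);

    /* Fast path: this process already opened a window; reuse it. */
    if (devp != NULL && atomic_load(&cached_pid) == me)
        return devp;

    /* Slow path: first open in this process, or we are a forked child
     * (the open vas handle does not survive fork), so open for real.
     * atomic_fetch_add lets only the first arriving thread refresh the
     * cache; a racing thread just gets its own private window. */
    if (atomic_fetch_add(&opener_count, 1) == 0) {
        devp = nx_real_open();
        atomic_store(&cached_pid, me);
        atomic_store(&cached_devp, devp);
    } else {
        devp = nx_real_open();
    }
    atomic_fetch_sub(&opener_count, 1);
    return devp;
}

void nx_cached_close(void *devp)
{
    /* Keep the cached window open until the process exits;
     * only private (uncached) windows are really closed here. */
    if (devp != atomic_load(&cached_devp))
        nx_real_close(devp);
}
```

The trade-off is the one described in the notes: a thread that loses the race just pays for one private open, which is no worse than the behavior without the cache.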
By the way, I should have a load balancing patch in one of the branches. Basically, the library opens a vas window on all chips. In round-robin mode, it sends jobs to all nx-gzip units, round-robin as the name implies. The idea is to prevent hot spots, e.g. what if all processes are queued at a single nx-gzip? However, round-robin still pays the numa penalty. In feedback-directed mode, the library measures the delays and steers jobs to the unit that has given the least delay over the past few jobs. That usually means choosing the local unit.
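To make the two modes concrete, here is a rough sketch of how the selection could look, assuming the library has already opened one window per chip at startup. All names (chip_window, nchips, next_unit, avg_latency_us, nx_pick_window_*) are hypothetical; the actual patch in the branch may be structured differently.

```c
/*
 * Sketch of the two load balancing modes: round-robin vs. feedback-directed.
 * Assumes library init (not shown) already opened one vas window per chip.
 */
#include <stdatomic.h>

#define NX_MAX_CHIPS 8                            /* hypothetical upper bound */

static void       *chip_window[NX_MAX_CHIPS];     /* one open window per chip */
static int         nchips;                        /* how many were opened (>= 1) */
static atomic_uint next_unit;                     /* round-robin cursor */
static double      avg_latency_us[NX_MAX_CHIPS];  /* updated after each job */

/* Round-robin: spread jobs over all nx-gzip units to avoid hot spots,
 * at the cost of sometimes crossing numa links. */
void *nx_pick_window_rr(void)
{
    unsigned i = atomic_fetch_add(&next_unit, 1);
    return chip_window[i % nchips];
}

/* Feedback-directed: send the next job to the unit with the lowest
 * recent latency, which in practice tends to be the local unit. */
void *nx_pick_window_feedback(void)
{
    int best = 0;
    for (int i = 1; i < nchips; i++)
        if (avg_latency_us[i] < avg_latency_us[best])
            best = i;
    return chip_window[best];
}
```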
So after reading the code, looks like the hack is already in place, right? I ran the same test on PowerVM but now with VAS window reuse, and it's MUCH faster.
P10 LPAR
~/src/power-gzip/samples> time NX_GZIP_DIS_SAVDEVP=0 NX_GZIP_LOGFILE=$(pwd)/nx.log ./zpipe-repeat-test < junk.tar > junk.Z
deflate 1000000 times
real 0m0.446s
user 0m0.361s
sys 0m0.060s
@abalib do you remember why this optimization was disabled by default?
Ok, so this was actually answered above: the reuse was made configurable because of the nx-gzip imbalance and numa concerns. My bad 🤦🏼
On the GitHub web UI, I clicked on this line https://github.com/libnxz/power-gzip/blame/develop/lib/nx_zlib.c#L76 and then clicked blame. It gives me this commit: https://github.com/libnxz/power-gzip/commit/32dabcb5b25cf86f77ed701fcf11024b3073448f. For whatever reason, Yi Li disabled the optimization by default.
I would enable it unless it breaks some app or hurts performance (i.e. because of nx-gzip hot spots). A heuristic fix might be doing a real open/close every N times, where N is, say, 100 (pick your own number). That way you amortize the overhead of the open/close over 100 inflateInit calls, e.g. 1% overhead instead of 100%. It will also help with the numa problem: suppose the process migrated to another chip. After 99 inflateInit calls the real open/close will happen, and afterwards the library will start using the local nx-gzip.
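A minimal sketch of that amortization heuristic, assuming a cached window as in the earlier sketch; REOPEN_INTERVAL, init_count and the nx_* helpers are hypothetical names, and thread synchronization is omitted for brevity.

```c
/*
 * Sketch of the "real open/close every N times" heuristic.
 * Locking is omitted; all names are hypothetical.
 */
#include <stdatomic.h>
#include <stddef.h>

#define REOPEN_INTERVAL 100          /* amortize one open/close over 100 inits */

static atomic_uint init_count;       /* counts inflateInit/deflateInit calls */
static void       *cached_window;    /* window kept open between Init/End */

extern void *nx_real_open(void);     /* expensive: reaches vas_allocate_window() */
extern void  nx_real_close(void *devp);

void *nx_get_window_amortized(void)
{
    unsigned n = atomic_fetch_add(&init_count, 1);

    /* Every REOPEN_INTERVAL-th call, drop the cached window and reopen.
     * This caps the open/close cost at roughly 1% of Init calls and lets a
     * process that migrated to another chip pick up the now-local nx-gzip. */
    if (cached_window == NULL || (n % REOPEN_INTERVAL) == 0) {
        if (cached_window != NULL)
            nx_real_close(cached_window);
        cached_window = nx_real_open();
    }
    return cached_window;
}
```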
Running make check with NX_GZIP_DIS_SAVDEVP=0 works fine, so that's a good sign.
And this heuristic is a good idea; we could even make it modifiable through the config file to allow some tuning.
@kalshett This has been fixed by #107. The fix is available in the develop branch. samples/zpipe-repeat-test has been refactored and renamed to samples/bench_initend. Now on my P10 LPAR it stops after ~1 min:
$ make -C samples/ bench_initend
[...]
$ LD_PRELOAD=./lib/libnxz.so /usr/bin/time samples/bench_initend
Avg. deflateInit (us),Avg. deflateEnd (us),Avg. inflateInit (us),Avg. inflateEnd (us)
296.887,0.064,296.948,0.039
0.28user 0.00system 0:59.42elapsed 0%CPU (0avgtext+0avgdata 3328maxresident)k
0inputs+0outputs (0major+80minor)pagefaults 0swaps
It is working fine now... We can close this issue.
[root@d87p1 samples]# time ./bench_initend
Avg. deflateInit (us),Avg. deflateEnd (us),Avg. inflateInit (us),Avg. inflateEnd (us)
3.379,6.790,0.050,0.014
real 0m1.027s
user 0m0.150s
sys 0m0.878s
PowerNV:
create 5gb file
dd if=/dev/urandom of=5gb-file bs=1000000000 count=5
dd: warning: partial read (33554431 bytes); suggest iflag=fullblock
0+5 records in
0+5 records out
167772155 bytes (168 MB, 160 MiB) copied, 0.48477 s, 346 MB/s
[root@ltc-wspoon12 samples]# ./zpipe-repeat-test < 5gb-file > junk.Z
deflate 1000000 times
PowerVM:
create 5gb-file (same dd command as PowerNV)
[root@d87p1 samples]# ./zpipe-repeat-test < 5gb-file > junk.Z
It never ends even after 4/5 hrs...