libnxz / power-gzip

POWER NX zlib compliant library
23 stars 18 forks source link

test_multithread_stress failing #198

Open kif opened 1 year ago

kif commented 1 year ago

Dear developers,

We are facing some issues when using threads together with hardware gzip-compression. This is consistent with the failing of the test called test_multithread_stress in the test suite of the user-space library. Computer used are Power9 server (AC922), with 2 processors running Ubuntu 20.04, thus with a kernel 5.4. Other hardware equipment (GPFS filesystem) prevents us form upgrading the kernel to a more recent version. Thus the kernel has been patched to backport the hardware compression with this series of patches: https://github.com/tuliom/linux-debian/tree/nxgzip/5.6/debian/patches/features/powerpc/nxgzip

I wonder if there is a bug in the user-space library or if some locks are missing in the patch we applied ... test-suite.log

Thanks in advance for your guidance.

Jerome

kif commented 1 year ago

I just tried to change the kernel of (one of) the computer, upgrading the 5.4+NX patch with 5.10 in a configuration provided by debian11. There, all tests passe (but GPFS is broken, this is why we would like to stick to the version 5.4).

This means there are some code differences between 5.4 and 5.10 which enables the multithreaded hardware compression ... or maybe a kernel option (but it is unlikely). Does anybody have an idea of what this difference could come from and build a patch out of it ?

Thanks for your lights.

RajalakshmiSR commented 1 year ago

@hmyneni Do you have any comments on this?

hmyneni commented 1 year ago

Assume I gave that kernel build applying patches on Ubuntu 20.04. But we have several fixes made in to upstream later.

Without knowing the actual reason of failure from test case, not sure what caused the issue. But as you mentioned working with 5.10 kernel, the following patches should be applied (commits from the upstream tree):

commit 45591da765885f7320a111d290b3a28a23eed359 -- Just build issue Author: Stephen Rothwell sfr@canb.auug.org.au Date: Wed Apr 22 15:41:29 2020 +1000

powerpc/vas: Include linux/types.h in uapi/asm/vas-api.h

commit f5678e7f2ac31c270334b936352f0ef2fe7dd2b3 -- documentation fix Author: Christoph Hellwig hch@lst.de Date: Wed Jun 10 18:42:06 2020 -0700

kernel: better document the use_mm/unuse_mm API contract

commit 6068e1a4427e88f5cc62f238d1baf94a8b824ef4 Author: Haren Myneni haren@linux.ibm.com Date: Fri Jul 10 16:47:19 2020 -0700

powerpc/vas: Report proper error code for address translation failure

commit f0479c4bcbd92d1a457d4a43bcab79f29d11334a Author: Haren Myneni haren@linux.ibm.com Date: Fri Jul 10 16:49:58 2020 -0700

selftests/powerpc: Use proper error code to check fault address

Mostly commit 6068e1a4427e88f5cc62f238d1baf94a8b824ef4 should fix this issue as it returns CSB_CC_FAULT_ADDRESS (250) upon NX fault which is used by the user space.

commit description: "powerpc/vas: Report proper error code for address translation failure

P9 DD2 NX workbook (Table 4-36) says DMA controller uses CC=5
internally for translation fault handling. NX reserves CC=250 for
OS to notify user space when NX encounters address translation
failure on the request buffer. Not an issue in earlier releases
as NX does not get faults on kernel addresses.

This patch defines CSB_CC_FAULT_ADDRESS(250) and updates CSB.CC with
this proper error code for user space."

But please apply all listed patches so that sync with upstream version (powerNV support) and should not cause with the 5.4 build.

kif commented 1 year ago

Thanks a lot, it looks some synchronization stuff was added to the kernel in f5678e7f2ac31c270334b936352f0ef2fe7dd2b3 I am patching the kernel to try to make it work

kif commented 1 year ago

It works! The code is not yet clean but tests are passing. I am waiting the feed-back from the production team to try to summarize here what I did. Basically the kernel 5.8 introduced some locks for i/o but by 5.10 those locks where moved in different files which changes very much the structure of the kernel compared to kernel 5.4. so I did backport those function with signature compatible with 5.4...

alejandrohomsp commented 1 year ago

Hello, I am a colleague working with @kif. I'd just like to thank you for the information. I confirm that indeed the patch 6068e1a4 fixed the issue reported by @kif, which was also seen on similar computers running DAQ software.

kif commented 1 year ago

I have summarized all patches we use to enable "fully" the power-gzip at:

https://github.com/kif/power-gzip/tree/kernel-patches/kernel-patches/5.4-ubuntu-20.04

Do you want me to open a pull-request ?