intel / idxd-config

Accel-config / libaccel-config
Other
56 stars 35 forks source link

Provide an option to run DSA, IAA tests without device reset #38

Closed ozhuraki closed 7 months ago

mythi commented 1 year ago

@ozhuraki same as #36?

ozhuraki commented 1 year ago

@mythi This is related but slighty different, i.e. when tests are run on an already configured wq, the device is reset which trigger EROFS in some cases (kernel version, etc): https://github.com/intel/intel-device-plugins-for-kubernetes/blob/main/demo/accel-config-demo/idxd-reset.patch

ramesh-thomas commented 1 year ago

@xinzhanz please see whether this can be changed.

xinzhanz commented 1 year ago

Don't find the difference when apply this patch. When there is wq is enabled, run make check libaccfg, test is skip whatever the patch is added or not.

mythi commented 9 months ago

Without the patch, this is what we see:

# ./dsa_test -w 1 -l 1048576 -o 0x10 -f 0x1 t200 -d dsa4/wq2.0
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 0: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 10: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 12: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 14: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 2: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 4: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 6: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 8: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 1: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 11: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 13: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 15: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 3: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 5: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 7: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 9: add_dev() failed
[error] No usable wq found

when adding the patch, everything works OK:

# ./dsa_test -w 1 -l 1048576 -o 0x10 -f 0x1 t200 -d dsa4/wq2.0
[ info] alloc wq 1 shared size 16 addr 0x7f2328f76000 batch sz 0x400 xfer sz 0x80000000
[ info] testmemory: opcode 16 len 0x100000 tflags 0x1 num_desc 1
[ info] preparing descriptor for crcgen
[ info] Submitted all crcgen jobs
[ info] verifying task result for 0x55c7f7b1ecf0
expected crc = 1419fe20
mythi commented 9 months ago

accel-config version v4.1, linux 6.2

mythi commented 9 months ago

@xinzhanz any suggestions we could try or what might be missing?

xinzhanz commented 9 months ago

As I said, I didn't find the function difference. You can ask Ramesh to merge it.

mythi commented 9 months ago

Do we know what's causing the error we see when running these tests in a container?

mythi commented 8 months ago

Do we know what's causing the error we see when running these tests in a container?

@xinzhanz @ramesh-thomas any suggestions what to try?

xinzhanz commented 8 months ago

Could you remove "-w 1" and try, ./dsa_test -l 1048576 -o 0x10 -f 0x1 t200 -d dsa4/wq2.0? If it is ok, please send your patch out for merging.

ramesh-thomas commented 8 months ago

@mythi you can either create a pull request or send the patch to accel-config@lists.linuxfoundation.org after checking the above.

mythi commented 8 months ago

I'm not comfortable sending the patch because it just ignores the error and we don't know why the error happens. Is there some writing to sysfs during the tests?

ramesh-thomas commented 8 months ago

I'm not comfortable sending the patch because it just ignores the error and we don't know why the error happens. Is there some writing to sysfs during the tests?

Yes, I agree. I did not know the patch was the one linked above adding a check for EROFS

Can you please check the cmd_status attribute? ls -l /sys/bus/dsa/devices/dsa0/cmd_status

This has never happened. Are you having a very old kernel and idxd driver?

Thanks

mythi commented 8 months ago

This has never happened. Are you having a very old kernel and idxd driver?

my log was from 6.2. Is that old? :-) One reason I can think of is is that sysfs is often mounted read-only to containers. That's the reason I was asking if the tests need to write anything there.

ramesh-thomas commented 8 months ago

Yes, it clears the command status attribute at the start. How can anything be configured if the attributes are made read-only?

mythi commented 8 months ago

WQ setup is done separately by the node admin. The app container just uses the pre-configured queues

mythi commented 8 months ago

WQ setup is done separately by the node admin. The app container just uses the pre-configured queues

@ramesh-thomas

What:       /sys/bus/dsa/devices/dsa<m>/cmd_status
Date:       Aug 28, 2020
KernelVersion:  5.10.0
Contact:    dmaengine@vger.kernel.org
Description:    The last executed device administrative command's status/error.
        Also last configuration error overloaded.
        Writing to it will clear the status.

suggests that the apps are not supposed to write to it. Is my understanding correct?

ramesh-thomas commented 8 months ago

Description: The last executed device administrative command's status/error. Also last configuration error overloaded. Writing to it will clear the status.

suggests that the apps are not supposed to write to it. Is my understanding correct?

I did not understand your question. Writing to it clears the status as the description states. We need to clear the status to avoid reading any stale value.

mythi commented 8 months ago

We need to clear the status to avoid reading any stale value.

My question was if these 'device administrative commands' are related to WQ config which in our case are not done by the application container

mythi commented 8 months ago

I've not looked deep into the code. Is the cmd_status writing done in the test code? Can it be done conditionally or when the reset is really needed (read returns an error).

ramesh-thomas commented 7 months ago

Can you please try the latest in pending branch? I have added a fix.

mythi commented 7 months ago

Can you please try the latest in pending branch? I have added a fix.

OK, checking.

mythi commented 7 months ago

Can you please try the latest in pending branch? I have added a fix.

@ramesh-thomas our CI gives green with the patch so it seems to work well, thanks a lot!

ramesh-thomas commented 7 months ago

Fix added in https://github.com/intel/idxd-config/releases/tag/accel-config-v4.1.4