Closed ozhuraki closed 7 months ago
@mythi This is related but slighty different, i.e. when tests are run on an already configured wq, the device is reset which trigger EROFS in some cases (kernel version, etc): https://github.com/intel/intel-device-plugins-for-kubernetes/blob/main/demo/accel-config-demo/idxd-reset.patch
@xinzhanz please see whether this can be changed.
Don't find the difference when apply this patch. When there is wq is enabled, run make check libaccfg, test is skip whatever the patch is added or not.
Without the patch, this is what we see:
# ./dsa_test -w 1 -l 1048576 -o 0x10 -f 0x1 t200 -d dsa4/wq2.0
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 0: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 10: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 12: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 14: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 2: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 4: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 6: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 8: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 1: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 11: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 13: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 15: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 3: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 5: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 7: add_dev() failed
libaccfg: add_device: Failed resetting cmd status -30
libaccfg: __sysfs_device_parse: 9: add_dev() failed
[error] No usable wq found
when adding the patch, everything works OK:
# ./dsa_test -w 1 -l 1048576 -o 0x10 -f 0x1 t200 -d dsa4/wq2.0
[ info] alloc wq 1 shared size 16 addr 0x7f2328f76000 batch sz 0x400 xfer sz 0x80000000
[ info] testmemory: opcode 16 len 0x100000 tflags 0x1 num_desc 1
[ info] preparing descriptor for crcgen
[ info] Submitted all crcgen jobs
[ info] verifying task result for 0x55c7f7b1ecf0
expected crc = 1419fe20
accel-config version v4.1, linux 6.2
@xinzhanz any suggestions we could try or what might be missing?
As I said, I didn't find the function difference. You can ask Ramesh to merge it.
Do we know what's causing the error we see when running these tests in a container?
Do we know what's causing the error we see when running these tests in a container?
@xinzhanz @ramesh-thomas any suggestions what to try?
Could you remove "-w 1" and try, ./dsa_test -l 1048576 -o 0x10 -f 0x1 t200 -d dsa4/wq2.0? If it is ok, please send your patch out for merging.
@mythi you can either create a pull request or send the patch to accel-config@lists.linuxfoundation.org after checking the above.
I'm not comfortable sending the patch because it just ignores the error and we don't know why the error happens. Is there some writing to sysfs during the tests?
I'm not comfortable sending the patch because it just ignores the error and we don't know why the error happens. Is there some writing to sysfs during the tests?
Yes, I agree. I did not know the patch was the one linked above adding a check for EROFS
Can you please check the cmd_status attribute? ls -l /sys/bus/dsa/devices/dsa0/cmd_status
This has never happened. Are you having a very old kernel and idxd driver?
Thanks
This has never happened. Are you having a very old kernel and idxd driver?
my log was from 6.2. Is that old? :-) One reason I can think of is is that sysfs
is often mounted read-only to containers. That's the reason I was asking if the tests need to write anything there.
Yes, it clears the command status attribute at the start. How can anything be configured if the attributes are made read-only?
WQ setup is done separately by the node admin. The app container just uses the pre-configured queues
WQ setup is done separately by the node admin. The app container just uses the pre-configured queues
@ramesh-thomas
What: /sys/bus/dsa/devices/dsa<m>/cmd_status
Date: Aug 28, 2020
KernelVersion: 5.10.0
Contact: dmaengine@vger.kernel.org
Description: The last executed device administrative command's status/error.
Also last configuration error overloaded.
Writing to it will clear the status.
suggests that the apps are not supposed to write to it. Is my understanding correct?
Description: The last executed device administrative command's status/error. Also last configuration error overloaded. Writing to it will clear the status.
suggests that the apps are not supposed to write to it. Is my understanding correct?
I did not understand your question. Writing to it clears the status as the description states. We need to clear the status to avoid reading any stale value.
We need to clear the status to avoid reading any stale value.
My question was if these 'device administrative commands' are related to WQ config which in our case are not done by the application container
I've not looked deep into the code. Is the cmd_status
writing done in the test code? Can it be done conditionally or when the reset is really needed (read returns an error).
Can you please try the latest in pending branch? I have added a fix.
Can you please try the latest in pending branch? I have added a fix.
OK, checking.
Can you please try the latest in pending branch? I have added a fix.
@ramesh-thomas our CI gives green with the patch so it seems to work well, thanks a lot!
@ozhuraki same as #36?