Closed ozhuraki closed 2 years ago
I think the 5.11 kernel may still have the driver bug that causes wq mode change to not happen. I believe that has been fixed in later kernels. Can you see if a 5.12 kernel works any better?
"threshold" is only for shared wqs. Try without the "--threshold" option.
Thanks @ramesh-thomas. I misread the log and took the problem to be an incorrect mode rather than the threshold.
@ramesh-thomas
Try without the "--threshold" option.
# accel-config list
[
]
# accel-config config-wq --type=user --mode=dedicated --name="dsa0.0" --group-id=0 --wq-size=16 --priority=10 --block-on-fault=1 dsa0/wq0.0
# accel-config config-engine --group-id=0 dsa0/engine0.0
# accel-config enable-device dsa0
failed in dsa0
enabled 0 device(s) out of 1
Error[ 0x15] dsa0: Invalid group config: lack of wq or engines
# accel-config list
[
]
# accel-config load-config -c dsa0.conf
# accel-config enable-device dsa0
enabled 1 device(s) out of 1
# accel-config enable-wq dsa0/wq0.0
enabled 1 wq(s) out of 1
#
@davejiang I think the 5.11 kernel may still have the driver bug that causes wq mode change to not happen. I believe that has been fixed in later kernels. Can you see if a 5.12 kernel works any better?
OK, thanks, we will try that. In principle, it works in 5.11 through load-configuration.
@ozhuraki you don't need to switch kernel. I misread your earlier log.
Can you try rebooting or resetting by unloading and reloading idxd module? Those commands worked for me.
@ramesh-thomas
Can you try rebooting or resetting by unloading and reloading idxd module? Those commands worked for me.
After rebooting "config-wq" works, but only once. On reloading:
# modprobe -r idxd
# modprobe idxd
# accel-config list
[
]
# accel-config config-wq --type=user --mode=dedicated --name="dsa0.0" --group-id=0 --wq-size=16 --priority=10 --block-on-fault=1 dsa0/wq0.0
# accel-config config-engine --group-id=0 dsa0/engine0.0
# accel-config enable-device dsa0
failed in dsa0
enabled 0 device(s) out of 1
Error[ 0x16] dsa0: Invalid group config: wq misconfigured
# accel-config list
[
]
# accel-config load-config -c dsa0.conf
# accel-config enable-device dsa0
enabled 1 device(s) out of 1
# accel-config enable-wq dsa0/wq0.0
enabled 1 wq(s) out of 1
#
Can you try putting group id as the first parameter? I wonder if there's an ordering issue for whatever reason. Something like: accel-config config-wq --group-id=0 --mode=dedicated --wq-size=16 --type=user --name="mywq" --priority=10 --block-on-fault=1 dsa0/wq0.0
@davejiang
Can you try putting group id as the first parameter?
# accel-config list
[
]
# accel-config config-wq --group-id=0 --type=user --mode=dedicated --name="dsa0.0" --wq-size=16 --priority=10 --block-on-fault=1 dsa0/wq0.0
# accel-config config-engine --group-id=0 dsa0/engine0.0
# accel-config enable-device dsa0
failed in dsa0
enabled 0 device(s) out of 1
Error[ 0x16] dsa0: Invalid group config: wq misconfigured
# accel-config list
[
]
# accel-config load-config -c dsa0.conf
# accel-config enable-device dsa0
enabled 1 device(s) out of 1
# accel-config enable-wq dsa0/wq0.0
enabled 1 wq(s) out of 1
#
Can you attach the dsa0.conf? Also, given it's a dedicated wq, can you try the latest upstream kernel? 5.15-rc5 would be great. Thanks!
@davejiang
Can you attach the dsa0.conf?
# cat dsa0.conf
[
  {
    "dev":"dsa0",
    "token_limit":0,
    "groups":[
      {
        "dev":"group0.0", "tokens_reserved":0, "use_token_limit":0, "tokens_allowed":8,
        "grouped_workqueues":[
          { "dev":"wq0.0", "mode":"dedicated", "size":16, "group_id":0, "priority":10, "block_on_fault":1, "type":"user", "name":"dsa0.0", "threshold":15 }
        ],
        "grouped_engines":[
          { "dev":"engine0.0", "group_id":0 }
        ]
      },
      {
        "dev":"group0.1", "tokens_reserved":0, "use_token_limit":0, "tokens_allowed":8,
        "grouped_workqueues":[
          { "dev":"wq0.1", "mode":"dedicated", "size":16, "group_id":1, "priority":10, "block_on_fault":1, "type":"user", "name":"dsa0.1", "threshold":15 }
        ],
        "grouped_engines":[
          { "dev":"engine0.1", "group_id":1 }
        ]
      },
      {
        "dev":"group0.2", "tokens_reserved":0, "use_token_limit":0, "tokens_allowed":8,
        "grouped_workqueues":[
          { "dev":"wq0.2", "mode":"dedicated", "size":16, "group_id":2, "priority":10, "block_on_fault":1, "type":"user", "name":"dsa0.2", "threshold":15 }
        ],
        "grouped_engines":[
          { "dev":"engine0.2", "group_id":2 }
        ]
      },
      {
        "dev":"group0.3", "tokens_reserved":0, "use_token_limit":0, "tokens_allowed":8,
        "grouped_workqueues":[
          { "dev":"wq0.3", "mode":"dedicated", "size":16, "group_id":3, "priority":10, "block_on_fault":1, "type":"user", "name":"dsa0.3", "threshold":15 }
        ],
        "grouped_engines":[
          { "dev":"engine0.3", "group_id":3 }
        ]
      }
    ]
  }
]
#
You don't have a conf file that only configures a single wq same as the commandline?
Can you do a 'accel-config list -i' after you have configured with commandline? Curious what accel-config has configured so far after commandline.
@davejiang
You don't have a conf file that only configures a single wq same as the commandline?
Reducing the conf to fewer than 3 workqueues doesn't work, i.e. such a configuration fails to load through "load-config".
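For reference, a single-wq conf of the kind asked about can be derived from the dsa0.conf shown above rather than written by hand. This is only a sketch: it assumes `jq` is available and that the conf keeps the layout shown above (a top-level array of devices, each with a "groups" array). The helper name `reduce_conf` and the output file name are made up for illustration, and, as noted above, a conf reduced this way failed to load on this setup.

```shell
# Keep only the first group (group0.0 with wq0.0 and engine0.0) of every
# device in an accel-config conf file, and print the result to stdout.
# $1: path to the conf file (layout assumed to match dsa0.conf above)
reduce_conf() {
    jq 'map(.groups |= .[0:1])' "$1"
}

# Example (dsa0-single.conf is a hypothetical file name):
#   reduce_conf dsa0.conf > dsa0-single.conf
```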
Can you do a 'accel-config list -i' after you have configured with commandline?
# accel-config list
[
]
# accel-config list --idle | jq '.[].dev' | grep dsa
"dsa0"
"dsa1"
"dsa2"
"dsa3"
"dsa4"
"dsa5"
"dsa6"
"dsa7"
# accel-config list --idle | jq '.[0]'
{
  "dev": "dsa0",
  "token_limit": 0,
  "max_groups": 4,
  "max_work_queues": 8,
  "max_engines": 4,
  "work_queue_size": 128,
  "numa_node": 0,
  "op_cap": [ "0x1003f03ff", "0", "0", "0" ],
  "gen_cap": "0x40915f010f",
  "version": "0x100",
  "state": "disabled",
  "max_tokens": 96,
  "max_batch_size": 1024,
  "max_transfer_size": 2147483648,
  "configurable": 1,
  "pasid_enabled": 1,
  "cdev_major": 234,
  "clients": 0,
  "groups": [
    {
      "dev": "group0.0", "tokens_reserved": 0, "use_token_limit": 0, "tokens_allowed": 8, "traffic_class_a": 0, "traffic_class_b": 1,
      "grouped_engines": [ { "dev": "engine0.0", "group_id": 0 } ]
    },
    {
      "dev": "group0.1", "tokens_reserved": 0, "use_token_limit": 0, "tokens_allowed": 8, "traffic_class_a": 0, "traffic_class_b": 1,
      "grouped_engines": [ { "dev": "engine0.1", "group_id": 1 } ]
    },
    {
      "dev": "group0.2", "tokens_reserved": 0, "use_token_limit": 0, "tokens_allowed": 8, "traffic_class_a": 0, "traffic_class_b": 1,
      "grouped_engines": [ { "dev": "engine0.2", "group_id": 2 } ]
    },
    {
      "dev": "group0.3", "tokens_reserved": 0, "use_token_limit": 0, "tokens_allowed": 8, "traffic_class_a": 0, "traffic_class_b": 1,
      "grouped_engines": [ { "dev": "engine0.3", "group_id": 3 } ]
    }
  ],
  "ungrouped workqueues": [
    { "dev": "wq0.0", "mode": "shared", "size": 0, "priority": 0, "block_on_fault": 1, "max_batch_size": 1024, "max_transfer_size": 2147483648, "type": "none", "name": "", "threshold": 0, "ats_disable": 0, "state": "disabled", "clients": 0 },
    { "dev": "wq0.1", "mode": "shared", "size": 0, "priority": 0, "block_on_fault": 1, "max_batch_size": 1024, "max_transfer_size": 2147483648, "type": "none", "name": "", "threshold": 0, "ats_disable": 0, "state": "disabled", "clients": 0 },
    { "dev": "wq0.2", "mode": "shared", "size": 0, "priority": 0, "block_on_fault": 1, "max_batch_size": 1024, "max_transfer_size": 2147483648, "type": "none", "name": "", "threshold": 0, "ats_disable": 0, "state": "disabled", "clients": 0 },
    { "dev": "wq0.3", "mode": "shared", "size": 0, "priority": 0, "block_on_fault": 1, "max_batch_size": 1024, "max_transfer_size": 2147483648, "type": "none", "name": "", "threshold": 0, "ats_disable": 0, "state": "disabled", "clients": 0 },
    { "dev": "wq0.4", "mode": "shared", "size": 0, "priority": 0, "block_on_fault": 0, "max_batch_size": 1024, "max_transfer_size": 2147483648, "type": "none", "name": "", "threshold": 0, "ats_disable": 0, "state": "disabled", "clients": 0 },
    { "dev": "wq0.5", "mode": "shared", "size": 0, "priority": 0, "block_on_fault": 0, "max_batch_size": 1024, "max_transfer_size": 2147483648, "type": "none", "name": "", "threshold": 0, "ats_disable": 0, "state": "disabled", "clients": 0 },
    { "dev": "wq0.6", "mode": "shared", "size": 0, "priority": 0, "block_on_fault": 0, "max_batch_size": 1024, "max_transfer_size": 2147483648, "type": "none", "name": "", "threshold": 0, "ats_disable": 0, "state": "disabled", "clients": 0 },
    { "dev": "wq0.7", "mode": "shared", "size": 0, "priority": 0, "block_on_fault": 0, "max_batch_size": 1024, "max_transfer_size": 2147483648, "type": "none", "name": "", "threshold": 0, "ats_disable": 0, "state": "disabled", "clients": 0 }
  ]
}
# accel-config load-config -c dsa0.conf
# accel-config enable-device dsa0
enabled 1 device(s) out of 1
# accel-config enable-wq dsa0/wq0.0
enabled 1 wq(s) out of 1
# accel-config list --idle | jq '.[0]'
{
  "dev": "dsa0",
  "token_limit": 0,
  "max_groups": 4,
  "max_work_queues": 8,
  "max_engines": 4,
  "work_queue_size": 128,
  "numa_node": 0,
  "op_cap": [ "0x1003f03ff", "0", "0", "0" ],
  "gen_cap": "0x40915f010f",
  "version": "0x100",
  "state": "enabled",
  "max_tokens": 96,
  "max_batch_size": 1024,
  "max_transfer_size": 2147483648,
  "configurable": 1,
  "pasid_enabled": 1,
  "cdev_major": 234,
  "clients": 0,
  "groups": [
    {
      "dev": "group0.0", "tokens_reserved": 0, "use_token_limit": 0, "tokens_allowed": 8, "traffic_class_a": 0, "traffic_class_b": 1,
      "grouped_workqueues": [
        { "dev": "wq0.0", "mode": "dedicated", "size": 16, "group_id": 0, "priority": 10, "block_on_fault": 1, "max_batch_size": 1024, "max_transfer_size": 2147483648, "cdev_minor": 0, "type": "user", "name": "dsa0.0", "threshold": 0, "ats_disable": 0, "state": "enabled", "clients": 0 }
      ],
      "grouped_engines": [ { "dev": "engine0.0", "group_id": 0 } ]
    },
    {
      "dev": "group0.1", "tokens_reserved": 0, "use_token_limit": 0, "tokens_allowed": 8, "traffic_class_a": 0, "traffic_class_b": 1,
      "grouped_workqueues": [
        { "dev": "wq0.1", "mode": "dedicated", "size": 16, "group_id": 1, "priority": 10, "block_on_fault": 1, "max_batch_size": 1024, "max_transfer_size": 2147483648, "type": "user", "name": "dsa0.1", "threshold": 0, "ats_disable": 0, "state": "disabled", "clients": 0 }
      ],
      "grouped_engines": [ { "dev": "engine0.1", "group_id": 1 } ]
    },
    {
      "dev": "group0.2", "tokens_reserved": 0, "use_token_limit": 0, "tokens_allowed": 8, "traffic_class_a": 0, "traffic_class_b": 1,
      "grouped_workqueues": [
        { "dev": "wq0.2", "mode": "dedicated", "size": 16, "group_id": 2, "priority": 10, "block_on_fault": 1, "max_batch_size": 1024, "max_transfer_size": 2147483648, "type": "user", "name": "dsa0.2", "threshold": 0, "ats_disable": 0, "state": "disabled", "clients": 0 }
      ],
      "grouped_engines": [ { "dev": "engine0.2", "group_id": 2 } ]
    },
    {
      "dev": "group0.3", "tokens_reserved": 0, "use_token_limit": 0, "tokens_allowed": 8, "traffic_class_a": 0, "traffic_class_b": 1,
      "grouped_workqueues": [
        { "dev": "wq0.3", "mode": "dedicated", "size": 16, "group_id": 3, "priority": 10, "block_on_fault": 1, "max_batch_size": 1024, "max_transfer_size": 2147483648, "type": "user", "name": "dsa0.3", "threshold": 0, "ats_disable": 0, "state": "disabled", "clients": 0 }
      ],
      "grouped_engines": [ { "dev": "engine0.3", "group_id": 3 } ]
    }
  ],
  "ungrouped workqueues": [
    { "dev": "wq0.4", "mode": "shared", "size": 0, "priority": 0, "block_on_fault": 0, "max_batch_size": 1024, "max_transfer_size": 2147483648, "type": "none", "name": "", "threshold": 0, "ats_disable": 0, "state": "disabled", "clients": 0 },
    { "dev": "wq0.5", "mode": "shared", "size": 0, "priority": 0, "block_on_fault": 0, "max_batch_size": 1024, "max_transfer_size": 2147483648, "type": "none", "name": "", "threshold": 0, "ats_disable": 0, "state": "disabled", "clients": 0 },
    { "dev": "wq0.6", "mode": "shared", "size": 0, "priority": 0, "block_on_fault": 0, "max_batch_size": 1024, "max_transfer_size": 2147483648, "type": "none", "name": "", "threshold": 0, "ats_disable": 0, "state": "disabled", "clients": 0 },
    { "dev": "wq0.7", "mode": "shared", "size": 0, "priority": 0, "block_on_fault": 0, "max_batch_size": 1024, "max_transfer_size": 2147483648, "type": "none", "name": "", "threshold": 0, "ats_disable": 0, "state": "disabled", "clients": 0 }
  ]
}
#
# accel-config list --idle | jq '.[].dev' | grep dsa
"dsa0"
"dsa1"
"dsa2"
"dsa3"
"dsa4"
"dsa5"
"dsa6"
"dsa7"
I find the engines all pre-assigned to each group to be strange. Can you reboot, run that single accel-config config-wq, and then do the filtered accel-config list --idle please? Thanks!
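To keep the requested listing short, a filter along these lines could reduce the `accel-config list --idle` JSON to one line per wq. This is a sketch, not part of accel-config: the function name `summarize_wqs` is made up, `jq` is assumed to be installed, and the key names ("groups", "grouped_workqueues", "ungrouped workqueues") are taken from the output posted earlier in this thread.

```shell
# Read `accel-config list --idle` JSON on stdin and print one line per wq:
#   <device> <wq> mode=<mode> size=<size> group=<group_id or none> state=<state>
summarize_wqs() {
    jq -r '.[] | .dev as $d
        | ((.groups // [])[] | (.grouped_workqueues // [])[]),
          (."ungrouped workqueues" // [])[]
        | "\($d) \(.dev) mode=\(.mode) size=\(.size) group=\(.group_id // "none") state=\(.state)"'
}

# Example:
#   accel-config list --idle | summarize_wqs | grep '^dsa0 '
```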
@davejiang
Can you reboot
We find this rebooting a bit strange. Would rmmod/modprobe idxd be enough, as @ramesh-thomas suggested earlier?
Can you try rebooting or resetting by unloading and reloading idxd module?
@davejiang
Can you reboot
Unfortunately, the machine has multiple users, so rebooting is problematic. Resetting by unloading and reloading the idxd module was already tried: https://github.com/intel/idxd-config/issues/11#issuecomment-944297432. Are there any other ways to reset the DSA HW?
An earlier observation made while investigating this: "config-wq", "config-engine", "enable-device", "enable-wq", "disable..." work only a limited number of times after a reboot. This was reproducible on multiple physical setups.
Since the identical configuration can be successfully loaded and enabled through "load-config", is the problem in the order in which accel-config sets the sysfs entries for "config-wq" / "config-engine" / "enable-device"?
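One way to narrow this down might be to dump the wq attributes once after the "config-wq" path and once after the "load-config" path, then diff the two dumps. The sketch below makes two assumptions that are not confirmed anywhere in this thread: that the sysfs attribute files carry the same names as the keys in dsa0.conf, and that the wq directory is /sys/bus/dsa/devices/wq0.0. The helper name `dump_wq` and the output file names are made up.

```shell
# Print "attr=value" for each wq attribute that config-wq is expected to set.
# $1: wq directory (assumed to be /sys/bus/dsa/devices/wq0.0 on a real system)
# Attribute names are assumed to match the keys used in dsa0.conf.
dump_wq() {
    for attr in group_id mode size priority block_on_fault type name threshold; do
        if [ -r "$1/$attr" ]; then
            printf '%s=%s\n' "$attr" "$(cat "$1/$attr")"
        else
            # Missing or unreadable attribute: make that visible in the diff.
            printf '%s=<unreadable>\n' "$attr"
        fi
    done
}

# Example (file names are hypothetical):
#   dump_wq /sys/bus/dsa/devices/wq0.0 > after-config-wq.txt
#   ... reset, then accel-config load-config -c dsa0.conf ...
#   dump_wq /sys/bus/dsa/devices/wq0.0 > after-load-config.txt
#   diff after-config-wq.txt after-load-config.txt
```

A difference between the two dumps would point at a specific attribute; identical dumps would suggest the write order or the device state is what matters.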
You can unload the module. But I really want a clean slate to see whether this is a real problem or whether something else caused it. Also, the 5.11 kernel is pretty old, considering 5.15 is about to be released. 5.11 probably has a lot of bugs that are fixed in later kernels. Unless you can reproduce the bug on the latest upstream kernel, there isn't much we can do. BTW, what silicon stepping are you using?