intel / idxd-config

Accel-config / libaccel-config

Unable to configure user/dedicated workqueue with the "config-wq" command #11

Closed ozhuraki closed 2 years ago

ozhuraki commented 2 years ago
# accel-config -v
3.4.2.git63991cc9
# uname -rv
5.11.0-31-generic #33-Ubuntu SMP Wed Aug 11 13:19:04 UTC 2021
# cat /proc/cmdline
[...] intel_iommu=on,sm_on
# id
uid=0(root) gid=0(root) groups=0(root)
# accel-config list
[
]
# accel-config load-config -c dsa0.conf
# accel-config enable-device dsa0
enabled 1 device(s) out of 1
# accel-config enable-wq dsa0/wq0.0
enabled 1 wq(s) out of 1
# accel-config disable-device dsa0
disabled 1 device(s) out of 1
# accel-config list
[
]
# accel-config config-wq --type=user --mode=dedicated --name="dsa0.0" --group-id=0 --wq-size=16 --priority=10 --block-on-fault=1 --threshold=15 dsa0/wq0.0
libaccfg: accfg_wq_set_threshold: wq0.0: write failed: Invalid argument
# accel-config list
[
]
# accel-config load-config -c dsa0.conf
# accel-config enable-device dsa0
enabled 1 device(s) out of 1
# accel-config enable-wq dsa0/wq0.0
enabled 1 wq(s) out of 1
# 
# cat dsa0.conf
[
  {
    "dev":"dsa0",
    "token_limit":0,
    "groups":[
      {
        "dev":"group0.0",
        "tokens_reserved":0,
        "use_token_limit":0,
        "tokens_allowed":8,
        "grouped_workqueues":[
          {
            "dev":"wq0.0",
            "mode":"dedicated",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "type":"user",
            "name":"dsa0.0",
            "threshold":15
          }
        ],
        "grouped_engines":[
          {
            "dev":"engine0.0",
            "group_id":0
          }
        ]
      },
      {
        "dev":"group0.1",
        "tokens_reserved":0,
        "use_token_limit":0,
        "tokens_allowed":8,
        "grouped_workqueues":[
          {
            "dev":"wq0.1",
            "mode":"dedicated",
            "size":16,
            "group_id":1,
            "priority":10,
            "block_on_fault":1,
            "type":"user",
            "name":"dsa0.1",
            "threshold":15
          }
        ],
        "grouped_engines":[
          {
            "dev":"engine0.1",
            "group_id":1
          }
        ]
      },
      {
        "dev":"group0.2",
        "tokens_reserved":0,
        "use_token_limit":0,
        "tokens_allowed":8,
        "grouped_workqueues":[
          {
            "dev":"wq0.2",
            "mode":"dedicated",
            "size":16,
            "group_id":2,
            "priority":10,
            "block_on_fault":1,
            "type":"user",
            "name":"dsa0.2",
            "threshold":15
          }
        ],
        "grouped_engines":[
          {
            "dev":"engine0.2",
            "group_id":2
          }
        ]
      },
      {
        "dev":"group0.3",
        "tokens_reserved":0,
        "use_token_limit":0,
        "tokens_allowed":8,
        "grouped_workqueues":[
          {
            "dev":"wq0.3",
            "mode":"dedicated",
            "size":16,
            "group_id":3,
            "priority":10,
            "block_on_fault":1,
            "type":"user",
            "name":"dsa0.3",
            "threshold":15
          }
        ],
        "grouped_engines":[
          {
            "dev":"engine0.3",
            "group_id":3
          }
        ]
      }
    ]
  }
]
# 
davejiang commented 2 years ago

I think the 5.11 kernel may still have the driver bug that causes wq mode change to not happen. I believe that has been fixed in later kernels. Can you see if a 5.12 kernel works any better?

ramesh-thomas commented 2 years ago

"threshold" is only for shared wqs. Try without the "--threshold" option.
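
As a minimal sketch of that rule, the helper below prints (it does not run) a config-wq command line and appends "--threshold" only when the mode is shared. The function name build_config_wq is hypothetical and not part of accel-config; the option values are copied from the command in the report.

```shell
#!/bin/sh
# Sketch only: build_config_wq is a hypothetical helper, not an accel-config feature.
# It prints the command line instead of running it, so it is safe to try anywhere.
build_config_wq() {
    # $1 = mode (shared|dedicated), $2 = wq path, $3 = optional threshold
    mode=$1
    wq=$2
    threshold=$3
    cmd="accel-config config-wq --group-id=0 --mode=$mode --wq-size=16 --type=user --name=dsa0.0 --priority=10 --block-on-fault=1"
    # Per the comment above, threshold applies to shared wqs only.
    if [ "$mode" = "shared" ] && [ -n "$threshold" ]; then
        cmd="$cmd --threshold=$threshold"
    fi
    printf '%s %s\n' "$cmd" "$wq"
}

build_config_wq dedicated dsa0/wq0.0 15
build_config_wq shared dsa0/wq0.0 15
```

The first call drops the threshold because the wq is dedicated; the second keeps it.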

davejiang commented 2 years ago

Thanks @ramesh-thomas. I misread the log and mistook the failure for an incorrect mode rather than the threshold.

ozhuraki commented 2 years ago

@ramesh-thomas

Try without the "--threshold" option.

# accel-config list
[
]
# accel-config config-wq --type=user --mode=dedicated --name="dsa0.0" --group-id=0 --wq-size=16 --priority=10 --block-on-fault=1 dsa0/wq0.0
# accel-config config-engine --group-id=0 dsa0/engine0.0
# accel-config enable-device dsa0
failed in dsa0
enabled 0 device(s) out of 1
Error[      0x15] dsa0: Invalid group config: lack of wq or engines
# accel-config list
[
]
# accel-config load-config -c dsa0.conf
# accel-config enable-device dsa0
enabled 1 device(s) out of 1
# accel-config enable-wq dsa0/wq0.0
enabled 1 wq(s) out of 1
# 

@davejiang I think the 5.11 kernel may still have the driver bug that causes wq mode change to not happen. I believe that has been fixed in later kernels. Can you see if a 5.12 kernel works any better?

OK, thanks, we will try that. In principle, the same configuration works on 5.11 when applied through "load-config".

davejiang commented 2 years ago

@ozhuraki you don't need to switch kernel. I misread your earlier log.

ramesh-thomas commented 2 years ago

Can you try rebooting or resetting by unloading and reloading idxd module? Those commands worked for me.

ozhuraki commented 2 years ago

@ramesh-thomas

Can you try rebooting or resetting by unloading and reloading idxd module? Those commands worked for me.

After rebooting "config-wq" works, but only once. On reloading:

# modprobe -r idxd
# modprobe idxd
# accel-config list
[
]
# accel-config config-wq --type=user --mode=dedicated --name="dsa0.0" --group-id=0 --wq-size=16 --priority=10 --block-on-fault=1 dsa0/wq0.0
# accel-config config-engine --group-id=0 dsa0/engine0.0
# accel-config enable-device dsa0
failed in dsa0
enabled 0 device(s) out of 1
Error[      0x16] dsa0: Invalid group config: wq misconfigured
# accel-config list
[
]
# accel-config load-config -c dsa0.conf
# accel-config enable-device dsa0
enabled 1 device(s) out of 1
# accel-config enable-wq dsa0/wq0.0
enabled 1 wq(s) out of 1
# 
davejiang commented 2 years ago

Can you try putting group id as the first parameter? I wonder if there's an ordering issue for whatever reason. Something like: accel-config config-wq --group-id=0 --mode=dedicated --wq-size=16 --type=user --name="mywq" --priority=10 --block-on-fault=1 dsa0/wq0.0

ozhuraki commented 2 years ago

@davejiang

Can you try putting group id as the first parameter?

# accel-config list
[
]
# accel-config config-wq --group-id=0 --type=user --mode=dedicated --name="dsa0.0" --wq-size=16 --priority=10 --block-on-fault=1 dsa0/wq0.0
# accel-config config-engine --group-id=0 dsa0/engine0.0
# accel-config enable-device dsa0
failed in dsa0
enabled 0 device(s) out of 1
Error[      0x16] dsa0: Invalid group config: wq misconfigured
# accel-config list
[
]
# accel-config load-config -c dsa0.conf
# accel-config enable-device dsa0
enabled 1 device(s) out of 1
# accel-config enable-wq dsa0/wq0.0
enabled 1 wq(s) out of 1
# 
davejiang commented 2 years ago

Can you attach the dsa0.conf? Also, given it's a dedicated wq, can you try the latest upstream kernel? 5.15-rc5 would be great. Thanks!

ozhuraki commented 2 years ago

@davejiang

Can you attach the dsa0.conf?

# cat dsa0.conf
[
{
"dev":"dsa0",
"token_limit":0,
"groups":[
{
"dev":"group0.0",
"tokens_reserved":0,
"use_token_limit":0,
"tokens_allowed":8,
"grouped_workqueues":[
{
"dev":"wq0.0",
"mode":"dedicated",
"size":16,
"group_id":0,
"priority":10,
"block_on_fault":1,
"type":"user",
"name":"dsa0.0",
"threshold":15
}
],
"grouped_engines":[
{
"dev":"engine0.0",
"group_id":0
}
]
},
{
"dev":"group0.1",
"tokens_reserved":0,
"use_token_limit":0,
"tokens_allowed":8,
"grouped_workqueues":[
{
"dev":"wq0.1",
"mode":"dedicated",
"size":16,
"group_id":1,
"priority":10,
"block_on_fault":1,
"type":"user",
"name":"dsa0.1",
"threshold":15
}
],
"grouped_engines":[
{
"dev":"engine0.1",
"group_id":1
}
]
},
{
"dev":"group0.2",
"tokens_reserved":0,
"use_token_limit":0,
"tokens_allowed":8,
"grouped_workqueues":[
{
"dev":"wq0.2",
"mode":"dedicated",
"size":16,
"group_id":2,
"priority":10,
"block_on_fault":1,
"type":"user",
"name":"dsa0.2",
"threshold":15
}
],
"grouped_engines":[
{
"dev":"engine0.2",
"group_id":2
}
]
},
{
"dev":"group0.3",
"tokens_reserved":0,
"use_token_limit":0,
"tokens_allowed":8,
"grouped_workqueues":[
{
"dev":"wq0.3",
"mode":"dedicated",
"size":16,
"group_id":3,
"priority":10,
"block_on_fault":1,
"type":"user",
"name":"dsa0.3",
"threshold":15
}
],
"grouped_engines":[
{
"dev":"engine0.3",
"group_id":3
}
]
}
]
}
]
# 
davejiang commented 2 years ago

Don't you have a conf file that configures only a single wq, the same as the command line does?

Can you run 'accel-config list -i' after you have configured with the command line? I'm curious what accel-config has configured at that point.

ozhuraki commented 2 years ago

@davejiang

You don't have a conf file that only configures a single wq same as the commandline?

Reducing the conf to fewer than 3 workqueues doesn't work, i.e. such a configuration fails to load through "load-config".

Can you do a 'accel-config list -i' after you have configured with commandline?

# accel-config list
[
]
# accel-config list --idle | jq '.[].dev' | grep dsa
"dsa0"
"dsa1"
"dsa2"
"dsa3"
"dsa4"
"dsa5"
"dsa6"
"dsa7"
# accel-config list --idle | jq '.[0]'
{
"dev": "dsa0",
"token_limit": 0,
"max_groups": 4,
"max_work_queues": 8,
"max_engines": 4,
"work_queue_size": 128,
"numa_node": 0,
"op_cap": [
"0x1003f03ff",
"0",
"0",
"0"
],
"gen_cap": "0x40915f010f",
"version": "0x100",
"state": "disabled",
"max_tokens": 96,
"max_batch_size": 1024,
"max_transfer_size": 2147483648,
"configurable": 1,
"pasid_enabled": 1,
"cdev_major": 234,
"clients": 0,
"groups": [
{
"dev": "group0.0",
"tokens_reserved": 0,
"use_token_limit": 0,
"tokens_allowed": 8,
"traffic_class_a": 0,
"traffic_class_b": 1,
"grouped_engines": [
{
"dev": "engine0.0",
"group_id": 0
}
]
},
{
"dev": "group0.1",
"tokens_reserved": 0,
"use_token_limit": 0,
"tokens_allowed": 8,
"traffic_class_a": 0,
"traffic_class_b": 1,
"grouped_engines": [
{
"dev": "engine0.1",
"group_id": 1
}
]
},
{
"dev": "group0.2",
"tokens_reserved": 0,
"use_token_limit": 0,
"tokens_allowed": 8,
"traffic_class_a": 0,
"traffic_class_b": 1,
"grouped_engines": [
{
"dev": "engine0.2",
"group_id": 2
}
]
},
{
"dev": "group0.3",
"tokens_reserved": 0,
"use_token_limit": 0,
"tokens_allowed": 8,
"traffic_class_a": 0,
"traffic_class_b": 1,
"grouped_engines": [
{
"dev": "engine0.3",
"group_id": 3
}
]
}
],
"ungrouped workqueues": [
{
"dev": "wq0.0",
"mode": "shared",
"size": 0,
"priority": 0,
"block_on_fault": 1,
"max_batch_size": 1024,
"max_transfer_size": 2147483648,
"type": "none",
"name": "",
"threshold": 0,
"ats_disable": 0,
"state": "disabled",
"clients": 0
},
{
"dev": "wq0.1",
"mode": "shared",
"size": 0,
"priority": 0,
"block_on_fault": 1,
"max_batch_size": 1024,
"max_transfer_size": 2147483648,
"type": "none",
"name": "",
"threshold": 0,
"ats_disable": 0,
"state": "disabled",
"clients": 0
},
{
"dev": "wq0.2",
"mode": "shared",
"size": 0,
"priority": 0,
"block_on_fault": 1,
"max_batch_size": 1024,
"max_transfer_size": 2147483648,
"type": "none",
"name": "",
"threshold": 0,
"ats_disable": 0,
"state": "disabled",
"clients": 0
},
{
"dev": "wq0.3",
"mode": "shared",
"size": 0,
"priority": 0,
"block_on_fault": 1,
"max_batch_size": 1024,
"max_transfer_size": 2147483648,
"type": "none",
"name": "",
"threshold": 0,
"ats_disable": 0,
"state": "disabled",
"clients": 0
},
{
"dev": "wq0.4",
"mode": "shared",
"size": 0,
"priority": 0,
"block_on_fault": 0,
"max_batch_size": 1024,
"max_transfer_size": 2147483648,
"type": "none",
"name": "",
"threshold": 0,
"ats_disable": 0,
"state": "disabled",
"clients": 0
},
{
"dev": "wq0.5",
"mode": "shared",
"size": 0,
"priority": 0,
"block_on_fault": 0,
"max_batch_size": 1024,
"max_transfer_size": 2147483648,
"type": "none",
"name": "",
"threshold": 0,
"ats_disable": 0,
"state": "disabled",
"clients": 0
},
{
"dev": "wq0.6",
"mode": "shared",
"size": 0,
"priority": 0,
"block_on_fault": 0,
"max_batch_size": 1024,
"max_transfer_size": 2147483648,
"type": "none",
"name": "",
"threshold": 0,
"ats_disable": 0,
"state": "disabled",
"clients": 0
},
{
"dev": "wq0.7",
"mode": "shared",
"size": 0,
"priority": 0,
"block_on_fault": 0,
"max_batch_size": 1024,
"max_transfer_size": 2147483648,
"type": "none",
"name": "",
"threshold": 0,
"ats_disable": 0,
"state": "disabled",
"clients": 0
}
]
}
# accel-config load-config -c dsa0.conf
# accel-config enable-device dsa0
enabled 1 device(s) out of 1
# accel-config enable-wq dsa0/wq0.0
enabled 1 wq(s) out of 1
# accel-config list --idle | jq '.[0]'
{
"dev": "dsa0",
"token_limit": 0,
"max_groups": 4,
"max_work_queues": 8,
"max_engines": 4,
"work_queue_size": 128,
"numa_node": 0,
"op_cap": [
"0x1003f03ff",
"0",
"0",
"0"
],
"gen_cap": "0x40915f010f",
"version": "0x100",
"state": "enabled",
"max_tokens": 96,
"max_batch_size": 1024,
"max_transfer_size": 2147483648,
"configurable": 1,
"pasid_enabled": 1,
"cdev_major": 234,
"clients": 0,
"groups": [
{
"dev": "group0.0",
"tokens_reserved": 0,
"use_token_limit": 0,
"tokens_allowed": 8,
"traffic_class_a": 0,
"traffic_class_b": 1,
"grouped_workqueues": [
{
"dev": "wq0.0",
"mode": "dedicated",
"size": 16,
"group_id": 0,
"priority": 10,
"block_on_fault": 1,
"max_batch_size": 1024,
"max_transfer_size": 2147483648,
"cdev_minor": 0,
"type": "user",
"name": "dsa0.0",
"threshold": 0,
"ats_disable": 0,
"state": "enabled",
"clients": 0
}
],
"grouped_engines": [
{
"dev": "engine0.0",
"group_id": 0
}
]
},
{
"dev": "group0.1",
"tokens_reserved": 0,
"use_token_limit": 0,
"tokens_allowed": 8,
"traffic_class_a": 0,
"traffic_class_b": 1,
"grouped_workqueues": [
{
"dev": "wq0.1",
"mode": "dedicated",
"size": 16,
"group_id": 1,
"priority": 10,
"block_on_fault": 1,
"max_batch_size": 1024,
"max_transfer_size": 2147483648,
"type": "user",
"name": "dsa0.1",
"threshold": 0,
"ats_disable": 0,
"state": "disabled",
"clients": 0
}
],
"grouped_engines": [
{
"dev": "engine0.1",
"group_id": 1
}
]
},
{
"dev": "group0.2",
"tokens_reserved": 0,
"use_token_limit": 0,
"tokens_allowed": 8,
"traffic_class_a": 0,
"traffic_class_b": 1,
"grouped_workqueues": [
{
"dev": "wq0.2",
"mode": "dedicated",
"size": 16,
"group_id": 2,
"priority": 10,
"block_on_fault": 1,
"max_batch_size": 1024,
"max_transfer_size": 2147483648,
"type": "user",
"name": "dsa0.2",
"threshold": 0,
"ats_disable": 0,
"state": "disabled",
"clients": 0
}
],
"grouped_engines": [
{
"dev": "engine0.2",
"group_id": 2
}
]
},
{
"dev": "group0.3",
"tokens_reserved": 0,
"use_token_limit": 0,
"tokens_allowed": 8,
"traffic_class_a": 0,
"traffic_class_b": 1,
"grouped_workqueues": [
{
"dev": "wq0.3",
"mode": "dedicated",
"size": 16,
"group_id": 3,
"priority": 10,
"block_on_fault": 1,
"max_batch_size": 1024,
"max_transfer_size": 2147483648,
"type": "user",
"name": "dsa0.3",
"threshold": 0,
"ats_disable": 0,
"state": "disabled",
"clients": 0
}
],
"grouped_engines": [
{
"dev": "engine0.3",
"group_id": 3
}
]
}
],
"ungrouped workqueues": [
{
"dev": "wq0.4",
"mode": "shared",
"size": 0,
"priority": 0,
"block_on_fault": 0,
"max_batch_size": 1024,
"max_transfer_size": 2147483648,
"type": "none",
"name": "",
"threshold": 0,
"ats_disable": 0,
"state": "disabled",
"clients": 0
},
{
"dev": "wq0.5",
"mode": "shared",
"size": 0,
"priority": 0,
"block_on_fault": 0,
"max_batch_size": 1024,
"max_transfer_size": 2147483648,
"type": "none",
"name": "",
"threshold": 0,
"ats_disable": 0,
"state": "disabled",
"clients": 0
},
{
"dev": "wq0.6",
"mode": "shared",
"size": 0,
"priority": 0,
"block_on_fault": 0,
"max_batch_size": 1024,
"max_transfer_size": 2147483648,
"type": "none",
"name": "",
"threshold": 0,
"ats_disable": 0,
"state": "disabled",
"clients": 0
},
{
"dev": "wq0.7",
"mode": "shared",
"size": 0,
"priority": 0,
"block_on_fault": 0,
"max_batch_size": 1024,
"max_transfer_size": 2147483648,
"type": "none",
"name": "",
"threshold": 0,
"ats_disable": 0,
"state": "disabled",
"clients": 0
}
]
}
# 
# accel-config list --idle | jq '.[].dev' | grep dsa
"dsa0"
"dsa1"
"dsa2"
"dsa3"
"dsa4"
"dsa5"
"dsa6"
"dsa7"
davejiang commented 2 years ago

I find it strange that the engines are all pre-assigned to groups. Can you reboot, run that single accel-config config-wq, and then do the filtered accel-config list --idle please? Thanks!

mythi commented 2 years ago

@davejiang

Can you reboot

We find the need to reboot a bit strange. Would rmmod/modprobe idxd be enough, as suggested by @ramesh-thomas earlier:

Can you try rebooting or resetting by unloading and reloading idxd module?

ozhuraki commented 2 years ago

@davejiang

Can you reboot

Unfortunately, rebooting is problematic because the machine has multiple users. Resetting by unloading/loading the idxd module was already tried https://github.com/intel/idxd-config/issues/11#issuecomment-944297432. Are there any other ways to reset the DSA HW?

While investigating this, we observed earlier that "config-wq", "config-engine", "enable-device", "enable-wq" and the "disable..." commands work only a limited number of times after a reboot. This was reproducible on multiple physical setups.

Since the identical configuration can be successfully loaded and enabled through "load-config", is the problem in the order in which accel-config sets the sysfs entries for "config-wq" / "config-engine" / "enable-device"?
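
One way to probe the ordering question is to write the attributes by hand, one at a time, and log which write fails. The sketch below does that under several assumptions: the devices are exposed under /sys/bus/dsa/devices, the attribute file names match the JSON keys printed by accel-config (mode, size, group_id and so on), and the order used here is only an illustration, not the order accel-config itself uses. SYSFS_ROOT, write_attr and configure_dedicated_wq are hypothetical names.

```shell
#!/bin/sh
# Sketch only: SYSFS_ROOT, write_attr and configure_dedicated_wq are hypothetical.
# Attribute file names are assumed to mirror the keys in the accel-config JSON output.
SYSFS_ROOT=${SYSFS_ROOT:-/sys/bus/dsa/devices}

write_attr() {
    # $1 = attribute path relative to SYSFS_ROOT, $2 = value to write
    if printf '%s\n' "$2" 2>/dev/null > "$SYSFS_ROOT/$1"; then
        echo "ok   $1 <- $2"
    else
        echo "FAIL $1 <- $2"
        return 1
    fi
}

configure_dedicated_wq() {
    # $1 = device, $2 = wq, $3 = engine, $4 = group id
    dev=$1
    wq=$2
    engine=$3
    group=$4
    # The && chain stops at the first failing write, so the last line
    # printed identifies the attribute that was rejected.
    write_attr "$dev/$wq/group_id" "$group" &&
    write_attr "$dev/$wq/mode" dedicated &&
    write_attr "$dev/$wq/size" 16 &&
    write_attr "$dev/$wq/priority" 10 &&
    write_attr "$dev/$wq/block_on_fault" 1 &&
    write_attr "$dev/$wq/type" user &&
    write_attr "$dev/$wq/name" "dsa0.0" &&
    write_attr "$dev/$engine/group_id" "$group"
}
```

Running "configure_dedicated_wq dsa0 wq0.0 engine0.0 0" as root on a disabled device and reordering the write_attr lines between attempts would show whether any particular order is rejected. Pointing SYSFS_ROOT at a scratch directory allows a dry run without touching the hardware.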

davejiang commented 2 years ago

You can unload the module. But I really want a clean slate to see whether this is a real problem or whether something else caused it. Also, the 5.11 kernel is pretty old, considering 5.15 is about to be released. The 5.11 kernel probably has a lot of bugs that are fixed in later kernels. Unless you can reproduce the bug on the latest upstream kernel, there isn't much we can do. BTW, what silicon stepping are you using?
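
A small sketch for checking the running kernel against a minimum version before retesting. kernel_at_least is a hypothetical helper, and it assumes GNU sort with the -V (version sort) option is available. It strips everything after the first "-", so a release candidate such as 5.15-rc5 is treated as 5.15.

```shell
#!/bin/sh
# Sketch only: kernel_at_least is a hypothetical helper; relies on GNU "sort -V".
kernel_at_least() {
    # $1 = minimum version, $2 = version to test (defaults to the running kernel)
    have=${2:-$(uname -r)}
    have=${have%%-*}    # drop suffixes such as "-31-generic" or "-rc5"
    lowest=$(printf '%s\n%s\n' "$1" "$have" | sort -V | head -n 1)
    # The tested version is new enough when the minimum sorts first.
    [ "$lowest" = "$1" ]
}

if kernel_at_least 5.15; then
    echo "running kernel is 5.15 or newer"
else
    echo "running kernel is older than 5.15"
fi
```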