intel / DML

Intel® Data Mover Library (Intel® DML)
https://intel.github.io/DML/
MIT License
Error when using hardware path #40

Open chen982 opened 7 months ago

chen982 commented 7 months ago

When I run the example on the hardware path, it fails with this error:

    ./ll_crc_example hardware_path
    The example will be run on the hardware path.
    Starting CRC job example.
    Caclulating CRC for region of size 1KB.
    An error (100) occured during job execution.

So I tried two steps:

  1. Check that the shared libraries resolve:

    ldd /usr/bin/accel-config
    linux-vdso.so.1 (0x00007fffe05d5000)
    libaccel-config.so.1 => /usr/lib64/libaccel-config.so.1 (0x00007f14c4bdf000)
    libjson-c.so.4 => /lib/x86_64-linux-gnu/libjson-c.so.4 (0x00007f14c4bba000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f14c49c8000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f14c4c0e000)

  2. Load the DML configuration:

    sudo python3 accel_conf.py --load=../configs/1n1d1e1w-s-n1.conf
    Filter:
    No active devices
    Loading configuration - done
    Additional configuration steps
    Force block on fault: False
    Enabling configured devices
    dsa0 - error
    wq0.0 - error
    failed in dsa0/wq0.0
    enabled 0 wq(s) out of 1
    Checking configuration
    No active devices

How can I fix this? And once step 2 succeeds, should ./ll_crc_example hardware_path run correctly?

chen982 commented 7 months ago

When I configure the device manually, the result is different:

    sudo accel-config load-config -c ../configs/1n1d1e1w-s-n1.conf
    sudo accel-config enable-device dsa0
    failed in dsa0
    enabled 0 device(s) out of 1
    Error[ 0x14] dsa0: Sum of WQCFG size fields out of range

If I decrease the size, it reports a WQ group config error, so the config does not seem to fit my device. I then edited it manually and successfully enabled a shared WQ.
Then I reran ./ll_crc_example hardware_path. It still shows "An error (100) occured during job execution." The accel-config list output looks like this:

    [
      {
        "dev":"dsa0",
        "read_buffer_limit":0,
        "max_groups":4,
        "max_work_queues":8,
        "max_engines":4,
        "work_queue_size":128,
        "numa_node":0,
        "gen_cap":"0x40915f0107",
        "version":"0x100",
        "state":"enabled",
        "max_read_buffers":96,
        "max_batch_size":1024,
        "ims_size":2048,
        "max_transfer_size":2147483648,
        "configurable":1,
        "pasid_enabled":1,
        "cdev_major":244,
        "clients":0,
        "groups":[
          {
            "dev":"group0.0",
            "read_buffers_reserved":0,
            "use_read_buffer_limit":0,
            "read_buffers_allowed":8,
            "traffic_class_a":0,
            "traffic_class_b":1,
            "grouped_workqueues":[
              {
                "dev":"wq0.0",
                "mode":"shared",
                "size":16,
                "group_id":0,
                "priority":10,
                "block_on_fault":1,
                "max_batch_size":1024,
                "max_transfer_size":2147483648,
                "cdev_minor":0,
                "type":"user",
                "name":"app1",
                "threshold":15,
                "ats_disable":0,
                "state":"enabled",
                "clients":0
              }
            ],
            "grouped_engines":[
              { "dev":"engine0.0", "group_id":0 },
              { "dev":"engine0.1", "group_id":0 }
            ]
          },
          {
            "dev":"group0.1",
            "read_buffers_reserved":0,
            "use_read_buffer_limit":0,
            "read_buffers_allowed":8,
            "traffic_class_a":0,
            "traffic_class_b":1,
            "grouped_engines":[
              { "dev":"engine0.2", "group_id":1 },
              { "dev":"engine0.3", "group_id":1 }
            ]
          },
          {
            "dev":"group0.2",
            "read_buffers_reserved":0,
            "use_read_buffer_limit":0,
            "read_buffers_allowed":8,
            "traffic_class_a":0,
            "traffic_class_b":1
          },
          {
            "dev":"group0.3",
            "read_buffers_reserved":0,
            "use_read_buffer_limit":0,
            "read_buffers_allowed":8,
            "traffic_class_a":0,
            "traffic_class_b":1
          }
        ]
      }
    ]
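For anyone comparing their own setup against this output, here is a minimal sketch that reads accel-config list JSON and reports which work queues are enabled and how much WQ space they use. The function name summarize_devices is hypothetical. The sketch assumes that the device-level "work_queue_size" field is the total WQ space the device offers, which is what the "Sum of WQCFG size fields out of range" error would presumably be checked against. The sample is a trimmed copy of the values posted above.

```python
import json


def summarize_devices(accel_list_json):
    """Return one summary dict per device found in `accel-config list` JSON text."""
    summaries = []
    for device in json.loads(accel_list_json):
        # Collect every work queue from every group; groups may have none.
        wqs = [
            wq
            for group in device.get("groups", [])
            for wq in group.get("grouped_workqueues", [])
        ]
        total = sum(wq.get("size", 0) for wq in wqs)
        # ASSUMPTION: "work_queue_size" is the device's total WQ capacity.
        capacity = device.get("work_queue_size")
        summaries.append({
            "dev": device.get("dev"),
            "state": device.get("state"),
            "enabled_wqs": [
                (wq.get("dev"), wq.get("mode"))
                for wq in wqs
                if wq.get("state") == "enabled"
            ],
            "total_wq_size": total,
            "wq_capacity": capacity,
            "fits": capacity is None or total <= capacity,
        })
    return summaries


# Trimmed sample built from the values posted in this thread.
SAMPLE = """
[{"dev": "dsa0", "work_queue_size": 128, "state": "enabled",
  "groups": [
    {"dev": "group0.0",
     "grouped_workqueues": [
       {"dev": "wq0.0", "mode": "shared", "size": 16, "state": "enabled"}]},
    {"dev": "group0.1"}]}]
"""

if __name__ == "__main__":
    for summary in summarize_devices(SAMPLE):
        print(summary)
```

To check a live system, the SAMPLE string could be replaced with the text captured from accel-config list. A device with no entries in enabled_wqs, or with fits set to False, would point at the configuration step rather than at DML itself.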

mzhukova commented 7 months ago

Hi @chen982, are you using this config file? Also, what kind of hardware do you have?

chen982 commented 7 months ago

Yes, I am using this config, without success. I then configured the DSA as shown in the output I posted, but running the hardware path still returns error 100. What are the correct steps to use it?

mzhukova commented 7 months ago

Could you please run accel-config --version?

chen982 commented 7 months ago

I installed accel-config from the latest GitHub source, v4.1.3.

mzhukova commented 7 months ago

@chen982 I think there are a couple of issues going on.

First of all, there is a configuration issue. It is confusing to me that you're not able to configure the device (you're getting a "failed device" message), but you have non-empty accel-config list output. Also, this output doesn't seem to match what we have in the config file.

I would recommend running the following commands to disable your current configuration:

sudo accel-config disable-wq dsa0/wq0.0
sudo accel-config disable-device dsa0

Then repeat sudo accel-config load-config -c ../configs/1n1d1e1w-s-n1.conf (with the DML config file) and the steps that follow it.

If and only if this is resolved (meaning you're able to configure the device correctly with the DML config file) and you're still getting error code 100 out of DML, you might want to check that LD_LIBRARY_PATH includes the location of the libaccel-config library.
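The LD_LIBRARY_PATH check can be sketched roughly as follows. The helper name find_library_dirs is hypothetical, and the sketch only inspects the directories listed in LD_LIBRARY_PATH; it does not consult the dynamic loader's cache or default search paths, so an empty result is a hint rather than proof that the library cannot be loaded.

```python
import os
from pathlib import Path


def find_library_dirs(ld_library_path, prefix="libaccel-config.so"):
    """Return the LD_LIBRARY_PATH entries that contain a file starting with prefix."""
    hits = []
    for entry in ld_library_path.split(os.pathsep):
        if not entry:
            continue  # skip empty entries such as a trailing separator
        directory = Path(entry)
        if directory.is_dir() and any(
            item.name.startswith(prefix) for item in directory.iterdir()
        ):
            hits.append(entry)
    return hits


if __name__ == "__main__":
    found = find_library_dirs(os.environ.get("LD_LIBRARY_PATH", ""))
    if found:
        print("libaccel-config found in:", ", ".join(found))
    else:
        print("libaccel-config not found in any LD_LIBRARY_PATH entry")
```

In the ldd output earlier in this thread the library resolves to /usr/lib64/libaccel-config.so.1, so on that system /usr/lib64 would be the directory to look for.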

chen982 commented 7 months ago

Yes. When I use the config shipped with DML, it shows an error, so I used my own shared WQ config, which enables successfully. That configuration works for my other job, but not for the DML hardware path. Which kernel and idxd-config versions should I use? I am currently using the idxd-driver stage2.5 kernel (Linux 5.12-rc8+) and accel-config 4.1.3. I suspect the kernel source is the cause of this problem. Can you share a kernel and kernel config file that can run DML properly?