intel / DML

Intel® Data Mover Library (Intel® DML)
https://intel.github.io/DML/
MIT License

Debugging hardware path on Sapphire Rapids #25

Closed. Shamazo closed this issue 1 year ago

Shamazo commented 1 year ago

Hi,

I am unable to run hardware mode examples/tests. I did a fresh clone from master and built using GCC.

To configure the DSA and kernel, I followed the DSA user guide. I believe I have configured the DSA correctly because I can run the dsa_perf_micros benchmarks, e.g.:

sudo ./src/dsa_perf_micros -n128 -s16k -j -c -f -i8000 -k5 -w0 -zF,F -o3
[sudo] password for user1:
./src/dsa_perf_micros -n128 -s16k -j -c -f -i8000 -k5 -w0 -zF,F -o3
-j option is deprecated (default behavior)
blen                      16384
bstride                   16384
bstride                   16384
nb_bufs                     128
pg_size                       0
wq_type                       0
batch_sz                      1
iter                       8000
nb_cpus                       1
var_mmio                      1
dma                           1
verify                        1
misc_flags                    0
access_op[0]               Write
access_op[1]               Write
place_op[0]              Memory
place_op[1]              Memory
flags_cmask            ffffffff
flags_smask                   0
flags_nth_desc                1
nb_numa_node                 16
cpu_desc_work                 0
Memory affinity
CPUs in node 0:     -1 -1
Buffer Offsets      0 0
GB per sec = 31.170166 cpu 6.270452 kopsrate = 1902

However, I cannot run any of the tests/examples in DML with hardware mode, e.g.

[user1@sprnode5 high-level-api]$ ./hl_mem_move_example_example hardware_path
Executing using dml::hardware path
Starting dml::mem_move example...
Copy 1KB of data from source into destination...
Failure occurred.
[user1@sprnode5 high-level-api]$ ./hl_mem_move_example_example software_path
Executing using dml::software path
Starting dml::mem_move example...
Copy 1KB of data from source into destination...
Finished successfully.

(Note: I get the same output with or without sudo. I have chowned the work queues so that their group ownership is my user's group.)

Similarly, all tests pass with ./tests --path=sw, but ./tests --path=hw produces a very large stream of failures. A small sample here

Details: CPU: Intel (R) Xeon (R) CPU Max 9480

[user1@sprnode5 dsa_perf_micros]$ uname -r
6.3.0-2.el9.elrepo.x86_64
[user1@sprnode5 dsa_perf_micros]$ cat /etc/os-release
NAME="Rocky Linux"
VERSION="9.1 (Blue Onyx)"
[user1@sprnode5 dsa_perf_micros]$ gcc --version
gcc (GCC) 12.2.0

Full DSA config here

Is there anything in the setup I am forgetting/missing?

Thanks in advance, Hamish

mzhukova commented 1 year ago

Hi @Shamazo, I noticed error code 100 in your test output, which is DML_STATUS_LIBACCEL_NOT_FOUND. Let's double-check that first.

Additionally, when you run the examples, could you please check which result.status value leads to the reported failure?

Shamazo commented 1 year ago

Hi @mzhukova

I cloned the develop branch, so commit 4cf7cab374ef0869d91c1b02d683d334d59f27d3

accel-config was installed via dnf.

[user1@sprnode5 high-level-api]$ ldd hl_batch_example_example
    linux-vdso.so.1 (0x00007ffce4d73000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f67f5600000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f67f5525000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f67f5853000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f67f5200000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f67f587b000)
[user1@sprnode5 high-level-api]$ which accel-config
/usr/bin/accel-config
[user1@sprnode5 high-level-api]$ accel-config --version
3.4.6.3
[user1@sprnode5 high-level-api]$ ldd /usr/bin/accel-config
    linux-vdso.so.1 (0x00007ffdfcf94000)
    libaccel-config.so.1 => /lib64/libaccel-config.so.1 (0x00007f522ea83000)
    libjson-c.so.5 => /lib64/libjson-c.so.5 (0x00007f522ea70000)
    libuuid.so.1 => /lib64/libuuid.so.1 (0x00007f522ea67000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f522e800000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f522eabc000)

In the high level examples, I get error code 16 / dml::status_code::error

Printing out with std::cout << "Failure occurred. Error code: " << static_cast<std::underlying_type<dml::status_code>::type>(result.status) << std::endl;

/tmp/tmp.wNEiagzC7n/cmake-build-release/external/DML/examples/high-level-api/hl_mem_move_example_example hardware_path
Executing using dml::hardware path
Starting dml::mem_move example...
Copy 1KB of data from source into destination...
Failure occurred. Error code: 16
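
For anyone reproducing this, the print statement above can be wrapped in a small helper so every example reports its numeric status the same way. This is only a sketch: `status_code` below is a hypothetical stand-in enum so the snippet builds without DML, and it lists only the two values mentioned in this thread; `to_underlying` and `failure_message` are illustrative names, not part of the DML API. With the real library you would pass `result.status` instead.

```cpp
#include <cstdint>
#include <string>
#include <type_traits>

// Hypothetical stand-in for dml::status_code so the sketch builds without DML.
// Only the two values mentioned in this thread are listed.
enum class status_code : std::uint32_t { ok = 0, error = 16 };

// Convert any scoped enum to its underlying integer type (C++14 or later).
template <typename Enum>
constexpr auto to_underlying(Enum value) noexcept {
    static_assert(std::is_enum<Enum>::value, "to_underlying expects an enum");
    return static_cast<typename std::underlying_type<Enum>::type>(value);
}

// Build the failure line with the numeric status appended.
template <typename Enum>
std::string failure_message(Enum status) {
    return "Failure occurred. Error code: " + std::to_string(to_underlying(status));
}
```

In an example's failure branch this would be used as `std::cout << failure_message(result.status) << std::endl;`.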

Thanks, Hamish

Shamazo commented 1 year ago

Hi @mzhukova,

Is there any other information I can provide? I only have access to this machine for a couple more days.

Thanks, Hamish

mzhukova commented 1 year ago

Hi @Shamazo, apologies for the delayed response. I see in your accel-config list output that you're trying to use a dedicated work queue (DWQ). Is this correct? If so, this is not supported in DML; see the Library Limitations section.

Is there any particular reason you need DWQ support?
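
A quick way to check the queue mode without parsing accel-config output is to read it from sysfs. The following is a minimal sketch, assuming the Linux idxd driver exposes each work queue as a `wq<m>.<n>` entry under `/sys/bus/dsa/devices` with a `mode` attribute containing `shared` or `dedicated`. The function names are illustrative and not part of DML or accel-config.

```cpp
#include <filesystem>
#include <fstream>
#include <map>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Read the first line of a small sysfs-style attribute file.
// Returns an empty string if the file cannot be read.
std::string read_attribute(const fs::path& file) {
    std::ifstream in(file);
    std::string value;
    std::getline(in, value);
    return value;
}

// Map each entry whose name starts with "wq" under `devices_dir`
// to the contents of its "mode" attribute.
// Assumption: on a real system devices_dir is /sys/bus/dsa/devices.
std::map<std::string, std::string> list_wq_modes(const fs::path& devices_dir) {
    std::map<std::string, std::string> modes;
    std::error_code ec;
    for (fs::directory_iterator it(devices_dir, ec), end; !ec && it != end; it.increment(ec)) {
        const std::string name = it->path().filename().string();
        if (name.rfind("wq", 0) == 0) {
            modes[name] = read_attribute(it->path() / "mode");
        }
    }
    return modes;
}

// True if at least one work queue reports "shared" mode.
bool has_shared_wq(const std::map<std::string, std::string>& modes) {
    for (const auto& entry : modes) {
        if (entry.second == "shared") {
            return true;
        }
    }
    return false;
}
```

Calling `list_wq_modes("/sys/bus/dsa/devices")` and then `has_shared_wq` on the result shows whether any queue is in shared mode. Taking the directory as a parameter keeps the sketch testable against a temporary directory on machines without a DSA device.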

Shamazo commented 1 year ago

Thank you for pointing that out; I had missed it in the limitations. I would suggest also mentioning that limitation in the configuration part of the installation instructions.

I don't fundamentally need DWQ, but I work on systems that are effectively single-tenant, so I thought a DWQ might perform better since I don't need to share the DSA resource. I have not yet measured the performance difference between shared and dedicated queues.

Closing this issue for now.

I will let you know if I run into a specific reason to support DWQ.

mzhukova commented 1 year ago

Sure @Shamazo, I'll try to make it clearer in the documentation. Thanks!