intel / DML

Intel® Data Mover Library (Intel® DML)
https://intel.github.io/DML/
MIT License

Segfault with multi-thread since the port ptr becomes null #19

Closed guoanwu closed 1 year ago

guoanwu commented 2 years ago

Using the PR https://github.com/intel/DML/pull/18, run the performance test with multiple threads using the following command:

./examples/dml_example_c_api_perftest 128 10000 4096 0 16

Output:

allocatate the 4096 aligned src=0x7f3a7a388000, dst=0x7f3a79b86000

jobs=0x7f3a7c82f010
jobs=0x7f3a7c83ac10
jobs=0x7f3a7c846810
jobs=0x7f3a7c852410
jobs=0x7f3a7c85e010
jobs=0x7f3a7c869c10
jobs=0x7f3a7c875810
jobs=0x7f3a7c881410
jobs=0x7f3a7c88d010
jobs=0x7f3a7c898c10
Starting example for multi-job memory move jobs=0x7f3a7c846810:
jobs=0x7f3a7c8a4810
Starting example for multi-job memory move jobs=0x7f3a7c83ac10:
Starting example for multi-job memory move jobs=0x7f3a7c82f010:
jobs=0x7f3a7c8b0410
jobs=0x7f3a7c8bc010
Starting example for multi-job memory move jobs=0x7f3a7c869c10:
Starting example for multi-job memory move jobs=0x7f3a7c85e010:
jobs=0x7f3a7c8c7c10
Starting example for multi-job memory move jobs=0x7f3a7c852410:
jobs=0x7f3a7c8d3810
Starting example for multi-job memory move jobs=0x7f3a7c8c7c10:
jobs=0x7f3a7c8df410
Starting example for multi-job memory move jobs=0x7f3a7c8d3810:
Starting example for multi-job memory move jobs=0x7f3a7c898c10:
Starting example for multi-job memory move jobs=0x7f3a7c8b0410:
Starting example for multi-job memory move jobs=0x7f3a7c8df410:
Starting example for multi-job memory move jobs=0x7f3a7c881410:
Starting example for multi-job memory move jobs=0x7f3a7c88d010:
Starting example for multi-job memory move jobs=0x7f3a7c8a4810:
Starting example for multi-job memory move jobs=0x7f3a7c8bc010:
Starting example for multi-job memory move jobs=0x7f3a7c875810:
Segmentation fault (core dumped)

Using gdb, we find that the port ptr is null:

Thread 13 "dml_example_c_a" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffef1f4700 (LWP 123148)]
0x0000000000422057 in dml::core::dispatcher::hw_queue::enqueue_descriptor (this=0x62dc38 <dml::core::dispatcher::instance+56>, desc_ptr=0x7ffff7fac4c0) at /home/dennis/DML/sources/core/src/hw_dispatcher/hw_queue.cpp:92
92 : "a"(current_place_ptr), "d"(desc_ptr));
Missing separate debuginfos, use: yum debuginfo-install libgcc-8.5.0-4.el8_5.x86_64 libpmem-1.6.1-1.el8.x86_64 libstdc++-8.5.0-4.el8_5.x86_64 libuuid-2.32.1-28.el8.x86_64
(gdb) p current_place_ptr
$1 = (void *) 0x0
(gdb)

By the code's logic, the port ptr cannot be null, since portal_mask_ never changes. In this case, however, portal_mask_ had changed to zero, so I suspect that an out-of-bounds write somewhere else overwrote this data and caused the issue.
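To illustrate the suspicion above, here is a minimal sketch of how a zeroed mask could produce a null port pointer. It is not the DML source: `toy_queue`, `next_portal`, and `corrupt` are hypothetical names, and the 4 KiB page with 64-byte slots is an assumption made only for this example.

```cpp
#include <cstdint>

// Hypothetical stand-in for a hardware work queue; NOT the DML class.
struct toy_queue {
    std::uintptr_t portal_mask;    // assumed: page-aligned base address of the mapped portal
    std::uint64_t  portal_offset;  // assumed: incremented on every submission

    // Returns the address the next descriptor would be submitted to.
    void *next_portal() {
        constexpr std::uintptr_t page_mask = 0xFFF;  // assumed 4 KiB portal page
        // Assumed 64-byte slots: shift the counter by 6 and wrap inside the page.
        std::uintptr_t offset =
            static_cast<std::uintptr_t>(portal_offset++ << 6) & page_mask;
        return reinterpret_cast<void *>(portal_mask | offset);
    }
};

// Simulates a stray write (for example, an overflowing buffer) that zeroes the object.
inline void corrupt(toy_queue &q) {
    q.portal_mask   = 0;
    q.portal_offset = 0;
}
```

With a valid mask, `next_portal` walks through addresses inside the mapped page. After `corrupt`, the first call returns a null pointer, which would match the `0x0` that gdb printed for `current_place_ptr`. If something like this is happening, a hardware watchpoint on the mask field in gdb would be one way to catch the write that clears it.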

optimistyzy commented 2 years ago

@guoanwu Could you apply this PR for a test: https://github.com/intel/DML/pull/15?

mzhukova commented 1 year ago

Hi @guoanwu, this should be addressed with https://github.com/intel/DML/commit/e44443c24d53552b248b9869b1b16f89cd970f52, please let me know if you could verify or if I could close the issue.

mzhukova commented 1 year ago

Verified that it works correctly now, so closing the issue.