Preliminary story

I am trying to improve the parallel performance of NNTile on a server with several GPUs. Unfortunately, the STARPU_REDUX access mode leads to much worse performance than STARPU_RW|STARPU_COMMUTE, even though the output is nearly the same. Today I tried to take advantage of StarPU data arbiters, and I got several different errors when running my application. I created a separate arbiter for each matrix -- I hope this is how it is meant to be used.
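For context, the two accumulation patterns I compared are used roughly as follows. This is a simplified sketch: the codelets and handles (cl_accumulate, cl_redux_add, cl_redux_init, dst, src) are placeholders, not the actual NNTile code.

```c
#include <starpu.h>

/* Commutative read-write access: tasks updating "dst" may be reordered by
 * the scheduler, but each one still gets exclusive access to "dst" while
 * it runs. */
static void submit_commute(struct starpu_codelet *cl_accumulate,
                           starpu_data_handle_t dst, starpu_data_handle_t src)
{
    starpu_task_insert(cl_accumulate,
                       STARPU_RW | STARPU_COMMUTE, dst,
                       STARPU_R, src,
                       0);
}

/* Reduction access: each worker accumulates into a private replicate of
 * "dst" (initialized by cl_redux_init); StarPU later combines the
 * replicates with cl_redux_add.  The reduction methods only need to be
 * set once per handle, before the first STARPU_REDUX access. */
static void submit_redux(struct starpu_codelet *cl_accumulate,
                         struct starpu_codelet *cl_redux_add,
                         struct starpu_codelet *cl_redux_init,
                         starpu_data_handle_t dst, starpu_data_handle_t src)
{
    starpu_data_set_reduction_methods(dst, cl_redux_add, cl_redux_init);
    starpu_task_insert(cl_accumulate,
                       STARPU_REDUX, dst,
                       STARPU_R, src,
                       0);
}
```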
Steps to reproduce
I am using StarPU at the starpu-1.3 tag of the GitLab repo (commit 1ace9c2ac6dccca341d4c4ce08f924581318c808). I enabled an arbiter for every matrix in my application, following the example tests/datawizard/test_arbiter.cpp (see the sketch at the end of this report). When I run my application on a server with GPUs, I get different errors. For example:
with the corresponding backtrace and config.log.
or another error:
with the corresponding backtrace; config.log is the same as above.
CUDA version is 12.2.
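For reference, the way I enable an arbiter for every matrix follows the pattern of tests/datawizard/test_arbiter.cpp. Below is a minimal, self-contained sketch of that pattern; the names (NX, NY, data, handle) are illustrative, and this is not the actual NNTile code.

```c
#include <stdint.h>
#include <stdlib.h>
#include <starpu.h>

int main(void)
{
    const uint32_t NX = 1024, NY = 1024;
    float *data = malloc(NX * NY * sizeof(*data));

    if (starpu_init(NULL) != 0)
        return 1;

    /* One dedicated arbiter for this matrix. */
    starpu_arbiter_t arbiter = starpu_arbiter_create();

    starpu_data_handle_t handle;
    starpu_matrix_data_register(&handle, STARPU_MAIN_RAM, (uintptr_t)data,
                                NX, NX, NY, sizeof(*data));

    /* Concurrent commuting accesses (STARPU_RW|STARPU_COMMUTE) to this
     * handle are now managed through the arbiter. */
    starpu_data_assign_arbiter(handle, arbiter);

    /* ... submit tasks accessing "handle" with STARPU_RW|STARPU_COMMUTE ... */

    starpu_task_wait_for_all();
    starpu_data_unregister(handle);
    starpu_arbiter_destroy(arbiter);
    starpu_shutdown();
    free(data);
    return 0;
}
```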