Open Muxas opened 5 months ago
I'm not sure how to use your reproducer since it only puts zeroes in the value in cpu memory. But I guess it's the missing initialized = 0 case that the starpu-1.3 branch indeed didn't have, and that was indeed uncovered by this new situation (in the past we wouldn't have replicates that are allocated but not initialized or planned to be). I have pushed a fix to the starpu-1.3 branch.
The fix did not solve the problem. I believe you did not catch what I was trying to convey. You do not have to use my reproducer. You need to take a look at these lines:
starpu_task_insert(&set_codelet, STARPU_W, x_handle, 0); // Init X
starpu_data_invalidate_submit(x_handle); // Invalidate X
//starpu_task_insert(&set_codelet, STARPU_W, x_handle, 0); // Init X is ignored
starpu_task_insert(&use_codelet, STARPU_R, x_handle, 0); // Read X right after invalidation without an error
Order of tasks for handle X:
set_codelet: set X to zero. This is done in STARPU_W mode, so StarPU does memory allocation.
invalidate_codelet: invalidates handle X and frees previously allocated memory.
use_codelet: reads unallocated X. Memory is allocated by StarPU and it might be filled with some random data. StarPU gives this random data to my use_codelet to update value. You can add printf function to the use_codelet and it will print random uninitialized data.
The fix did not solve the problem.
The behavior is unchanged completely?
I believe you did not catch what I was trying to convey. You do not have to use my reproducer. You need to take a look at these lines:
starpu_task_insert(&set_codelet, STARPU_W, x_handle, 0); // Init X
starpu_data_invalidate_submit(x_handle); // Invalidate X
//starpu_task_insert(&set_codelet, STARPU_W, x_handle, 0); // Init X is ignored
starpu_task_insert(&use_codelet, STARPU_R, x_handle, 0); // Read X right after invalidation without an error
1. Init handle X.
2. Invalidate handle X. From this point it cannot be accessed in read mode.
3. Use handle X in STARPU_R mode. It shall be in an uninitialized state at this point, so accessing it as STARPU_R must be prohibited.
Prohibited? Why? The starpu_data_set_reduction_methods call provides the initializer, so starpu just needs to know that the data is now uninitialized, and thus has to call the init_cl again. This is what the addition of replicate->initialized = 0 is supposed to bring, so that _starpu_fetch_task_input_tail sets needs_init to 1 and thus _starpu_init_data_replicate gets called.
With my fix, I see clear_func getting called, while before it wasn't getting called and thus the value was indeed undefined in use_func.
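The mechanism described above can be pictured with a small standalone C model. It does not use StarPU at all: toy_replicate, toy_write, toy_invalidate, toy_fetch_for_read and toy_clear are hypothetical names made up for this sketch, and the logic only mirrors what this comment says (invalidation clears an initialized flag, and a later read either runs the registered initializer or fails), not StarPU's actual implementation.

```c
#include <stddef.h>

/* Toy stand-in for one per-device copy of a data handle.
 * This is NOT StarPU's replicate structure. */
struct toy_replicate {
    double value;
    int allocated;    /* memory exists for this copy */
    int initialized;  /* memory holds meaningful content */
};

/* Optional initializer, playing the role of the init codelet that
 * starpu_data_set_reduction_methods registers (assumption of the model). */
typedef void (*toy_init_fn)(double *value);

static void toy_clear(double *value) { *value = 0.0; }

/* Write access: the content becomes valid. */
static void toy_write(struct toy_replicate *r, double v)
{
    r->allocated = 1;
    r->value = v;
    r->initialized = 1;
}

/* Invalidation: the content is no longer meaningful. Resetting the flag
 * models the "initialized = 0" step discussed in this thread. The stale
 * value is deliberately left in place. */
static void toy_invalidate(struct toy_replicate *r)
{
    r->initialized = 0;
}

/* Read access. Returns 0 on success, -1 when the data is uninitialized
 * and no initializer is available to rebuild it. */
static int toy_fetch_for_read(struct toy_replicate *r, toy_init_fn init,
                              double *out)
{
    int needs_init = !r->initialized;
    if (needs_init) {
        if (init == NULL)
            return -1;        /* nothing can produce valid content */
        r->allocated = 1;
        init(&r->value);      /* implicit re-initialization */
        r->initialized = 1;
    }
    *out = r->value;
    return 0;
}
```

In this model, writing 42, invalidating, and then reading with toy_clear registered yields 0 rather than the stale 42, while reading with no initializer returns -1. If toy_invalidate did not reset the flag, the read would silently hand back the stale value.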
The starpu_data_set_reduction_methods call sets reduction methods, but there is no reduction in the provided example. I accidentally put starpu_data_invalidate_submit in the wrong place, but instead of indicating that I made a mistake, StarPU cleared the buffer. I did not ask to clear it! This will lead to many errors due to implicit clearing. Another StarPU user might put starpu_data_invalidate_submit in the wrong place and it will not raise any error. This kind of implicit behavior is very hard to debug!
Just to conclude on my side:
STARPU_REDUX or STARPU_MPI_REDUX access modes.
I understand that if it is done the way I see it, it might break the STARPU_MPI_REDUX mode, because locally on a single node STARPU_MPI_REDUX will be translated into the STARPU_RW|STARPU_COMMUTE access mode, which, in my opinion, shall not support an implicit initializer.
My proposal: add a new access mode STARPU_ALLOW_IMPLICIT_INIT. A user can use this flag to explicitly tell StarPU that it is OK to call the initializer implicitly when uninitialized data is accessed, and only for the specific tasks where this access mode appears. With this approach, the user is at least warned about possible unintended initializations. STARPU_REDUX and STARPU_MPI_REDUX would include the flag automatically.
I am just wondering if my proposal makes sense to you. I described how I would implement it myself.
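To make the proposal concrete, here is a minimal standalone C sketch of the decision rule it implies. Nothing here is StarPU code: STARPU_ALLOW_IMPLICIT_INIT does not exist in StarPU, and the TOY_* constants and toy_check_access are hypothetical names with arbitrary bit values chosen only for this illustration.

```c
/* Hypothetical access-mode bits for the sketch; the values are arbitrary
 * and unrelated to StarPU's real enum. */
enum toy_access_mode {
    TOY_R                   = 1 << 0,
    TOY_W                   = 1 << 1,
    TOY_REDUX_BIT           = 1 << 2,
    TOY_ALLOW_IMPLICIT_INIT = 1 << 3,  /* the proposed opt-in flag */
    /* Per the proposal, reduction modes carry the flag automatically. */
    TOY_REDUX               = TOY_REDUX_BIT | TOY_ALLOW_IMPLICIT_INIT
};

enum toy_decision {
    TOY_PROCEED,   /* access can go ahead as is */
    TOY_RUN_INIT,  /* run the registered initializer first */
    TOY_ERROR      /* uninitialized data accessed without opt-in */
};

/* Decide what to do when a task accesses a handle.
 * initialized:     whether the data currently holds valid content
 * has_initializer: whether an init method was registered for the handle */
static enum toy_decision toy_check_access(unsigned mode, int initialized,
                                          int has_initializer)
{
    /* Only modes that consume existing content care about validity. */
    int needs_valid_content = (mode & (TOY_R | TOY_REDUX_BIT)) != 0;

    if (initialized || !needs_valid_content)
        return TOY_PROCEED;

    /* Implicit initialization happens only when the task opted in. */
    if ((mode & TOY_ALLOW_IMPLICIT_INIT) && has_initializer)
        return TOY_RUN_INIT;

    return TOY_ERROR;
}
```

Under this rule a plain read of invalidated data is reported as an error even when an initializer is registered, while the same read with the opt-in flag, or a reduction access, triggers the initializer.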
Hi!
I have just added starpu_data_invalidate_submit to my code. Of course, I made mistakes doing it. Some cases were reported by StarPU, signaling that some data was not initialized when read. But some cases were not. I found out that marking data with starpu_data_set_reduction_methods makes it immune to such an assert if the data remains on the same device and its access mode is STARPU_R.
If the data is on GPU and reduction methods only support CPUs, then the following error is printed:
However, if the reduction methods are supported on the device where the data is allocated, then the program is not stopped, no error is thrown, and the result of the computations becomes silently wrong (undefined behavior).
Here is a simple program to reproduce: