jdoerfert commented 4 years ago


Bugzilla Link	45047
Version	unspecified
OS	Linux
CC	@AndreyChurbanov,@jpeyton52,@mjklemm

Extended Description

In the following simple example we store A twice into the task structure (https://godbolt.org/z/Wq9xx6). The second time is especially problematic since we use the hardcoded 40 byte offset from the task_alloc return instead of loading the shared pointer value (first in the task struct).

void foo(int *A) {

pragma omp task

{
    *A = 0;
}

}

jdoerfert commented 4 years ago

There is two problems here: 1) We capture it twice, which will be gone with the OpenMPIRBuilder code path. 2) We hard coded the offset into the IR which tightly couples the IR to the specific runtime implementation. The proper way is describe by Michael.

AndreyChurbanov commented 4 years ago

When the generated code invokes __kmpc_omp_task_alloc, it seems to pass wrong byte counts:

__kmpc_omp_task_alloc: sizeof_kmp_task_t=48 sizeof_shareds=8

This should be:

__kmpc_omp_task_alloc: sizeof_kmp_task_t=40 sizeof_shareds=8

as sizeof(kmp_task_t) is 40. I'd say it should better be __kmpc_omp_task_alloc: sizeof_kmp_task_t=48 sizeof_shareds=0

Because firstprivate is written to kmp_task_t_with_privates (increasing its size from 40 to 48 bytes), while sample code does not have any shareds to propagate to the task.

I indeed misread the IR, and looking at the assembler generated for Michael's example I see the value of the variable is written twice (not value and address as I suspected initially). So Johannes' investigation looks correct.

Though I still think the write to shareds may be treated as "wrong", while the write to privates is correct.

mjklemm commented 4 years ago

I'm not 100% sure about my analysis, but I think there's another issue associated with this bug report. It also seems that the compiler not only writes the task memory twice, it also causes some extra memory consumption. When the generated code invokes __kmpc_omp_task_alloc, it seems to pass wrong byte counts:

__kmpc_omp_task_alloc: sizeof_kmp_task_t=48 sizeof_shareds=8

This should be:

__kmpc_omp_task_alloc: sizeof_kmp_task_t=40 sizeof_shareds=8

as sizeof(kmp_task_t) is 40. With tasks that receive more data, it seems that clang incorrectly adds the size of the shareds space to the size of the kmp_task_t structure. In the runtime, the allocation size is computed using sizeof_kmp_task_t+sizeof_shareds, so some extra memory is allocated that is not needed.

In addition, the way the duplicated writes are done store the duplicated values at task+0x28 and task->shareds:

0x400879 callq __kmpc_omp_task_alloc 0x40087e mov (%rax),%rcx # load "task->shareds" (address: task+48b) 0x400881 mov -0x10(%rbp),%rdi # load 'd' from stack 0x400885 mov %rdi,(%rcx) # write to "(double)task->shareds" 0x400888 movsd (%rcx),%xmm0 # load 'd' into xmm0 0x40088c movsd %xmm0,0x28(%rax) # write 'd' at task+40b

This only works correctly, because the OpenMP runtime allocates the unneeded, extra amount (see kmp_tasking.cc:1229-1245):

1227 // Calculate shared structure offset including padding after kmp_task_t struct 1228 // to align pointers in shared struct 1229 shareds_offset = sizeof(kmp_taskdata_t) + sizeof_kmp_task_t; 1230 shareds_offset = __kmp_round_up_to_val(shareds_offset, sizeof(void )); [...] 1240 taskdata = (kmp_taskdata_t )kmp_fast_allocate(thread, shareds_offset + 1241 sizeof_shareds); [...] 1243 taskdata = (kmp_taskdata_t *)kmp_thread_malloc(thread, shareds_offset + 1244 sizeof_shareds);

Reproducer:

// compile with: clang -fopenmp -o reproducer reproducer.c

include

void create_task(double d) {

pragma omp task firstprivate(d)

{
    printf("d=%lf\n", d);
}

}

int main(int argc, char *argv[]) { double d = 42.0;

pragma omp parallel

{
    if (omp_get_thread_num() == 0) {
        create_task(d);
    }
}
return 0;

}

mjklemm commented 4 years ago

The problem IMHO is that the offset used (40) is computed statically instead of simply following the "shareds" pointer in kmp_task_t. So, the proper way of referring to the storage location would be to have assignments like this:

task = __kmpc_omp_task_alloc(...) (void)(((char)task->shared) + 0) = ptr; (double)(((char)task->shared) + 8) = some_dp_value; (int)(((char*)task->shared) + 16) = some_other_value;

Right now, the compiler seems to do this:

task = __kmpc_omp_task_alloc(...) (void)(((char)task) + 40) = ptr; (double)(((char)task) + 48) = some_dp_value; (int)(((char*)task) + 56) = some_other_value;

Which I you think about a runtime that tries to be compatible requires you to do the same trick as the LLVM OpenMP runtime and pre-prend the kmp_taskdata_t struct /before/ kmp_task_t, instead of simply allowing to extend the kmp_task_t structure.

AndreyChurbanov commented 4 years ago

Sorry if I misread the IR (still newbie on that), but what I see on Godbolt is: pointer A is written to task structure once - to privates area (at offset 40).

And it is read from that location inside the outlined task routine.

What is unclear to me - why stack address (where pointer A is also written) is put into the shareds area of the task structure? This address is anyway going to be stale during task execution... So to me the first store into the task structure looks suspicious.

BTW, according to OpenMP spec the pointer A should be firstprivate, so keeping it in privates looks reasonable.

llvm / llvm-project

Captured variables stored twice into task #44392

Extended Description

pragma omp task

include

include

pragma omp task firstprivate(d)

pragma omp parallel