llvm / llvm-project


Incorrect result with OMP_TARGET_OFFLOAD=DISABLED #41738

Open ye-luo opened 5 years ago

ye-luo commented 5 years ago
Bugzilla Link 42393
Version unspecified
OS Linux
CC @alexey-bataev, @Rombur, @jdoerfert

Extended Description

#include <iostream>
#include <omp.h>

The attached code gives correct results with OMP_TARGET_OFFLOAD=MANDATORY but wrong results with OMP_TARGET_OFFLOAD=DISABLED

$ OMP_NUM_THREADS=2 OMP_TARGET_OFFLOAD=MANDATORY ./a.out
tid = 1
  0  1  2  3
tid = 0
  0  1  2  3

$ OMP_NUM_THREADS=2 OMP_TARGET_OFFLOAD=DISABLED ./a.out
tid = 0
  0  1  0  0
tid = 1
  0  0  2  3

int main()
{
  const int size = 4;

  #pragma omp parallel
  {
    int A[size];
    for(int i = 0; i < size; i++)
      A[i] = 0;

    #pragma omp target teams distribute map(tofrom: A[:size])
    for(int i = 0; i < size; i++)
    {
      A[i] = i;
    }

    #pragma omp critical
    {
      std::cout << "tid = " << omp_get_thread_num() << std::endl;
      for(int i = 0; i < size; i++)
        std::cout << "  " << A[i];
      std::cout << std::endl;
    }
  }
}

jdoerfert commented 4 years ago

> On the host we generate an `omp for`-like loop for the `distribute`, which binds to the outer `parallel`. Thus, each thread of the outer `parallel` executes only a single iteration of the `A[i] = i` loop instead of the entire thing.

> If an `omp for`-like loop is generated for `distribute`, an `omp parallel`-like region should be generated for `teams`. Binding to the outer `parallel` is clearly wrong.

Agreed.
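
For illustration only, here is a minimal host-only sketch (not the actual clang codegen) of the behavior described above: the loop is turned into a worksharing loop that binds to the user's outer `parallel`, so the two threads split the four iterations over their own private copies of `A`, reproducing the `0 1 0 0` / `0 0 2 3` output from the report.

```cpp
#include <iostream>
#include <omp.h>

int main() {
  const int size = 4;
  #pragma omp parallel num_threads(2)
  {
    int A[size];
    for (int i = 0; i < size; i++)
      A[i] = 0;

    // Effectively what the reported host lowering does: the loop binds to the
    // *outer* parallel, so each thread fills only its share of its private A.
    #pragma omp for
    for (int i = 0; i < size; i++)
      A[i] = i;

    #pragma omp critical
    {
      std::cout << "tid = " << omp_get_thread_num();
      for (int i = 0; i < size; i++)
        std::cout << "  " << A[i];   // e.g. "0 1 0 0" and "0 0 2 3"
      std::cout << std::endl;
    }
  }
  return 0;
}
```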

ye-luo commented 4 years ago

> On the host we generate an `omp for`-like loop for the `distribute`, which binds to the outer `parallel`. Thus, each thread of the outer `parallel` executes only a single iteration of the `A[i] = i` loop instead of the entire thing.

If an `omp for`-like loop is generated for `distribute`, an `omp parallel`-like region should be generated for `teams`. Binding to the outer `parallel` is clearly wrong.
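
A hedged sketch of the behavior being suggested here (again an illustration, not the actual lowering): if `teams` maps to its own, single-threaded `parallel`-like region and the `distribute` loop binds to that region rather than to the user's outer `parallel`, every outer thread executes the whole iteration space on its private `A`.

```cpp
#include <iostream>
#include <omp.h>

int main() {
  const int size = 4;
  #pragma omp parallel num_threads(2)     // the user's outer parallel
  {
    int A[size] = {0, 0, 0, 0};

    // Hypothetical host expansion: "teams" becomes its own single-threaded
    // parallel-like region, and the distribute loop binds to that region.
    #pragma omp parallel num_threads(1)
    {
      #pragma omp for
      for (int i = 0; i < size; i++)
        A[i] = i;                         // every outer thread runs all iterations
    }

    #pragma omp critical
    {
      std::cout << "tid = " << omp_get_thread_num();
      for (int i = 0; i < size; i++)
        std::cout << "  " << A[i];        // expected: 0 1 2 3 for both threads
      std::cout << std::endl;
    }
  }
  return 0;
}
```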

jdoerfert commented 4 years ago

On the host we generate an `omp for`-like loop for the `distribute`, which binds to the outer `parallel`. Thus, each thread of the outer `parallel` executes only a single iteration of the `A[i] = i` loop instead of the entire thing.

ye-luo commented 4 years ago

I tried compiling this without offload:

$ clang++ -fopenmp debug.cpp
$ OMP_NUM_THREADS=2 ./a.out
tid = 0
  0  1  0  0
tid = 1
  0  0  2  3

Probably something is wrong with libomp already.

alexey-bataev commented 5 years ago

After some investigation, it seems to me the problem is in libomp. When we schedule the distribute loop, the tid is taken from the outer threads (though it should be set to 0 in all cases) and the number of threads is taken from the outer parallel region (though, it seems to me, it should be set to 1). It would be good to check whether libomp works correctly here.
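
To connect this to the observed output, a small standalone sketch (illustrative static chunking, not libomp's scheduling code) shows what happens when the distribute schedule is computed with the outer parallel's thread id and team size instead of tid = 0 and a team size of 1:

```cpp
#include <cstdio>

// Illustrative static chunking of [0, size) across nthreads threads.
// Not libomp's actual code; it only shows the effect described above.
static void run_loop(int *A, int size, int tid, int nthreads) {
  int chunk = (size + nthreads - 1) / nthreads;   // iterations per thread
  int lb = tid * chunk;
  int ub = (lb + chunk < size) ? lb + chunk : size;
  for (int i = lb; i < ub; i++)
    A[i] = i;
}

int main() {
  const int size = 4;

  // Wrong: schedule computed with the *outer* parallel's tid and team size.
  for (int tid = 0; tid < 2; tid++) {
    int A[size] = {0, 0, 0, 0};                   // each thread's private copy
    run_loop(A, size, tid, /*nthreads=*/2);
    std::printf("tid = %d:  %d %d %d %d\n", tid, A[0], A[1], A[2], A[3]);
    // prints "0 1 0 0" for tid 0 and "0 0 2 3" for tid 1
  }

  // Expected: inside the target region tid is 0 and the team size is 1,
  // so every encountering thread executes the whole iteration space.
  int A[size] = {0, 0, 0, 0};
  run_loop(A, size, /*tid=*/0, /*nthreads=*/1);
  std::printf("expected: %d %d %d %d\n", A[0], A[1], A[2], A[3]);  // 0 1 2 3
  return 0;
}
```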

alexey-bataev commented 5 years ago

The problem is caused by the kmpc_push_target_tripcount function: it is not thread-safe and has a data race, which causes incorrect results.

Also, it seems to me there is a problem with the runtime; I will investigate this.
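
For context, a hypothetical sketch of the race pattern being described (invented names; this is not the actual kmpc_push_target_tripcount implementation): if the trip count is pushed into a single shared slot before the target region is launched, two host threads hitting the same target region can overwrite each other's value.

```cpp
#include <cstdio>
#include <thread>

// Hypothetical stand-ins for the "push trip count, then launch" pattern.
static long long g_tripcount = 0;             // one slot shared by all threads

static void push_tripcount(long long n) {
  g_tripcount = n;                            // unsynchronized write: data race
}

static void launch_target_region(int tid) {
  long long n = g_tripcount;                  // may read the other thread's value
  std::printf("thread %d launches with tripcount %lld\n", tid, n);
}

int main() {
  auto worker = [](int tid, long long n) {
    push_tripcount(n);                        // e.g. the distribute loop's trip count
    launch_target_region(tid);
  };
  std::thread t0(worker, 0, 4), t1(worker, 1, 8);
  t0.join();
  t1.join();
  // A fix would keep the trip count per thread (or pass it along with the
  // launch) instead of storing it in one shared slot.
  return 0;
}
```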

alexey-bataev commented 5 years ago

The problem is caused by the kmpc_push_target_tripcount function: it is not thread-safe and has a data race, which causes incorrect results.