Questions about preventing the merging of OpenMP parallel regions.

I don't understand why the OpenMP parallel regions are not being merged during OpenMP optimization in the following code.

program main
  implicit none
  integer :: a
  call f(a)
  !$omp parallel
    a = a + 1
  !$omp end parallel
  !$omp parallel
    a = a + 1
  !$omp end parallel
end program main

I found the code that prevents parallel regions from being merged in OpenMPOpt.cpp.

if (IsBeforeMergableRegion) {
  Function *CalledFunction = CI->getCalledFunction();
  if (!CalledFunction)
    return false;
  // Return false (unmergable) if the call before the parallel
  // region calls an explicit affinity (proc_bind) or number of
  // threads (num_threads) compiler-generated function. Those settings
  // may be incompatible with following parallel regions.
  // TODO: ICV tracking to detect compatibility.
  for (const auto &RFI : UnmergableCallsInfo) {
    if (CalledFunction == RFI.Declaration)
      return false;
  }
}

However, I cannot understand the logic behind this part. In what situations would merging parallel regions after calling an explicit affinity (proc_bind) or number of threads (num_threads) compiler-generated function lead to errors?
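
For readers following along, the quoted check can be restated as a small standalone model. This is only a sketch, not the code in OpenMPOpt.cpp: it assumes that UnmergableCallsInfo refers to the runtime entry points __kmpc_push_num_threads and __kmpc_push_proc_bind (as the comment suggests), and it represents each call by its callee name, with an empty string standing in for an indirect call. The helper names are hypothetical.

```cpp
#include <string>
#include <vector>

// Assumption: these are the runtime calls the quoted comment refers to.
bool isUnmergableCall(const std::string &Callee) {
  return Callee == "__kmpc_push_num_threads" ||
         Callee == "__kmpc_push_proc_bind";
}

// Model of the check for calls that directly precede a parallel region.
// An empty name stands for an indirect call with no known callee.
bool callsBeforeRegionAreMergable(const std::vector<std::string> &Callees) {
  for (const std::string &Callee : Callees) {
    if (Callee.empty())
      return false; // unknown callee: be conservative
    if (isUnmergableCall(Callee))
      return false; // explicit num_threads or proc_bind setting
  }
  return true;
}
```

In this model, a plain call such as f_ before the region does not block merging, while a preceding __kmpc_push_num_threads does.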


First of all, https://discourse.llvm.org/c/runtimes/openmp/ would be a better place to ask questions. GitHub issues are more for (suspected) bug reports.

Regarding the question, say we have two parallel regions:

parallel num_threads(4)
{ ... }
parallel
{ ... }

They cannot be merged because we cannot determine how many threads the merged region should use. If we follow the first parallel region, the second can only use 4 threads as well, although it could potentially use more. If we follow the second parallel region instead, we break OpenMP semantics, because the first region might then run with more than 4 threads. As a result, we choose to be conservative here. The same applies to the proc_bind clause.

However, the two parallel regions you showed here could be potentially merged.
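
The num_threads conflict described above can be made concrete with a small model. This is a sketch under simplifying assumptions: each region is reduced to its num_threads request (0 meaning no clause), the runtime default is treated as a known number, and the clause value is treated as the exact team size rather than a request. ParallelRegion, teamSize and mergePreservesTeamSizes are hypothetical names used only for illustration.

```cpp
#include <cassert>

// Hypothetical model of a parallel region: only the num_threads request
// matters here. A value of 0 means the region has no num_threads clause.
struct ParallelRegion {
  int numThreads;
};

// Team size the region would get on its own: the clause value if present,
// otherwise the runtime default (assumed known here for illustration).
int teamSize(const ParallelRegion &R, int DefaultThreads) {
  assert(DefaultThreads > 0 && "default team size must be positive");
  return R.numThreads > 0 ? R.numThreads : DefaultThreads;
}

// A merged region forks one team, so merging keeps both team sizes only if
// the two regions would have used the same size anyway.
bool mergePreservesTeamSizes(const ParallelRegion &A, const ParallelRegion &B,
                             int DefaultThreads) {
  return teamSize(A, DefaultThreads) == teamSize(B, DefaultThreads);
}
```

With a first region requesting 4 threads, a second region without a clause, and a default of 8, no single team size satisfies both regions, so the function returns false. It returns true only when the default happens to be 4, which the compiler generally cannot know at compile time.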

Can you show the IR of the Fortran code? Maybe it's an artifact that we do not merge. Also, merging is off by default, IIRC.

Here is the partial IR that was generated. Perhaps the failure to merge has something to do with Fortran not having function declarations?

define void @MAIN_() local_unnamed_addr #0 !dbg !6 {
L.entry:
  %a_330 = alloca i32, align 4
  %.uplevelArgPack0001_346 = alloca %astruct.dt57, align 8
  %.uplevelArgPack0002_369 = alloca %astruct.dt63, align 8
  %0 = tail call i32 @__kmpc_global_thread_num(ptr null), !dbg !9
  tail call void (...) @fort_init(ptr nonnull @.C310_MAIN_), !dbg !10
  call void (ptr, ...) @f_(ptr nonnull %a_330), !dbg !11
  store ptr %a_330, ptr %.uplevelArgPack0001_346, align 8, !dbg !12, !tbaa !13
  call void (ptr, i32, ptr, ptr, ...) @__kmpc_fork_call(ptr null, i32 1, ptr nonnull @__nv_MAIN__F1L6_1_, ptr nonnull %.uplevelArgPack0001_346), !dbg !12
  store ptr %a_330, ptr %.uplevelArgPack0002_369, align 8, !dbg !17, !tbaa !18
  call void (ptr, i32, ptr, ptr, ...) @__kmpc_fork_call(ptr null, i32 1, ptr nonnull @__nv_MAIN__F1L9_2_, ptr nonnull %.uplevelArgPack0002_369), !dbg !17
  ret void, !dbg !9
}

; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn
define internal void @__nv_MAIN__F1L6_1_(ptr nocapture readnone %__nv_MAIN__F1L6_1Arg0, ptr nocapture readnone %__nv_MAIN__F1L6_1Arg1, ptr nocapture readonly %__nv_MAIN__F1L6_1Arg2) #1 !dbg !20 {
L.entry:
  %0 = load ptr, ptr %__nv_MAIN__F1L6_1Arg2, align 8, !dbg !25, !tbaa !26
  %1 = load i32, ptr %0, align 4, !dbg !25, !tbaa !30
  %2 = add nsw i32 %1, 1, !dbg !25
  store i32 %2, ptr %0, align 4, !dbg !25, !tbaa !30
  ret void, !dbg !32
}

; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn
define internal void @__nv_MAIN__F1L9_2_(ptr nocapture readnone %__nv_MAIN__F1L9_2Arg0, ptr nocapture readnone %__nv_MAIN__F1L9_2Arg1, ptr nocapture readonly %__nv_MAIN__F1L9_2Arg2) #1 !dbg !33 {
L.entry:
  %0 = load ptr, ptr %__nv_MAIN__F1L9_2Arg2, align 8, !dbg !34, !tbaa !35
  %1 = load i32, ptr %0, align 4, !dbg !34, !tbaa !39
  %2 = add nsw i32 %1, 1, !dbg !34
  store i32 %2, ptr %0, align 4, !dbg !34, !tbaa !39
  ret void, !dbg !41
}

declare void @f_(...) local_unnamed_addr #0

declare void @fort_init(...) local_unnamed_addr #0

; Function Attrs: nounwind
declare signext i32 @__kmpc_global_thread_num(ptr) local_unnamed_addr #2

; Function Attrs: nounwind
declare void @__kmpc_fork_call(ptr, i32, ptr, ptr, ...) local_unnamed_addr #2

attributes #0 = { "fp-contract"="fast" }
attributes #1 = { mustprogress nofree norecurse nosync nounwind willreturn "fp-contract"="fast" }
attributes #2 = { nounwind "fp-contract"="fast" }

!llvm.module.flags = !{!0, !1, !2}
!llvm.dbg.cu = !{!3}
!nvvm.annotations = !{}

!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 2, !"Debug Info Version", i32 3}
!2 = !{i32 7, !"openmp", i32 50}

Thank you for your reply. I will ask related questions on the website you provided in the future. I understand your explanation, so the case I provided should still be mergeable, right?

The problem is the store in between the parallel regions: store ptr %a_330, ptr %.uplevelArgPack0002_369, align 8, which we currently do not move out of the way, IIRC.
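
To illustrate what "moving the store out of the way" would mean, here is a toy model of the instruction sequence in MAIN_. It is a sketch only and does not reflect what OpenMPOpt does: instructions are reduced to a kind plus one location name, and a store is assumed safe to hoist above a fork call whenever that fork call receives a different argument pack. A real transformation would need proper alias and side-effect analysis. All names (Inst, hoistIndependentStores, hasAdjacentForkCalls) are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

enum class Kind { Store, ForkCall };

// Toy instruction: for a Store, Location is the memory written; for a
// ForkCall, it is the argument pack passed to the outlined function.
struct Inst {
  Kind kind;
  std::string location;
};

// Move each store upwards past fork calls that use a different location.
// Assumption: such a fork call cannot observe the store.
std::vector<Inst> hoistIndependentStores(std::vector<Inst> Block) {
  for (std::size_t I = 0; I < Block.size(); ++I) {
    if (Block[I].kind != Kind::Store)
      continue;
    std::size_t Pos = I;
    while (Pos > 0 && Block[Pos - 1].kind == Kind::ForkCall &&
           Block[Pos - 1].location != Block[Pos].location) {
      std::swap(Block[Pos - 1], Block[Pos]);
      --Pos;
    }
  }
  return Block;
}

// True if two fork calls sit directly next to each other.
bool hasAdjacentForkCalls(const std::vector<Inst> &Block) {
  for (std::size_t I = 0; I + 1 < Block.size(); ++I)
    if (Block[I].kind == Kind::ForkCall && Block[I + 1].kind == Kind::ForkCall)
      return true;
  return false;
}
```

For the sequence store pack1, fork pack1, store pack2, fork pack2, the model hoists the second store above the first fork call, which leaves the two fork calls adjacent.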

I was going to try it out to confirm, but you did not paste the entire IR, and fixing all the issues that arise is cumbersome.

This is the entire IR:

; ModuleID = '/tmp/mergepr-efc903.ll'
source_filename = "/tmp/mergepr-efc903.ll"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

%astruct.dt57 = type <{ ptr }>
%astruct.dt63 = type <{ ptr }>

@.C310_MAIN_ = internal constant i32 0

define void @MAIN_() local_unnamed_addr #0 !dbg !6 {
L.entry:
  %a_330 = alloca i32, align 4
  %.uplevelArgPack0001_346 = alloca %astruct.dt57, align 8
  %.uplevelArgPack0002_369 = alloca %astruct.dt63, align 8
  %0 = tail call i32 @__kmpc_global_thread_num(ptr null), !dbg !9
  tail call void (...) @fort_init(ptr nonnull @.C310_MAIN_), !dbg !10
  call void (ptr, ...) @f_(ptr nonnull %a_330), !dbg !11
  store ptr %a_330, ptr %.uplevelArgPack0001_346, align 8, !dbg !12, !tbaa !13
  call void (ptr, i32, ptr, ptr, ...) @__kmpc_fork_call(ptr null, i32 1, ptr nonnull @__nv_MAIN__F1L6_1_, ptr nonnull %.uplevelArgPack0001_346), !dbg !12
  store ptr %a_330, ptr %.uplevelArgPack0002_369, align 8, !dbg !17, !tbaa !18
  call void (ptr, i32, ptr, ptr, ...) @__kmpc_fork_call(ptr null, i32 1, ptr nonnull @__nv_MAIN__F1L9_2_, ptr nonnull %.uplevelArgPack0002_369), !dbg !17
  ret void, !dbg !9
}

; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn
define internal void @__nv_MAIN__F1L6_1_(ptr nocapture readnone %__nv_MAIN__F1L6_1Arg0, ptr nocapture readnone %__nv_MAIN__F1L6_1Arg1, ptr nocapture readonly %__nv_MAIN__F1L6_1Arg2) #1 !dbg !20 {
L.entry:
  %0 = load ptr, ptr %__nv_MAIN__F1L6_1Arg2, align 8, !dbg !25, !tbaa !26
  %1 = load i32, ptr %0, align 4, !dbg !25, !tbaa !30
  %2 = add nsw i32 %1, 1, !dbg !25
  store i32 %2, ptr %0, align 4, !dbg !25, !tbaa !30
  ret void, !dbg !32
}

; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn
define internal void @__nv_MAIN__F1L9_2_(ptr nocapture readnone %__nv_MAIN__F1L9_2Arg0, ptr nocapture readnone %__nv_MAIN__F1L9_2Arg1, ptr nocapture readonly %__nv_MAIN__F1L9_2Arg2) #1 !dbg !33 {
L.entry:
  %0 = load ptr, ptr %__nv_MAIN__F1L9_2Arg2, align 8, !dbg !34, !tbaa !35
  %1 = load i32, ptr %0, align 4, !dbg !34, !tbaa !39
  %2 = add nsw i32 %1, 1, !dbg !34
  store i32 %2, ptr %0, align 4, !dbg !34, !tbaa !39
  ret void, !dbg !41
}

declare void @f_(...) local_unnamed_addr #0

declare void @fort_init(...) local_unnamed_addr #0

; Function Attrs: nounwind
declare signext i32 @__kmpc_global_thread_num(ptr) local_unnamed_addr #2

; Function Attrs: nounwind
declare void @__kmpc_fork_call(ptr, i32, ptr, ptr, ...) local_unnamed_addr #2

attributes #0 = { "fp-contract"="fast" }
attributes #1 = { mustprogress nofree norecurse nosync nounwind willreturn "fp-contract"="fast" }
attributes #2 = { nounwind "fp-contract"="fast" }

!llvm.module.flags = !{!0, !1, !2}
!llvm.dbg.cu = !{!3}
!nvvm.annotations = !{}

!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 2, !"Debug Info Version", i32 3}
!2 = !{i32 7, !"openmp", i32 50}
!3 = distinct !DICompileUnit(language: DW_LANG_Fortran90, file: !4, producer: " F90 Flang - 1.5 2017-05-01", isOptimized: true, flags: "'+flang mergepr.f90 -fopenmp -O2 -S -emit-llvm -Rpass=openmp-opt -mllvm -openmp-opt-enable-merging'", runtimeVersion: 0, emissionKind: FullDebug, enums: !5, retainedTypes: !5, globals: !5, imports: !5, nameTableKind: None)
!4 = !DIFile(filename: "mergepr.f90", directory: "/home/xieyihui/work15/f_openmp")
!5 = !{}
!6 = distinct !DISubprogram(name: "main", scope: !3, file: !4, line: 2, type: !7, scopeLine: 2, spFlags: DISPFlagDefinition | DISPFlagOptimized | DISPFlagMainSubprogram, unit: !3, retainedNodes: !5)
!7 = !DISubroutineType(cc: DW_CC_program, types: !8)
!8 = !{null}
!9 = !DILocation(line: 12, column: 1, scope: !6)
!10 = !DILocation(line: 2, column: 1, scope: !6)
!11 = !DILocation(line: 5, column: 1, scope: !6)
!12 = !DILocation(line: 6, column: 1, scope: !6)
!13 = !{!14, !14, i64 0}
!14 = !{!"t1.3", !15, i64 0}
!15 = !{!"unlimited ptr", !16, i64 0}
!16 = !{!"Flang FAA 1"}
!17 = !DILocation(line: 9, column: 1, scope: !6)
!18 = !{!19, !19, i64 0}
!19 = !{!"t1.5", !15, i64 0}
!20 = distinct !DISubprogram(name: "__nv_MAIN__F1L6_1", scope: !3, file: !4, line: 6, type: !21, scopeLine: 6, spFlags: DISPFlagLocalToUnit | DISPFlagDefinition | DISPFlagOptimized, unit: !3, retainedNodes: !5)
!21 = !DISubroutineType(types: !22)
!22 = !{null, !23, !24, !24}
!23 = !DIBasicType(name: "integer", size: 32, align: 32, encoding: DW_ATE_signed)
!24 = !DIBasicType(name: "integer*8", size: 64, align: 64, encoding: DW_ATE_signed)
!25 = !DILocation(line: 7, column: 1, scope: !20)
!26 = !{!27, !27, i64 0}
!27 = !{!"t2.3", !28, i64 0}
!28 = !{!"unlimited ptr", !29, i64 0}
!29 = !{!"Flang FAA 2"}
!30 = !{!31, !31, i64 0}
!31 = !{!"t2.5", !28, i64 0}
!32 = !DILocation(line: 8, column: 1, scope: !20)
!33 = distinct !DISubprogram(name: "__nv_MAIN__F1L9_2", scope: !3, file: !4, line: 9, type: !21, scopeLine: 9, spFlags: DISPFlagLocalToUnit | DISPFlagDefinition | DISPFlagOptimized, unit: !3, retainedNodes: !5)
!34 = !DILocation(line: 10, column: 1, scope: !33)
!35 = !{!36, !36, i64 0}
!36 = !{!"t3.3", !37, i64 0}
!37 = !{!"unlimited ptr", !38, i64 0}
!38 = !{!"Flang FAA 3"}
!39 = !{!40, !40, i64 0}
!40 = !{!"t3.5", !37, i64 0}
!41 = !DILocation(line: 11, column: 1, scope: !33)

Note: The issue posted by @Ehu1 comes from Classic Flang. But the same issue is present with llvm/flang as well.

Source

subroutine sb
  integer :: x, y
  !$omp parallel
    print *, x
  !$omp end parallel
  !$omp parallel
    print *, y
  !$omp end parallel
end subroutine

LLVM IR

; ModuleID = 'FIRModule'
source_filename = "FIRModule"
target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-unknown-linux-gnu"

%struct.ident_t = type { i32, i32, i32, i32, ptr }

$_QQcl.0c0b6aeb747cfadba573ef6dd9ebe4a5 = comdat any

@_QQcl.0c0b6aeb747cfadba573ef6dd9ebe4a5 = linkonce constant [49 x i8] c"/home/kircha02/llvm-project/build/double_par.f90\00", comdat
@0 = private unnamed_addr constant [23 x i8] c";unknown;unknown;0;0;;\00", align 1
@1 = private unnamed_addr constant %struct.ident_t { i32 0, i32 2, i32 0, i32 22, ptr @0 }, align 8

; Function Attrs: nounwind
define void @sb_() local_unnamed_addr #0 {
entry:
  %structArg16 = alloca { ptr }, align 8
  %structArg = alloca { ptr }, align 8
  %0 = alloca i32, align 4
  %1 = alloca i32, align 4
  %omp_global_thread_num2 = tail call i32 @__kmpc_global_thread_num(ptr nonnull @1)
  store ptr %0, ptr %structArg, align 8
  call void (ptr, i32, ptr, ...) @__kmpc_fork_call(ptr nonnull @1, i32 1, ptr nonnull @sb_..omp_par, ptr nonnull %structArg)
  store ptr %1, ptr %structArg16, align 8
  call void (ptr, i32, ptr, ...) @__kmpc_fork_call(ptr nonnull @1, i32 1, ptr nonnull @sb_..omp_par.1, ptr nonnull %structArg16)
  ret void
}

; Function Attrs: norecurse nounwind
define internal void @sb_..omp_par.1(ptr noalias nocapture readnone %tid.addr3, ptr noalias nocapture readnone %zero.addr4, ptr nocapture readonly %0) #1 {
omp.par.entry5:
  %loadgep_ = load ptr, ptr %0, align 8
  %1 = tail call ptr @_FortranAioBeginExternalListOutput(i32 -1, ptr nonnull @_QQcl.0c0b6aeb747cfadba573ef6dd9ebe4a5, i32 7) #0
  %2 = load i32, ptr %loadgep_, align 4, !tbaa !4
  %3 = tail call i1 @_FortranAioOutputInteger32(ptr %1, i32 %2) #0
  %4 = tail call i32 @_FortranAioEndIoStatement(ptr %1) #0
  ret void
}

; Function Attrs: norecurse nounwind
define internal void @sb_..omp_par(ptr noalias nocapture readnone %tid.addr, ptr noalias nocapture readnone %zero.addr, ptr nocapture readonly %0) #1 {
omp.par.entry:
  %loadgep_ = load ptr, ptr %0, align 8
  %1 = tail call ptr @_FortranAioBeginExternalListOutput(i32 -1, ptr nonnull @_QQcl.0c0b6aeb747cfadba573ef6dd9ebe4a5, i32 4) #0
  %2 = load i32, ptr %loadgep_, align 4, !tbaa !4
  %3 = tail call i1 @_FortranAioOutputInteger32(ptr %1, i32 %2) #0
  %4 = tail call i32 @_FortranAioEndIoStatement(ptr %1) #0
  ret void
}

declare ptr @_FortranAioBeginExternalListOutput(i32, ptr, i32) local_unnamed_addr

declare zeroext i1 @_FortranAioOutputInteger32(ptr, i32) local_unnamed_addr

declare i32 @_FortranAioEndIoStatement(ptr) local_unnamed_addr

; Function Attrs: nounwind
declare i32 @__kmpc_global_thread_num(ptr) local_unnamed_addr #0

; Function Attrs: nounwind
declare !callback !8 void @__kmpc_fork_call(ptr, i32, ptr, ...) local_unnamed_addr #0

attributes #0 = { nounwind }
attributes #1 = { norecurse nounwind }

!llvm.module.flags = !{!0, !1, !2, !3}

!0 = !{i32 2, !"Debug Info Version", i32 3}
!1 = !{i32 7, !"openmp", i32 11}
!2 = !{i32 8, !"PIC Level", i32 2}
!3 = !{i32 7, !"PIE Level", i32 2}
!4 = !{!5, !5, i64 0}
!5 = !{!"any data access", !6, i64 0}
!6 = !{!"any access", !7, i64 0}
!7 = !{!"Flang Type TBAA Root"}
!8 = !{!9}
!9 = !{i64 2, i64 -1, i64 -1, i1 true}

llvm / llvm-project

Questions about preventing the merging of OpenMP parallel regions. #63136