OpenMP Task Implementation

eZWALT commented 5 months ago

I'm opening this issue to consult some implementation details with the community before proceding with a possible pull request. I will add another class called CIRClauseProcessor to avoid repetition of clause treatment inside the OpenMP statements code generation and also some additional methods to generate the body to also avoid redundancy. I've gathered some inspiration by how OpenMP is being implemented in flang, given that they also use MLIR.

Furthermore, I have a local version that implements scopes and untied/mergeable clauses, but one important issue that I should address is that ClangIR has no support for integers of bit width 1, and I pretty much need them for some clauses inside of omp task such as final, if ...

mlir::omp::TaskOp operands and attributes:

I've tried some workarounds, such as one below, but still produces compilation errors due to this type not being supported on ClangIR.

  //auto uint1Ty = mlir::cir::IntType::get(builder.getContext(),/*Bit Width*/ 1,/*Signed?*/false );

The error i've been experiencing is the following (Obviously ClangIR only supports a bit width of 8,16,32...)

If i try to use mlir::IntegerType of bit width 1, I get the following error due to "cir.const" operation not supporting this type

So the question that is to be determined is: Should I implement support for 1 bit integers (I don't think this is a good approach though) or should I use mlir.const operations instead of cir.const (This may be problematic too, due to the mixture of mlir and cir). I'm not completely sure on how to proceed, so your experience and advice would be of great help, thank you!

bcardosolopes commented 5 months ago

Hi @eZWALT, these are great questions, since it exposes the challenges of mixing dialects. Can you write down what your input source file looks like and how you expect the dialect mix output to look like? (Or point out the options in this imaginary example?). It might be easier to discuss based on an actual mockup.

eZWALT commented 5 months ago

My apologies for not explaining with enough detail, I was facing errors due to using constants in compile-time, but now I'm facing issues related to the build operations returning CIR types instead of MLIR types .This is the basic processing of the clauses to get more context (Now outdated since I've built a CIRClauseProcessor class to avoid redundant code, you will get to see it in the PR):


if (S.hasClausesOfKind<OMPUntiedClause>())
    untiedAttr = builder.getUnitAttr();
if (S.hasClausesOfKind<OMPMergeableClause>())
    mergeableAttr = builder.getUnitAttr();

// Evaluates and set operands: the priority, if, and final clauses
if (S.hasClausesOfKind<OMPFinalClause>()) {
    const OMPFinalClause *finalClause = S.getSingleClause<OMPFinalClause>();
    const clang::Expr *condExpr = finalClause->getCondition();
    finalOperand = evaluateExprAsBool(condExpr);
}

if (S.hasClausesOfKind<OMPIfClause>()) {
    const OMPIfClause *ifClause = S.getSingleClause<OMPIfClause>();
    const clang::Expr *condExpr = ifClause->getCondition();
    ifOperand = evaluateExprAsBool(condExpr);
}

if (S.hasClausesOfKind<OMPPriorityClause>()) {
    const OMPPriorityClause *priorityClause = S.getSingleClause<OMPPriorityClause>();
    const clang::Expr *priorityExpr = priorityClause->getPriority();
    priorityOperand = buildScalarExpr(priorityExpr);
}

I'm reusing the already written code that generates the boolean and scalar expressions, such as buildScalarExpr and evaluateExprAsBool , but I'm facing troubles with the operands' priority, if and final, since this must use mlir types and not cir types. The approach that I thought that can solve this problem would be to add a new conversion operation (I'm well aware after reading the codegen classes in further detail, that I could achieve this using a modified version of buildScalarConversion, given that it only supports conversion of Clang QualType's).

So I propose, for the sake of simplicity, extending or duplicating buildScalarConversion to support MLIR types. I've seen that Flang follows a similar approach, they have a conversion operation too, to cast these FIR types into MLIR, so that me think that this approach may be feasible. However, this has a downside, the instruction overhead, a new redundant instruction is created.

I lack sufficient experience with MLIR and familiarity with ClangIR as a project to confidently choose this direction. I'm open to exploring other approaches that might be equally effective or even better suited to address my issue. I would greatly appreciate hearing the opinions of @fabianmcg and @kiranchandramohan on this matter. Thank you to everyone for your input!"

bcardosolopes commented 5 months ago

The approach that I thought that can solve this problem would be to add a new conversion operation

That's a possibility, are you aware of https://mlir.llvm.org/docs/Dialects/Builtin/#builtinunrealized_conversion_cast-unrealizedconversioncastop ?

I lack sufficient experience with MLIR and familiarity with ClangIR as a project to confidently choose this direction

Can you paste the original source file you are trying to build (I'm assuming it's a very basic one) and write down what CIR do you imagine being generated? I think I can provide more design help or insights if I can look at that.

eZWALT commented 4 months ago

I wasn't aware of this unrealized conversion, now after implementing it and testing it I can say that this is far more practical and cleaner than building a new conversion operation from scratch. Now, I will put forward the original source file that I'm using as a basic example for testing the final clause (Remember that the treatment of final, if and priority clauses is symmetrical):


int main(){
        int b = 3;
        int a = -3;
        #pragma omp parallel 
        {
                int condition = (a+b) == 0;
                #pragma omp task final(condition)
                {
                        int a = 33;
                        int * c = &a;
                }
        }
}

The output after compiling this file with the basic flags (-fopenmp -fclangir-enable -emit-cir) is the following (which is almost the same that I imagined on my head when writing the code generation):

  cir.func no_proto @main() -> !s32i extra(#fn_attr) {
    %0 = cir.alloca !s32i, cir.ptr <!s32i>, ["__retval"] {alignment = 4 : i64} loc(#loc2)
    %1 = cir.alloca !s32i, cir.ptr <!s32i>, ["b", init] {alignment = 4 : i64} loc(#loc23)
    %2 = cir.alloca !s32i, cir.ptr <!s32i>, ["a", init] {alignment = 4 : i64} loc(#loc24)
    %3 = cir.const(#cir.int<3> : !s32i) : !s32i loc(#loc4)
    cir.store %3, %1 : !s32i, cir.ptr <!s32i> loc(#loc23)
    %4 = cir.const(#cir.int<3> : !s32i) : !s32i loc(#loc6)
    %5 = cir.unary(minus, %4) : !s32i, !s32i loc(#loc7)
    cir.store %5, %2 : !s32i, cir.ptr <!s32i> loc(#loc24)
    omp.parallel {
      cir.scope {
        %7 = cir.alloca !s32i, cir.ptr <!s32i>, ["condition", init] {alignment = 4 : i64} loc(#loc26)
        %8 = cir.load %2 : cir.ptr <!s32i>, !s32i loc(#loc12)
        %9 = cir.load %1 : cir.ptr <!s32i>, !s32i loc(#loc13)
        %10 = cir.binop(add, %8, %9) : !s32i loc(#loc27)
        %11 = cir.const(#cir.int<0> : !s32i) : !s32i loc(#loc11)
        %12 = cir.cmp(eq, %10, %11) : !s32i, !s32i loc(#loc28)
        cir.store %12, %7 : !s32i, cir.ptr <!s32i> loc(#loc26)
        %13 = cir.load %7 : cir.ptr <!s32i>, !s32i loc(#loc15)
        %14 = cir.cast(int_to_bool, %13 : !s32i), !cir.bool loc(#loc15)
        %15 = builtin.unrealized_conversion_cast %14 : !cir.bool to i1 loc(#loc29)
        omp.task final(%15) {
          cir.scope {
            %16 = cir.alloca !s32i, cir.ptr <!s32i>, ["a", init] {alignment = 4 : i64} loc(#loc30)
            %17 = cir.alloca !cir.ptr<!s32i>, cir.ptr <!cir.ptr<!s32i>>, ["c", init] {alignment = 8 : i64} loc(#loc31)
            %18 = cir.const(#cir.int<33> : !s32i) : !s32i loc(#loc19)
            cir.store %18, %16 : !s32i, cir.ptr <!s32i> loc(#loc30)
            cir.store %16, %17 : !cir.ptr<!s32i>, cir.ptr <!cir.ptr<!s32i>> loc(#loc31)
          } loc(#loc29)
          omp.terminator loc(#loc17)
        } loc(#loc29)
      } loc(#loc25)
      omp.terminator loc(#loc9)
    } loc(#loc25)
    %6 = cir.load %0 : cir.ptr <!s32i>, !s32i loc(#loc2)
    cir.return %6 : !s32i loc(#loc2)
  } loc(#loc22)
} loc(#loc)

Now the only issue is that if I want to compile to a backend, for example to LLVM, I get a crash (Makes sense given that I haven't implemented how this conversion is actually performed) like the following:

clang-19 -fopenmp -fclangir-enable  -S -emit-llvm 9_task_final.c -o -

loc(fused["9_task_final.c":8:3, "9_task_final.c":8:36]): error: failed to legalize operation 'builtin.unrealized_conversion_cast' that was explicitly marked illegal
fatal error: error in backend: The pass manager failed to lower CIR to LLVMIR dialect!

So my right now I'm kind of clueless of how could I implement this lowering, or maybe I misunderstood how unrealized_conversion_cast works and it serves as a placeholder. Thanks for reading and I'm eager to hear your feedback!

bcardosolopes commented 4 months ago

Thanks for pasting the example, I think it makes sense too.

So my right now I'm kind of clueless of how could I implement this lowering, or maybe I misunderstood how unrealized_conversion_cast works and it serves as a placeholder.

It's possible that this requires adding something like target->addLegalOp<UnrealizedConversionCastOp>();, as I see others doing (e.g. MathToSPIRVPass.cpp, etc). Once that's done, my expectation is that this op will become legal and will show up in the final LLVM IR dialect, with all types in terms of LLVM, and later folded away.

eZWALT commented 4 months ago

Great, then I will try it out today later on. Now moving up to dependency management inside the task directive, I need some way to get the address of a local/global variable inside a mlir::Value, in order to process dependencies between tasks. Being more specific, I require a way to pragma omp task depend(in: X) depend(out: Y) get the addresses of variables X and Y. I' aware that CIR has this operation "get_global" (https://llvm.github.io/clangir/Dialect/ops.html#cirget_global-cirgetglobalop) which returns the address that points to the specified symbol. To complete dependency management, I would need to perform a similar operation. I'm saying this because I guess that the static information that I can get from the AST DeclRefExpr is not enough. Below, I've attached the lambda function that I'm trying to build to get both the type and address of the depend clause:

    [&](const clang::OMPDependClause* clause){
      //Get the depend type
      mlir::omp::ClauseTaskDependAttr dependType = getDependKindAttr(
        this->builder, clause
      )
      //Get an mlir value of the address of the depend variable
      const mlir::Value variable = builder.create<cir::get_local> ???
      result.dependVars.append(variable);
      result.dependTypeAttrs.append(dependType);
    }

So my question is, how could I achieve this? Any ideas are welcome!

bcardosolopes commented 4 months ago

I require a way to pragma omp task depend(in: X) depend(out: Y) get the addresses of variables X and Y.

Are this required to be globals? or could it be address coming from local variables?

I suggest you write code in C/C++ and check how we currently do things in lib/CIR/CodeGen by running a simple test under the debugger.

Btw, this is an example of a global lowered to CIR: https://godbolt.org/z/6rs83saes. You could also grep the codebase for the operation name to see how it's usually constructed (.e.g git grep GetGlobalOp)

eZWALT commented 4 months ago

No, these variables can be both local or global, so that's why I was asking for a way to achieve the same as the GetGlobalOp for local variables or any kind of variable. But if there isn't a straightforward way to do this with operations actually, I will look up the actual implementation of GetGlobalOp, thanks !

bcardosolopes commented 4 months ago

Local variables already have their value produced by cir.alloca and/or a series of casts that might get in the middle. Can you point me to the OG LLVM code you are trying to replicate here? Depending on what gets called you could call the same method and get transparently solved by other pre-existing CIRGen component

eZWALT commented 4 months ago

Okay, so take for instance this basic sample of task dependency usage:

int main(){
    int x = 5;
    int y = 413;

    #pragma omp parallel 
    {
        #pragma omp task depend(out: x,y)
        {
            ++x;
        }
        #pragma omp task depend(in: x,y)
        {
            --x;
        }
        int * c = &x;
    }
}

Now, looking at the outlined main of its LLVM counterpart after compiling:

; Function Attrs: noinline norecurse nounwind optnone uwtable
define internal void @main.omp_outlined(ptr noalias noundef %.global_tid., ptr noalias noundef %.bound_tid., ptr noundef nonnull align 4 dereferenceable(4) %x, ptr noundef nonnull align 4 dereferenceable(4) %y) #1 {
entry:
  %.global_tid..addr = alloca ptr, align 8
  %.bound_tid..addr = alloca ptr, align 8
  %x.addr = alloca ptr, align 8
  %y.addr = alloca ptr, align 8
  %agg.captured = alloca %struct.anon, align 8
  %.dep.arr.addr = alloca [2 x %struct.kmp_depend_info], align 8
  %dep.counter.addr = alloca i64, align 8
  %agg.captured1 = alloca %struct.anon.0, align 8
  %.dep.arr.addr2 = alloca [2 x %struct.kmp_depend_info], align 8
  %dep.counter.addr3 = alloca i64, align 8
  %c = alloca ptr, align 8
  store ptr %.global_tid., ptr %.global_tid..addr, align 8
  store ptr %.bound_tid., ptr %.bound_tid..addr, align 8
  store ptr %x, ptr %x.addr, align 8
  store ptr %y, ptr %y.addr, align 8
  %0 = load ptr, ptr %x.addr, align 8
  %1 = load ptr, ptr %y.addr, align 8
  %2 = getelementptr inbounds %struct.anon, ptr %agg.captured, i32 0, i32 0
  store ptr %0, ptr %2, align 8
  %3 = load ptr, ptr %.global_tid..addr, align 8
  %4 = load i32, ptr %3, align 4
  %5 = call ptr @__kmpc_omp_task_alloc(ptr @1, i32 %4, i32 1, i64 40, i64 8, ptr @.omp_task_entry.)
  %6 = getelementptr inbounds %struct.kmp_task_t_with_privates, ptr %5, i32 0, i32 0
  %7 = getelementptr inbounds %struct.kmp_task_t, ptr %6, i32 0, i32 0
  %8 = load ptr, ptr %7, align 8
  call void @llvm.memcpy.p0.p0.i64(ptr align 8 %8, ptr align 8 %agg.captured, i64 8, i1 false)
  %9 = getelementptr inbounds [2 x %struct.kmp_depend_info], ptr %.dep.arr.addr, i64 0, i64 0
  %10 = ptrtoint ptr %0 to i64
  %11 = getelementptr %struct.kmp_depend_info, ptr %9, i64 0
  %12 = getelementptr inbounds %struct.kmp_depend_info, ptr %11, i32 0, i32 0
  store i64 %10, ptr %12, align 8
  %13 = getelementptr inbounds %struct.kmp_depend_info, ptr %11, i32 0, i32 1
  store i64 4, ptr %13, align 8
  %14 = getelementptr inbounds %struct.kmp_depend_info, ptr %11, i32 0, i32 2
  store i8 3, ptr %14, align 8
  %15 = ptrtoint ptr %1 to i64
  %16 = getelementptr %struct.kmp_depend_info, ptr %9, i64 1
  %17 = getelementptr inbounds %struct.kmp_depend_info, ptr %16, i32 0, i32 0
  store i64 %15, ptr %17, align 8
  %18 = getelementptr inbounds %struct.kmp_depend_info, ptr %16, i32 0, i32 1
  store i64 4, ptr %18, align 8
  %19 = getelementptr inbounds %struct.kmp_depend_info, ptr %16, i32 0, i32 2
  store i8 3, ptr %19, align 8
  store i64 2, ptr %dep.counter.addr, align 8
  %20 = call i32 @__kmpc_omp_task_with_deps(ptr @1, i32 %4, ptr %5, i32 2, ptr %9, i32 0, ptr null)
  %21 = getelementptr inbounds %struct.anon.0, ptr %agg.captured1, i32 0, i32 0
  store ptr %0, ptr %21, align 8
  %22 = call ptr @__kmpc_omp_task_alloc(ptr @1, i32 %4, i32 1, i64 40, i64 8, ptr @.omp_task_entry..2)
  %23 = getelementptr inbounds %struct.kmp_task_t_with_privates.1, ptr %22, i32 0, i32 0
  %24 = getelementptr inbounds %struct.kmp_task_t, ptr %23, i32 0, i32 0
  %25 = load ptr, ptr %24, align 8
  call void @llvm.memcpy.p0.p0.i64(ptr align 8 %25, ptr align 8 %agg.captured1, i64 8, i1 false)
  %26 = getelementptr inbounds [2 x %struct.kmp_depend_info], ptr %.dep.arr.addr2, i64 0, i64 0
  %27 = ptrtoint ptr %0 to i64
  %28 = getelementptr %struct.kmp_depend_info, ptr %26, i64 0
  %29 = getelementptr inbounds %struct.kmp_depend_info, ptr %28, i32 0, i32 0
  store i64 %27, ptr %29, align 8
  %30 = getelementptr inbounds %struct.kmp_depend_info, ptr %28, i32 0, i32 1
  store i64 4, ptr %30, align 8
  %31 = getelementptr inbounds %struct.kmp_depend_info, ptr %28, i32 0, i32 2
  store i8 1, ptr %31, align 8
  %32 = ptrtoint ptr %1 to i64
  %33 = getelementptr %struct.kmp_depend_info, ptr %26, i64 1
  %34 = getelementptr inbounds %struct.kmp_depend_info, ptr %33, i32 0, i32 0
  store i64 %32, ptr %34, align 8
  %35 = getelementptr inbounds %struct.kmp_depend_info, ptr %33, i32 0, i32 1
  store i64 4, ptr %35, align 8
  %36 = getelementptr inbounds %struct.kmp_depend_info, ptr %33, i32 0, i32 2
  store i8 1, ptr %36, align 8
  store i64 2, ptr %dep.counter.addr3, align 8
  %37 = call i32 @__kmpc_omp_task_with_deps(ptr @1, i32 %4, ptr %22, i32 2, ptr %26, i32 0, ptr null)
  store ptr %0, ptr %c, align 8
  ret void
}

Despite the differences with original Clang code generation (The capture list variables addresses are placed inside an struct which will be passed down to the openmp calls), I want to do something similar with MLIR. Basically, I need to obtain the addresses of both X,Y variables. After looking at the Clang AST produced (Compiler Explorer AST ), I've concluded that I should do a "load address"-like operation (&x), which I think it's LValue expr in C++ parlance (Correct me if I'm wrong, probably I'm and its far more complex).

Looking at the Code Gen functions that are currently supported in CIR (consulting CIRGenFunction.h), I was looking to a load operation and came to the conclusion that a combination of buildLValue and some other operation like buildLoadOfLValue that returns an mlir::Value of the address of the LValue. I'm aware that the last operation i've mentioned actually returns an RValue. Maybe I could statically assign the value of lvalue.getAddress() to an mlir value, but I don't think that would be correct. I'm probably missing something because I'm overcomplicating a simple address access.

Thank you and sorry for the long message!

eZWALT commented 3 months ago

@bcardosolopes I wanted to give updates on the local changes that I've got surrounding the implementation of the task directive. There were 2 issues stoping me from uploading a pull request:

-1. Uncertainty on how to convert CIR types into MLIR basic types (Such as the cir.bool (i8) -> i1 conversion). I've managed to solve this. While it is true that unrealized cast conversions helped out, specially when the conversion was something along the lines of the cir.int -> i32 (In the priority clause), it wasn't enough for the boolean conversions, so after researching MLIR passes/conversions and understanding the CIR lowering codebase, I've implemented a workaround for this using the Arith dialect. This dialect has integer operations such as truncIOP (Truncate integer) which basically transforms one type (i8) to a smaller type (i1). This was added to the Direct CIR->LLVMIR workflow, therefore adding to the pass the translation of Arith-LLVMIR and a Rewrite pattern to the UnrealizedConversionCast operation, if the length of the integers mismatch then this arith operation is added. Maybe this isn't the most suited approach, but I've found out that this reduces the burden of adding a completely new operation such as cir.truncINT... For my thesis I will present this approach, but for the upstreaming of task we can discuss which of the 2 approaches can be more beneficial in the long-term for CIR.

-2. The implementation of the DependClause, which I'm stuck due to not knowing the actual type of the depend vars. MLIR OMP documentation does not provide extensive explanation on the types of the parameters, so I'm kind of confused. Right now what I'm doing is performing a buildLValue + buildLoadOfScalar from the DeclRefExpr's that can be found on the AST of Depend clauses, but somehow I get this error:

error: 'omp.task' op operand #0 must be variadic of OpenMP-compatible variable type, but got 'i32'

So next week I will be uploading some PR's, starting with a simple omp.task with no clauses, and after that I will be uploading the processing of the clauses incrementally. I still need some help on the depend clause, so If someone with experience in MLIR types and conversions can help me out, I would appreciate it!

kiranchandramohan commented 3 months ago

-2. The implementation of the DependClause, which I'm stuck due to not knowing the actual type of the depend vars. MLIR OMP documentation does not provide extensive explanation on the types of the parameters, so I'm kind of confused. Right now what I'm doing is performing a buildLValue + buildLoadOfScalar from the DeclRefExpr's that can be found on the AST of Depend clauses, but somehow I get this error:

At the moment the representation for depend allows any Pointer like type. This can be addded to the relevant types for variables and array sections in ClangIR. It will be good to know if this will not be sufficient and a token like mechanism is required.

                       Variadic<OpenMP_PointerLikeType>:$depend_vars,

https://github.com/llvm/llvm-project/blob/f50656c50938781fdfdabe05c9225783ce7e409d/flang/lib/Optimizer/Dialect/FIRType.cpp#L1355 https://github.com/llvm/llvm-project/blob/f50656c50938781fdfdabe05c9225783ce7e409d/flang/include/flang/Tools/PointerModels.h#L19

@bhandarkar-pranav is currently working on task and target dependencies. He might be able to help you.

eZWALT commented 3 months ago

-2. The implementation of the DependClause, which I'm stuck due to not knowing the actual type of the depend vars. MLIR OMP documentation does not provide extensive explanation on the types of the parameters, so I'm kind of confused. Right now what I'm doing is performing a buildLValue + buildLoadOfScalar from the DeclRefExpr's that can be found on the AST of Depend clauses, but somehow I get this error:

At the moment the representation for depend allows any Pointer like type. This can be addded to the relevant types for variables and array sections in ClangIR. It will be good to know if this will not be sufficient and a token like mechanism is required.
                       Variadic<OpenMP_PointerLikeType>:$depend_vars,
https://github.com/llvm/llvm-project/blob/f50656c50938781fdfdabe05c9225783ce7e409d/flang/lib/Optimizer/Dialect/FIRType.cpp#L1355 https://github.com/llvm/llvm-project/blob/f50656c50938781fdfdabe05c9225783ce7e409d/flang/include/flang/Tools/PointerModels.h#L19

@bhandarkar-pranav is currently working on task and target dependencies. He might be able to help you.

Thank you, Kiran, So I assume this is an interface to a type similar to LLVM::LLVMPointerType or maybe MemRefType right?

@bhandarkar-pranav if you feel like contributing could you please contact me at walter.jose.troiani@estudiantat.upc.edu so we can talk about this matter?

kiranchandramohan commented 3 months ago

Thank you, Kiran, So I assume this is an interface to a type similar to LLVM::LLVMPointerType or maybe MemRefType right?

OpenMP_PointerLikeType is kind of an interface. We have to implement that interface for the types we are interested in. Basically in this case, there should be a function that provides getElementType.

We add these to Memref and LLVMPointerTypes in the OpenMPDialect files. For FIR reference/pointer types we add that in the flang codebase. https://github.com/llvm/llvm-project/blob/8ab3f8ae0d39048e4bc1198514049813c6765fb6/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp#L60 https://github.com/llvm/llvm-project/blob/8ab3f8ae0d39048e4bc1198514049813c6765fb6/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp#L52

llvm / clangir

OpenMP Task Implementation #525