By the way, at first I thought the issue was caused by the integer type, so I tried both Int32 and Int64 throughout the whole cuda_dg_3d.jl file (no change in other files), but it still caused the same errors.
Yeah, that can be problematic. If I understand this correctly, you are looping over all boundaries in one call. However, we allow having different boundary conditions. For example, we could use one function for the two boundaries in the x direction and another function for the two boundaries in the y direction. In the kernel, you compute the direction, so the compiler just knows that it will be an Int. I assume it works in 1D and 2D since the compiler can again use union splitting and its heuristics, but it fails in 3D since the assignment boundary_condition = boundary_conditions[direction] is a call to the type-unstable function getindex for a NamedTuple with heterogeneous types.
The only way I see to make this type-stable is to split it into multiple kernel calls - one for each ~boundary~ direction (-x, +x, -y, +y, -z, +z in 3D, encoded as direction = 1, direction = 2, ..., direction = 6).
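For illustration, here is a minimal CPU-side sketch of the instability described above. All names are made up; only the structure mirrors the real setup, namely a NamedTuple that mixes two different boundary-condition types and is indexed with a runtime Int.

using InteractiveUtils  # for @code_warntype

bc_demo(u, x, t) = zero(u)                    # hypothetical non-periodic condition
struct PeriodicDemo end                       # stand-in for the periodic singleton

boundary_conditions = (x_neg = bc_demo, x_pos = bc_demo,
                       y_neg = PeriodicDemo(), y_pos = PeriodicDemo(),
                       z_neg = PeriodicDemo(), z_pos = PeriodicDemo())

# The result type depends on the *value* of `direction`, which is only known to be an Int,
# so inference can only return a Union of the entry types.
get_bc(bcs, direction::Int) = bcs[direction]

@code_warntype get_bc(boundary_conditions, 3)  # non-concrete (Union) return type

On the CPU such a small Union can still be handled via union splitting; on the GPU this kind of uncertainty is what leads to the reported dynamic invocation.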
Wow, thank you for replying to me on the weekend! Sorry for the late reply, as I stopped checking the repo over the weekend.
However, we allow having different boundary conditions. For example, we could use one function for the two boundaries in x direction and another function for the two boundaries in y direction.
Yes, that is why I call boundary_condition = boundary_conditions[direction] inside the CUDA kernel.
The only way to make this type-stable I see is to split it into multiple kernel calls - one for each boundary.
Do you mean creating a kernel for each boundary (i.e., the boundary variable from here https://github.com/huiyuxie/Trixi.jl/blob/fdccbb17aaddd31a935ee7560fd4ea506a6d93ad/src/solvers/dgsem_tree/dg_3d.jl#L810)? But that is too much and would definitely cost much more time. Is there any other way to make it into one call?
Also, I am a little bit confused about
In the kernel, you compute the direction, so the compiler just knows that it will be an Int.
In the original implementation, direction is also of Int type (https://github.com/huiyuxie/Trixi.jl/blob/fdccbb17aaddd31a935ee7560fd4ea506a6d93ad/src/solvers/dgsem_tree/dg_3d.jl#L781-L799). There is no difference between the CPU code and the GPU code with regard to the type of direction. What other type could direction be? Thanks!
What I meant is that direction is just known to be an Int everywhere - but boundary_conditions is a possibly heterogeneous tuple, so accessing boundary_condition = boundary_conditions[direction] is not type-stable if you only know that direction is an Int. You would need to know the value of direction to know the type of boundary_conditions[direction].
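Continuing the hypothetical sketch from above, the same access becomes type-stable once the value of direction is encoded in the types, e.g. as a literal index or via Val:

get_bc_first(bcs) = bcs[1]                                   # literal index
get_bc_val(bcs, ::Val{direction}) where {direction} = bcs[direction]

@code_warntype get_bc_first(boundary_conditions)             # concrete return type
@code_warntype get_bc_val(boundary_conditions, Val(3))       # concrete: direction is a type parameter

This is essentially what splitting into one kernel call per direction achieves: each call specializes on one concrete boundary condition.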
Do you mean creating a kernel for each boundary (i.e., the boundary variable from here https://github.com/huiyuxie/Trixi.jl/blob/fdccbb17aaddd31a935ee7560fd4ea506a6d93ad/src/solvers/dgsem_tree/dg_3d.jl#L810)? But that is too much and would definitely cost much more time. Is there any other way to make it into one call?
No, I was thinking more about the same splitting that we do in
https://github.com/huiyuxie/Trixi.jl/blob/fdccbb17aaddd31a935ee7560fd4ea506a6d93ad/src/solvers/dgsem_tree/dg_3d.jl#L781
There, we call one "kernel" for each direction. Sorry for not being clear in the post above.
No worries, and thanks for the clarification!
I guess what you meant is splitting the kernels based on the different directions (two kernels for 1D, four for 2D, and six for 3D), which was exactly what I wanted to implement at first. But I think making them into one kernel call should give better performance. If there is no alternative, I will go ahead with it, even though I don't like implementing it this way.
I guess what you meant is splitting the kernels based on the different directions (two kernels for 1D, four for 2D, and six for 3D), which was exactly what I wanted to implement at first.
Yes, exactly!
Update: some benchmark results for both the original (i.e., one kernel call) and the new (i.e., multiple kernel calls) implementations:
# 1D original
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 164.444 μs … 188.238 ms ┊ GC (min … max): 0.00% … 13.02%
Time (median): 176.639 μs ┊ GC (median): 0.00%
Time (mean ± σ): 197.488 μs ± 1.881 ms ┊ GC (mean ± σ): 1.24% ± 0.13%
▅█▆▄ ▁▂▁
▁▃██████▅▄▅▅▄▃▂▂▂▂▂▂▄▅▆▆▇███▆▅▄▄▃▃▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
164 μs Histogram: frequency by time 213 μs <
Memory estimate: 8.27 KiB, allocs estimate: 178.
# 1D new
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 176.074 μs … 165.794 ms ┊ GC (min … max): 0.00% … 12.84%
Time (median): 188.641 μs ┊ GC (median): 0.00%
Time (mean ± σ): 224.706 μs ± 2.326 ms ┊ GC (mean ± σ): 1.98% ± 0.19%
▁█▆▂ ▁▄▅▄
▁▂▂▃▅█████▇██████▅▃▃▃▃▂▃▅▇▇▆▅▄▃▄▄▄▄▄▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
176 μs Histogram: frequency by time 229 μs <
Memory estimate: 10.61 KiB, allocs estimate: 222.
# 2D original
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 184.924 μs … 27.644 ms ┊ GC (min … max): 0.00% … 25.11%
Time (median): 198.877 μs ┊ GC (median): 0.00%
Time (mean ± σ): 227.313 μs ± 842.936 μs ┊ GC (mean ± σ): 2.92% ± 0.78%
▁▃▇▆▆▄▄▄▃▅▇██▇▅▅▄
▁▁▂▃▇██████████████████▆▅▅▄▄▄▃▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▂▂▂▂▁▁▁▁▁▁▁▁▁ ▄
185 μs Histogram: frequency by time 238 μs <
Memory estimate: 74.34 KiB, allocs estimate: 194.
# 2D new
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 240.740 μs … 33.349 ms ┊ GC (min … max): 0.00% … 23.65%
Time (median): 258.245 μs ┊ GC (median): 0.00%
Time (mean ± σ): 305.800 μs ± 1.220 ms ┊ GC (mean ± σ): 3.57% ± 0.89%
▁▄▇█▆▆▅▆▆█▆▆▄▂▁
▁▂▅▆▆▆▆▆▅▆████████████████▆▄▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃▃▃▃▂▂▂▁▁▁▁▁▁ ▄
241 μs Histogram: frequency by time 300 μs <
Memory estimate: 82.05 KiB, allocs estimate: 334.
There are still some errors with the new 3D kernel and I will update them later.
In the case of 3D, I first tried making direction a variable (i.e., one kernel prototype), as in 1D and 2D here
https://github.com/huiyuxie/trixi_cuda/blob/e23e596d8c71364fe8d22239c80c762f87153103/cuda_dg_1d.jl#L375-L404
https://github.com/huiyuxie/trixi_cuda/blob/e23e596d8c71364fe8d22239c80c762f87153103/cuda_dg_2d.jl#L428-L458
but it caused the same errors as mentioned above, see
ERROR: InvalidIRError: compiling MethodInstance for boundary_flux_kernel!(::CuDeviceArray{Float32, 5, 1}, ::CuDeviceArray{Float32, 5, 1}, ::CuDeviceArray{Float32, 4, 1}, ::Float64, ::CuDeviceVector{Int64, 1}, ::Int64, ::CuDeviceVector{Int64, 1}, ::CuDeviceVector{Int64, 1}, ::CuDeviceVector{Int64, 1}, ::NamedTuple{(:x_neg, :x_pos, :y_neg, :y_pos, :z_neg, :z_pos), Tuple{typeof(boundary_condition_poisson_nonperiodic), typeof(boundary_condition_poisson_nonperiodic), Vararg{BoundaryConditionPeriodic, 4}}}, ::HyperbolicDiffusionEquations3D{Float32}, ::FluxLaxFriedrichs{typeof(max_abs_speed_naive)}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to ijl_get_nth_field_checked)
Stacktrace:
[1] getindex
@ ./namedtuple.jl:136
[2] boundary_flux_kernel!
@ ~/trixi_cuda/cuda_dg_3d.jl:473
Reason: unsupported dynamic function invocation
Stacktrace:
[1] boundary_flux_kernel!
@ ~/trixi_cuda/cuda_dg_3d.jl:474
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
[1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module)
@ GPUCompiler ~/.julia/packages/GPUCompiler/NVLGB/src/validation.jl:149
[2] macro expansion
@ ~/.julia/packages/GPUCompiler/NVLGB/src/driver.jl:411 [inlined]
[3] macro expansion
@ ~/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:253 [inlined]
[4] macro expansion
@ ~/.julia/packages/GPUCompiler/NVLGB/src/driver.jl:410 [inlined]
[5] emit_llvm(job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, only_entry::Bool, validate::Bool, ctx::LLVM.ThreadSafeContext)
@ GPUCompiler ~/.julia/packages/GPUCompiler/NVLGB/src/utils.jl:89
[6] emit_llvm
@ ~/.julia/packages/GPUCompiler/NVLGB/src/utils.jl:83 [inlined]
[7] codegen(output::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing, ctx::LLVM.ThreadSafeContext)
@ GPUCompiler ~/.julia/packages/GPUCompiler/NVLGB/src/driver.jl:118
[8] codegen
@ ~/.julia/packages/GPUCompiler/NVLGB/src/driver.jl:92 [inlined]
[9] compile(target::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool, ctx::LLVM.ThreadSafeContext)
@ GPUCompiler ~/.julia/packages/GPUCompiler/NVLGB/src/driver.jl:88
[10] compile
@ ~/.julia/packages/GPUCompiler/NVLGB/src/driver.jl:79 [inlined]
[11] compile(job::GPUCompiler.CompilerJob, ctx::LLVM.ThreadSafeContext)
@ CUDA ~/.julia/packages/CUDA/pCcGc/src/compiler/compilation.jl:125
[12] #1032
@ ~/.julia/packages/CUDA/pCcGc/src/compiler/compilation.jl:120 [inlined]
[13] LLVM.ThreadSafeContext(f::CUDA.var"#1032#1033"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}})
@ LLVM ~/.julia/packages/LLVM/5aiiG/src/executionengine/ts_module.jl:14
[14] JuliaContext
@ ~/.julia/packages/GPUCompiler/NVLGB/src/driver.jl:35 [inlined]
[15] compile
@ ~/.julia/packages/CUDA/pCcGc/src/compiler/compilation.jl:119 [inlined]
[16] actual_compilation(cache::Dict{Any, Any}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
@ GPUCompiler ~/.julia/packages/GPUCompiler/NVLGB/src/execution.jl:125
[17] cached_compilation(cache::Dict{Any, Any}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function)
@ GPUCompiler ~/.julia/packages/GPUCompiler/NVLGB/src/execution.jl:103
[18] macro expansion
@ ~/.julia/packages/CUDA/pCcGc/src/compiler/execution.jl:318 [inlined]
[19] macro expansion
@ ./lock.jl:267 [inlined]
[20] cufunction(f::typeof(boundary_flux_kernel!), tt::Type{Tuple{CuDeviceArray{Float32, 5, 1}, CuDeviceArray{Float32, 5, 1}, CuDeviceArray{Float32, 4, 1}, Float64, CuDeviceVector{Int64, 1}, Int64, CuDeviceVector{Int64, 1}, CuDeviceVector{Int64, 1}, CuDeviceVector{Int64, 1}, NamedTuple{(:x_neg, :x_pos, :y_neg, :y_pos, :z_neg, :z_pos), Tuple{typeof(boundary_condition_poisson_nonperiodic), typeof(boundary_condition_poisson_nonperiodic), Vararg{BoundaryConditionPeriodic, 4}}}, HyperbolicDiffusionEquations3D{Float32}, FluxLaxFriedrichs{typeof(max_abs_speed_naive)}}}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ CUDA ~/.julia/packages/CUDA/pCcGc/src/compiler/execution.jl:313
[21] cufunction
@ ~/.julia/packages/CUDA/pCcGc/src/compiler/execution.jl:310 [inlined]
[22] macro expansion
@ ~/.julia/packages/CUDA/pCcGc/src/compiler/execution.jl:104 [inlined]
[23] cuda_boundary_flux!(t::Float64, mesh::TreeMesh{3, SerialTree{3}}, boundary_conditions::NamedTuple{(:x_neg, :x_pos, :y_neg, :y_pos, :z_neg, :z_pos), Tuple{typeof(boundary_condition_poisson_nonperiodic), typeof(boundary_condition_poisson_nonperiodic), Vararg{BoundaryConditionPeriodic, 4}}}, equations::HyperbolicDiffusionEquations3D{Float32}, dg::DGSEM{LobattoLegendreBasis{Float64, 5, SVector{5, Float64}, Matrix{Float64}, Matrix{Float64}, Matrix{Float64}}, LobattoLegendreMortarL2{Float64, 5, Matrix{Float64}, Matrix{Float64}}, SurfaceIntegralWeakForm{FluxLaxFriedrichs{typeof(max_abs_speed_naive)}}, VolumeIntegralWeakForm}, cache::NamedTuple{(:elements, :interfaces, :boundaries, :mortars, :fstar_upper_left_threaded, :fstar_upper_right_threaded, :fstar_lower_left_threaded, :fstar_lower_right_threaded, :fstar_tmp1_threaded), Tuple{ElementContainer3D{Float64, Float64}, InterfaceContainer3D{Float64}, BoundaryContainer3D{Float64, Float64}, L2MortarContainer3D{Float64}, Vararg{Vector{Array{Float64, 3}}, 5}}})
@ Main ~/trixi_cuda/cuda_dg_3d.jl:560
[24] macro expansion
@ ~/trixi_cuda/cuda_dg_3d.jl:784 [inlined]
[25] var"##core#6648"()
@ Main ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:489
[26] var"##sample#6649"(::Tuple{}, __params::BenchmarkTools.Parameters)
@ Main ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:495
[27] _run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; verbose::Bool, pad::String, kwargs::Base.Pairs{Symbol, Integer, NTuple{4, Symbol}, NamedTuple{(:samples, :evals, :gctrial, :gcsample), Tuple{Int64, Int64, Bool, Bool}}})
@ BenchmarkTools ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:99
[28] #invokelatest#2
@ ./essentials.jl:818 [inlined]
[29] invokelatest
@ ./essentials.jl:813 [inlined]
[30] #run_result#45
@ ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:34 [inlined]
[31] run_result
@ ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:34 [inlined]
[32] run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, kwargs::Base.Pairs{Symbol, Integer, NTuple{5, Symbol}, NamedTuple{(:verbose, :samples, :evals, :gctrial, :gcsample), Tuple{Bool, Int64, Int64, Bool, Bool}}})
@ BenchmarkTools ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:117
[33] run (repeats 2 times)
@ ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:117 [inlined]
[34] #warmup#54
@ ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:169 [inlined]
[35] warmup(item::BenchmarkTools.Benchmark)
@ BenchmarkTools ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:168
[36] top-level scope
@ ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:393
So I then turned to the approach that explicitly specifies the value of direction (i.e., I have to create six different kernel prototypes for 3D), which is really troublesome,
https://github.com/huiyuxie/trixi_cuda/blob/e23e596d8c71364fe8d22239c80c762f87153103/cuda_dg_3d.jl#L450-L653
but it still caused problems. The errors are slightly different this time, see
ERROR: InvalidIRError: compiling MethodInstance for boundary_flux_kernel3!(::CuDeviceArray{Float32, 5, 1}, ::CuDeviceArray{Float32, 5, 1}, ::CuDeviceArray{Float32, 4, 1}, ::Float64, ::CuDeviceVector{Int64, 1}, ::CuDeviceVector{Int64, 1}, ::CuDeviceVector{Int64, 1}, ::CuDeviceVector{Int64, 1}, ::NamedTuple{(:x_neg, :x_pos, :y_neg, :y_pos, :z_neg, :z_pos), Tuple{typeof(boundary_condition_poisson_nonperiodic), typeof(boundary_condition_poisson_nonperiodic), Vararg{BoundaryConditionPeriodic, 4}}}, ::HyperbolicDiffusionEquations3D{Float32}, ::FluxLaxFriedrichs{typeof(max_abs_speed_naive)}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to boundary_condition_periodic)
Stacktrace:
[1] boundary_flux_kernel3!
@ ~/trixi_cuda/cuda_dg_3d.jl:541
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
[1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module)
@ GPUCompiler ~/.julia/packages/GPUCompiler/NVLGB/src/validation.jl:149
[2] macro expansion
@ ~/.julia/packages/GPUCompiler/NVLGB/src/driver.jl:411 [inlined]
[3] macro expansion
@ ~/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:253 [inlined]
[4] macro expansion
@ ~/.julia/packages/GPUCompiler/NVLGB/src/driver.jl:410 [inlined]
[5] emit_llvm(job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, only_entry::Bool, validate::Bool, ctx::LLVM.ThreadSafeContext)
@ GPUCompiler ~/.julia/packages/GPUCompiler/NVLGB/src/utils.jl:89
[6] emit_llvm
@ ~/.julia/packages/GPUCompiler/NVLGB/src/utils.jl:83 [inlined]
[7] codegen(output::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing, ctx::LLVM.ThreadSafeContext)
@ GPUCompiler ~/.julia/packages/GPUCompiler/NVLGB/src/driver.jl:118
[8] codegen
@ ~/.julia/packages/GPUCompiler/NVLGB/src/driver.jl:92 [inlined]
[9] compile(target::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool, ctx::LLVM.ThreadSafeContext)
@ GPUCompiler ~/.julia/packages/GPUCompiler/NVLGB/src/driver.jl:88
[10] compile
@ ~/.julia/packages/GPUCompiler/NVLGB/src/driver.jl:79 [inlined]
[11] compile(job::GPUCompiler.CompilerJob, ctx::LLVM.ThreadSafeContext)
@ CUDA ~/.julia/packages/CUDA/pCcGc/src/compiler/compilation.jl:125
[12] #1032
@ ~/.julia/packages/CUDA/pCcGc/src/compiler/compilation.jl:120 [inlined]
[13] LLVM.ThreadSafeContext(f::CUDA.var"#1032#1033"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}})
@ LLVM ~/.julia/packages/LLVM/5aiiG/src/executionengine/ts_module.jl:14
[14] JuliaContext
@ ~/.julia/packages/GPUCompiler/NVLGB/src/driver.jl:35 [inlined]
[15] compile
@ ~/.julia/packages/CUDA/pCcGc/src/compiler/compilation.jl:119 [inlined]
[16] actual_compilation(cache::Dict{Any, Any}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
@ GPUCompiler ~/.julia/packages/GPUCompiler/NVLGB/src/execution.jl:125
[17] cached_compilation(cache::Dict{Any, Any}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function)
@ GPUCompiler ~/.julia/packages/GPUCompiler/NVLGB/src/execution.jl:103
[18] macro expansion
@ ~/.julia/packages/CUDA/pCcGc/src/compiler/execution.jl:318 [inlined]
[19] macro expansion
@ ./lock.jl:267 [inlined]
[20] cufunction(f::typeof(boundary_flux_kernel3!), tt::Type{Tuple{CuDeviceArray{Float32, 5, 1}, CuDeviceArray{Float32, 5, 1}, CuDeviceArray{Float32, 4, 1}, Float64, CuDeviceVector{Int64, 1}, CuDeviceVector{Int64, 1}, CuDeviceVector{Int64, 1}, CuDeviceVector{Int64, 1}, NamedTuple{(:x_neg, :x_pos, :y_neg, :y_pos, :z_neg, :z_pos), Tuple{typeof(boundary_condition_poisson_nonperiodic), typeof(boundary_condition_poisson_nonperiodic), Vararg{BoundaryConditionPeriodic, 4}}}, HyperbolicDiffusionEquations3D{Float32}, FluxLaxFriedrichs{typeof(max_abs_speed_naive)}}}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ CUDA ~/.julia/packages/CUDA/pCcGc/src/compiler/execution.jl:313
[21] cufunction
@ ~/.julia/packages/CUDA/pCcGc/src/compiler/execution.jl:310 [inlined]
[22] macro expansion
@ ~/.julia/packages/CUDA/pCcGc/src/compiler/execution.jl:104 [inlined]
[23] cuda_boundary_flux!(t::Float64, mesh::TreeMesh{3, SerialTree{3}}, boundary_conditions::NamedTuple{(:x_neg, :x_pos, :y_neg, :y_pos, :z_neg, :z_pos), Tuple{typeof(boundary_condition_poisson_nonperiodic), typeof(boundary_condition_poisson_nonperiodic), Vararg{BoundaryConditionPeriodic, 4}}}, equations::HyperbolicDiffusionEquations3D{Float32}, dg::DGSEM{LobattoLegendreBasis{Float64, 5, SVector{5, Float64}, Matrix{Float64}, Matrix{Float64}, Matrix{Float64}}, LobattoLegendreMortarL2{Float64, 5, Matrix{Float64}, Matrix{Float64}}, SurfaceIntegralWeakForm{FluxLaxFriedrichs{typeof(max_abs_speed_naive)}}, VolumeIntegralWeakForm}, cache::NamedTuple{(:elements, :interfaces, :boundaries, :mortars, :fstar_upper_left_threaded, :fstar_upper_right_threaded, :fstar_lower_left_threaded, :fstar_lower_right_threaded, :fstar_tmp1_threaded), Tuple{ElementContainer3D{Float64, Float64}, InterfaceContainer3D{Float64}, BoundaryContainer3D{Float64, Float64}, L2MortarContainer3D{Float64}, Vararg{Vector{Array{Float64, 3}}, 5}}})
@ Main ~/trixi_cuda/cuda_dg_3d.jl:743
[24] macro expansion
@ ~/trixi_cuda/cuda_dg_3d.jl:953 [inlined]
[25] var"##core#6916"()
@ Main ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:489
[26] var"##sample#6917"(::Tuple{}, __params::BenchmarkTools.Parameters)
@ Main ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:495
[27] _run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; verbose::Bool, pad::String, kwargs::Base.Pairs{Symbol, Integer, NTuple{4, Symbol}, NamedTuple{(:samples, :evals, :gctrial, :gcsample), Tuple{Int64, Int64, Bool, Bool}}})
@ BenchmarkTools ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:99
[28] #invokelatest#2
@ ./essentials.jl:818 [inlined]
[29] invokelatest
@ ./essentials.jl:813 [inlined]
[30] #run_result#45
@ ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:34 [inlined]
[31] run_result
@ ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:34 [inlined]
[32] run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, kwargs::Base.Pairs{Symbol, Integer, NTuple{5, Symbol}, NamedTuple{(:verbose, :samples, :evals, :gctrial, :gcsample), Tuple{Bool, Int64, Int64, Bool, Bool}}})
@ BenchmarkTools ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:117
[33] run (repeats 2 times)
@ ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:117 [inlined]
[34] #warmup#54
@ ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:169 [inlined]
[35] warmup(item::BenchmarkTools.Benchmark)
@ BenchmarkTools ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:168
[36] top-level scope
@ ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:393
For me, the most informative line is
Reason: unsupported dynamic function invocation (call to boundary_condition_periodic)
I think the errors come from the fact that when we call something like
https://github.com/huiyuxie/trixi_cuda/blob/e23e596d8c71364fe8d22239c80c762f87153103/cuda_dg_3d.jl#L472C9-L472C9, the function is chosen from boundary_conditions and stored in boundary_condition, but the types of those boundary_conditions are not specified in the kernel prototype, so the compiler reports a dynamic function invocation. Since GPU calls cannot be too dynamic, the argument types have to be specified in the kernel prototype (like flux::Function).
There is a solution to this error, but it would also be very troublesome. For example, we could prototype each function candidate from boundary_conditions and then use if/else conditions to control the kernel calls. But that is much more troublesome and it certainly makes me uncomfortable (as it seems this could be completed in one clean kernel call).
I asked the same question in the GPU community https://discourse.julialang.org/t/is-there-any-good-way-to-call-functions-from-a-set-of-functions-in-a-cuda-kernel/102051. If no one replies or the answer is still not satisfying, I will use my own imagination to solve the issue. Now, I think it is better to work on other kernels first. :)
The problem in the second approach you mentioned above is that boundary_condition_periodic is just a singleton that cannot be called. We use it to mark a boundary as periodic - meaning that it is included in the internal interfaces and does not need to be handled by calc_boundary_flux! etc. Thus, we can just skip it here. That's what we do in Trixi.jl in
https://github.com/trixi-framework/Trixi.jl/blob/fe6a818a8459d6beef3969c1fd2d5cc7ddf596df/src/solvers/dgsem_tree/dg_3d.jl#L768-L771
Did you try implementing the 3D stuff as we do in
https://github.com/trixi-framework/Trixi.jl/blob/fe6a818a8459d6beef3969c1fd2d5cc7ddf596df/src/solvers/dgsem_tree/dg_3d.jl#L773-L801
but replacing the calls to calc_boundary_flux_by_direction! with calls to the corresponding CUDA kernels? I think this should work as long as you implement something like
https://github.com/trixi-framework/Trixi.jl/blob/fe6a818a8459d6beef3969c1fd2d5cc7ddf596df/src/solvers/dgsem_tree/dg_3d.jl#L768-L771
to skip periodic boundary conditions. For such a kernel, the boundary_condition function is passed explicitly as an argument, so that its type (and thus its implementation) is known at compile time.
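To illustrate this pattern, here is a stripped-down, self-contained sketch assuming CUDA.jl; all names (demo_boundary_kernel!, bc_reflect, PeriodicDemo, ...) are made up, and the real kernel of course takes many more arguments:

using CUDA

struct PeriodicDemo end                       # stand-in for the periodic singleton
const periodic_demo = PeriodicDemo()

bc_reflect(u) = -u                            # toy "boundary condition" functions
bc_zero(u) = zero(u)

# The boundary condition arrives as an explicit, concretely typed argument,
# so the GPU compiler can resolve the call statically.
function demo_boundary_kernel!(out, u, boundary_condition)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(out)
        @inbounds out[i] = boundary_condition(u[i])
    end
    return nothing
end

boundary_conditions = (x_neg = bc_reflect, x_pos = bc_reflect,
                       y_neg = periodic_demo, y_pos = periodic_demo,
                       z_neg = bc_zero, z_pos = bc_zero)

u = CUDA.rand(Float32, 64)
out = similar(u)

# One launch per direction; periodic directions are skipped on the host,
# so the singleton never reaches the GPU compiler.
for direction in 1:length(boundary_conditions)
    bc = boundary_conditions[direction]
    bc === periodic_demo && continue
    @cuda threads=64 demo_boundary_kernel!(out, u, bc)
end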
Yes! I realized that it could be implemented by directly passing something like boundary_condition::Function as the kernel argument last night while I was lying in bed (it was too late, so I did not update here). I don't know why this did not occur to me earlier yesterday. Perhaps I was not that interested in implementing it this way. Anyway, I will try it as soon as possible tonight. Thanks for your information and advice!
I don't quite understand
Thus, we can just skip it here. That's what we do in Trixi.jl in https://github.com/trixi-framework/Trixi.jl/blob/fe6a818a8459d6beef3969c1fd2d5cc7ddf596df/src/solvers/dgsem_tree/dg_3d.jl#L768-L771
You did not skip boundary_condition_periodic when it appears in a setup like https://github.com/huiyuxie/trixi_cuda/blob/6b736229fdea3466f008e4b5edac2ddad1dc566f/tests/hypdiff_nonperiodic_3d.jl#L5-L10, as that invokes https://github.com/trixi-framework/Trixi.jl/blob/fe6a818a8459d6beef3969c1fd2d5cc7ddf596df/src/solvers/dgsem_tree/dg_3d.jl#L773 instead of https://github.com/trixi-framework/Trixi.jl/blob/fe6a818a8459d6beef3969c1fd2d5cc7ddf596df/src/solvers/dgsem_tree/dg_3d.jl#L768, which you mentioned above. And it further invokes https://github.com/trixi-framework/Trixi.jl/blob/fe6a818a8459d6beef3969c1fd2d5cc7ddf596df/src/solvers/dgsem_tree/dg_3d.jl#L824-L826. I think I cannot just skip them directly, as each element from boundary_conditions should be called to stay at least consistent with the CPU code, and we also don't know in advance where boundary_condition_periodic sits in boundary_conditions each time.
Unfortunately, the same errors occurred when I ran https://github.com/huiyuxie/trixi_cuda/blob/7dab85ce52c00cc5db23eee207696c547442f25b/cuda_dg_3d.jl#L449-L482 and here are the errors
ERROR: InvalidIRError: compiling MethodInstance for boundary_flux_kernel!(::CuDeviceArray{Float32, 5, 1}, ::CuDeviceArray{Float32, 5, 1}, ::CuDeviceArray{Float32, 4, 1}, ::Float32, ::CuDeviceVector{Int64, 1}, ::Int64, ::CuDeviceVector{Int64, 1}, ::CuDeviceVector{Int64, 1}, ::CuDeviceVector{Int64, 1}, ::BoundaryConditionPeriodic, ::HyperbolicDiffusionEquations3D{Float32}, ::FluxLaxFriedrichs{typeof(max_abs_speed_naive)}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to boundary_condition_periodic)
Stacktrace:
[1] boundary_flux_kernel!
@ ~/trixi_cuda/cuda_dg_3d.jl:472
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
[1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module)
@ GPUCompiler ~/.julia/packages/GPUCompiler/NVLGB/src/validation.jl:149
[2] macro expansion
@ ~/.julia/packages/GPUCompiler/NVLGB/src/driver.jl:411 [inlined]
[3] macro expansion
@ ~/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:253 [inlined]
[4] macro expansion
@ ~/.julia/packages/GPUCompiler/NVLGB/src/driver.jl:410 [inlined]
[5] emit_llvm(job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, only_entry::Bool, validate::Bool, ctx::LLVM.ThreadSafeContext)
@ GPUCompiler ~/.julia/packages/GPUCompiler/NVLGB/src/utils.jl:89
[6] emit_llvm
@ ~/.julia/packages/GPUCompiler/NVLGB/src/utils.jl:83 [inlined]
[7] codegen(output::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing, ctx::LLVM.ThreadSafeContext)
@ GPUCompiler ~/.julia/packages/GPUCompiler/NVLGB/src/driver.jl:118
[8] codegen
@ ~/.julia/packages/GPUCompiler/NVLGB/src/driver.jl:92 [inlined]
[9] compile(target::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool, ctx::LLVM.ThreadSafeContext)
@ GPUCompiler ~/.julia/packages/GPUCompiler/NVLGB/src/driver.jl:88
[10] compile
@ ~/.julia/packages/GPUCompiler/NVLGB/src/driver.jl:79 [inlined]
[11] compile(job::GPUCompiler.CompilerJob, ctx::LLVM.ThreadSafeContext)
@ CUDA ~/.julia/packages/CUDA/pCcGc/src/compiler/compilation.jl:125
[12] #1032
@ ~/.julia/packages/CUDA/pCcGc/src/compiler/compilation.jl:120 [inlined]
[13] LLVM.ThreadSafeContext(f::CUDA.var"#1032#1033"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}})
@ LLVM ~/.julia/packages/LLVM/5aiiG/src/executionengine/ts_module.jl:14
[14] JuliaContext
@ ~/.julia/packages/GPUCompiler/NVLGB/src/driver.jl:35 [inlined]
[15] compile
@ ~/.julia/packages/CUDA/pCcGc/src/compiler/compilation.jl:119 [inlined]
[16] actual_compilation(cache::Dict{Any, Any}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
@ GPUCompiler ~/.julia/packages/GPUCompiler/NVLGB/src/execution.jl:125
[17] cached_compilation(cache::Dict{Any, Any}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function)
@ GPUCompiler ~/.julia/packages/GPUCompiler/NVLGB/src/execution.jl:103
[18] macro expansion
@ ~/.julia/packages/CUDA/pCcGc/src/compiler/execution.jl:318 [inlined]
[19] macro expansion
@ ./lock.jl:267 [inlined]
[20] cufunction(f::typeof(boundary_flux_kernel!), tt::Type{Tuple{CuDeviceArray{Float32, 5, 1}, CuDeviceArray{Float32, 5, 1}, CuDeviceArray{Float32, 4, 1}, Float32, CuDeviceVector{Int64, 1}, Int64, CuDeviceVector{Int64, 1}, CuDeviceVector{Int64, 1}, CuDeviceVector{Int64, 1}, BoundaryConditionPeriodic, HyperbolicDiffusionEquations3D{Float32}, FluxLaxFriedrichs{typeof(max_abs_speed_naive)}}}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ CUDA ~/.julia/packages/CUDA/pCcGc/src/compiler/execution.jl:313
[21] cufunction
@ ~/.julia/packages/CUDA/pCcGc/src/compiler/execution.jl:310 [inlined]
[22] macro expansion
@ ~/.julia/packages/CUDA/pCcGc/src/compiler/execution.jl:104 [inlined]
[23] cuda_boundary_flux!(t::Float32, mesh::TreeMesh{3, SerialTree{3}}, boundary_conditions::NamedTuple{(:x_neg, :x_pos, :y_neg, :y_pos, :z_neg, :z_pos), Tuple{typeof(boundary_condition_poisson_nonperiodic), typeof(boundary_condition_poisson_nonperiodic), Vararg{BoundaryConditionPeriodic, 4}}}, equations::HyperbolicDiffusionEquations3D{Float32}, dg::DGSEM{LobattoLegendreBasis{Float64, 5, SVector{5, Float64}, Matrix{Float64}, Matrix{Float64}, Matrix{Float64}}, LobattoLegendreMortarL2{Float64, 5, Matrix{Float64}, Matrix{Float64}}, SurfaceIntegralWeakForm{FluxLaxFriedrichs{typeof(max_abs_speed_naive)}}, VolumeIntegralWeakForm}, cache::NamedTuple{(:elements, :interfaces, :boundaries, :mortars, :fstar_upper_left_threaded, :fstar_upper_right_threaded, :fstar_lower_left_threaded, :fstar_lower_right_threaded, :fstar_tmp1_threaded), Tuple{ElementContainer3D{Float64, Float64}, InterfaceContainer3D{Float64}, BoundaryContainer3D{Float64, Float64}, L2MortarContainer3D{Float64}, Vararg{Vector{Array{Float64, 3}}, 5}}})
@ Main ~/trixi_cuda/cuda_dg_3d.jl:572
[24] top-level scope
@ ~/trixi_cuda/cuda_dg_3d.jl:783
The problem should still be boundary_condition_periodic. If this is not callable, as you said
The problem in the second approach you mentioned above is that boundary_condition_periodic is just a singleton that cannot be called.
then how could the lines below run successfully? https://github.com/trixi-framework/Trixi.jl/blob/fe6a818a8459d6beef3969c1fd2d5cc7ddf596df/src/solvers/dgsem_tree/dg_3d.jl#L824-L826
I don't quite understand
Thus, we can just skip it here. That's what we do in Trixi.jl in https://github.com/trixi-framework/Trixi.jl/blob/fe6a818a8459d6beef3969c1fd2d5cc7ddf596df/src/solvers/dgsem_tree/dg_3d.jl#L768-L771
Sorry, you're right. We just skip this step if all boundaries are periodic. If only some boundaries are periodic, the number of boundaries per direction used in
https://github.com/trixi-framework/Trixi.jl/blob/fe6a818a8459d6beef3969c1fd2d5cc7ddf596df/src/solvers/dgsem_tree/dg_3d.jl#L779
is just zero for periodic boundaries (where the corresponding boundary condition is boundary_condition_periodic).
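For concreteness, a small hypothetical illustration of how a zero count per direction leads to an empty index range (the numbers are made up, and the exact code in Trixi.jl may differ slightly):

# Only the two x boundaries are non-periodic in this example.
n_boundaries_per_direction = [2, 2, 0, 0, 0, 0]

lasts  = cumsum(n_boundaries_per_direction)              # [2, 4, 4, 4, 4, 4]
firsts = lasts .- n_boundaries_per_direction .+ 1        # [1, 3, 5, 5, 5, 5]

# For a periodic direction such as 3, firsts[3] = 5 > lasts[3] = 4,
# so a loop `for boundary in firsts[3]:lasts[3]` executes its body zero times.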
The solution could be to replace each call like
calc_boundary_flux_by_direction!(surface_flux_values, t, boundary_conditions[1],
                                 equations, surface_integral, dg, cache,
                                 1, firsts[1], lasts[1])
in https://github.com/trixi-framework/Trixi.jl/blob/fe6a818a8459d6beef3969c1fd2d5cc7ddf596df/src/solvers/dgsem_tree/dg_3d.jl#L783-L785 by something like
if boundary_conditions[1] != boundary_condition_periodic
    calc_boundary_flux_by_direction!(surface_flux_values, t, boundary_conditions[1],
                                     equations, surface_integral, dg, cache,
                                     1, firsts[1], lasts[1])
end
with the corresponding CUDA kernels instead of calc_boundary_flux_by_direction!.
Thanks for your reply! But you still did not explain why boundary_condition_periodic is not callable, as you said
The problem in the second approach you mentioned above is that boundary_condition_periodic is just a singleton that cannot be called.
while the lines below still run successfully: https://github.com/trixi-framework/Trixi.jl/blob/fe6a818a8459d6beef3969c1fd2d5cc7ddf596df/src/solvers/dgsem_tree/dg_3d.jl#L824-L826 (i.e., why didn't you use something like
if boundary_conditions[1] != boundary_condition_periodic
    calc_boundary_flux_by_direction!(surface_flux_values, t, boundary_conditions[1],
                                     equations, surface_integral, dg, cache,
                                     1, firsts[1], lasts[1])
end
in your CPU code?) Thanks!
This time the kernel ran successfully, see https://github.com/huiyuxie/trixi_cuda/blob/31bc9050ee8a45176a333ea711a3b22e13a8326d/cuda_dg_3d.jl#L556-L608. But that is too much code for a kernel like this and it should be implemented in a simpler way. I will work on making them into one call based on the ideas from the GPU community. Also, I will keep this issue open until a single call has finally been achieved.
Ok
One suggestion: Could you @ranocha please double-check your answer and my question each time before replying? I tend to take your answers as absolutely correct (based on your background), so when I notice a discrepancy, I first doubt my own understanding. This causes me to spend extra time verifying things. I don't mind putting in extra effort, but I do mind wasting time on something like this. This is not a criticism but a friendly suggestion. If it comes off as rude, I apologize. I will delete this message once you have seen it :)
I try to answer your questions as well as I can. If you require that everything must be 100% correct every time, then there is no way to continue any collaboration in a meaningful way. I am pretty sure you understand that humans make errors from time to time.
Yes, and I completely understand your situation. I know you're busy with other things, like professors often are with their research during the summer, and that some people in the Slack chat seem indifferent to this project. As a result, you're left answering almost all my questions. I understand that if I were in the same situation, I would likely make mistakes as well. Also, I know that the more you do, the more mistakes are likely to happen. However, I see this as an issue with the distribution of work among mentors, not with your efforts. I have no complaints about your work so far and I truly value your responses. So this is simply a suggestion to double-check your answers.
If my words made you uncomfortable, then that is my fault. I apologize to you @ranocha.
Now I don't know whether it is suitable to ask the same question for the third time here (as you have skipped it twice). It is right before the commit flow control #6 in this issue thread. If you are still willing to answer it, thanks (if not, I understand). Again, I truly apologize for my words before.
Thanks for your reply! But you still did not explain why boundary_condition_periodic is not callable, as you said
The problem in the second approach you mentioned above is that boundary_condition_periodic is just a singleton that cannot be called.
while the lines below still run successfully: https://github.com/trixi-framework/Trixi.jl/blob/fe6a818a8459d6beef3969c1fd2d5cc7ddf596df/src/solvers/dgsem_tree/dg_3d.jl#L824-L826 (i.e., why didn't you use something like
if boundary_conditions[1] != boundary_condition_periodic
    calc_boundary_flux_by_direction!(surface_flux_values, t, boundary_conditions[1],
                                     equations, surface_integral, dg, cache,
                                     1, firsts[1], lasts[1])
end
in your CPU code?) Thanks!
We didn't use it for the CPU code since there are fewer restrictions than for GPU code. In particular, we just didn't recognize the issue. On CPUs, the function compiles fine with arguments firsts[1], lasts[1] such that there is just an empty loop in the function. Thus, it never tries to call boundary_condition_periodic. This cannot be detected at compile time (since it depends on the values of the indices determining the loop). On CPUs, it would just throw an exception if it ever tried to call boundary_condition_periodic. This is not supported on GPUs, which use stricter compile-time requirements to enforce this. That's why there is a difference between the CPU and GPU versions.
We could of course also use the logic with the additional check boundary_conditions[1] != boundary_condition_periodic on CPUs - it is just not necessary to make it work, and I do not expect a significant performance impact, so there is not really a good reason to make the CPU code more complex than it is right now. This is of course different for GPUs, since you need something like this check to make the code work.
If my words made you uncomfortable, then it should be my fault.
No worries, we're on the same side and want to make your project successful!
Thank you so much for your reply!
I spent all night reflecting on my faults: 1) when I make mistakes, you never blame me but always encourage me, so I should do the same for you; 2) I understand that these issues arise because someone else is not being responsible, so I shouldn't attribute faults caused by someone else to you; and 3) I'm used to being strict with myself as well as with others at work, but I overlooked the fact that I don't have the right to control other people's work styles.
Also, I fully understand that once something has been said, it cannot be taken back. Even though I've apologized, it has actually hurt you. You are a really nice person and the only one who takes my words seriously and values my work. I shouldn't have said something like that.
I'm not apologizing to receive a good evaluation (in fact, I don't care about that, and I'm not receiving any money from this project), but I'm apologizing because I genuinely value you as a person. I'm not apologizing to seek your forgiveness either, and you shouldn't feel obliged to forgive me if you don't want to. I promise you that something like this will never happen to you again.
This is something I didn't want to happen, but since I caused it, I will take responsibility. I will try to make it up to you in my own way.
Kernels for the boundary flux work for 1D and 2D but fail for 3D. Here is the 3D kernel https://github.com/huiyuxie/trixi_cuda/blob/81bdfcf1abaecaf64badcdb77262bf4ede791d49/cuda_dg_3d.jl#L450-L488 and I suspect the problem comes from the part below, since the kernel compiles successfully when I comment out the code starting from this line. Also, the errors are given as
Anything that could help is welcome!