mitsuba-renderer / drjit

Dr.Jit — A Just-In-Time-Compiler for Differentiable Rendering
BSD 3-Clause "New" or "Revised" License

[Feature Request] Fixed size larger matrices #195

Closed BernardoCovas closed 1 year ago

BernardoCovas commented 1 year ago

Would it be feasible to allow declarations of somewhat larger matrices and vectors as new static drjit types? I mean matrices larger than 4x4, but still small enough to fit in a single kernel's stack, such as 64x64 or 128x128. This would allow neural algorithms involving simple fully connected layers to be embedded in a single, fully fused drjit kernel, including recorded loops.

I was thinking of something along these lines: acquiring a set of feature vectors that are default drjit types (Float32, Array2f, Array3f), combining them into a new statically sized Array256f or Matrix256f, and applying operations to it. I tried unrolled Python loops over smaller 4x4 submatrices, and although this works, the trace and compilation times are large. I am available to contribute as well.
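A minimal sketch of the unrolled block approach described above, assuming the 64x64 weight matrix is stored as a 16x16 grid of Matrix4f tiles and the input as sixteen Array4f segments (the names `blocks` and `segments` are placeholders, not drjit API). Every multiply and add below is traced individually, which is why trace and compilation times grow quickly:

```python
import drjit as dr
from drjit.cuda.ad import Matrix4f, Array4f

def matvec_unrolled(blocks, segments):
    # blocks[i][j]: Matrix4f tile (i, j) of a 64x64 matrix (placeholder data)
    # segments[j]:  Array4f slice j of a 64-dim input vector (placeholder data)
    out = []
    for i in range(16):
        acc = Array4f(0)
        for j in range(16):
            # Dr.Jit matrix types overload `*` as the linear-algebraic
            # product, so this is a traced 4x4 matrix-vector multiply
            acc += blocks[i][j] * segments[j]
        out.append(acc)
    return out  # sixteen Array4f pieces forming the 64-dim result
```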

Thank you

wjakob commented 1 year ago

Having built-in versions of larger matrices might very slightly improve trace times, but it would not help with compilation times: arithmetic involving such built-in arrays is unrolled just like your hand-written Python code. You will likely want to write a recorded loop that uses dr.gather to fetch matrix coefficients.
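A minimal sketch of that suggestion, using the drjit.cuda.ad.Loop API. The memory layout is an assumption for illustration: `weights` is an evaluated, flat n*n buffer in row-major order, and `inputs` is a flat buffer holding one n-dimensional feature vector per lane:

```python
import drjit as dr
from drjit.cuda.ad import Float, UInt32, Loop

def matvec_row(weights, inputs, row, n, lane):
    # Computes y[row] = sum_k W[row, k] * x_lane[k] with a recorded loop
    # instead of unrolled arithmetic; `row` and `n` are Python ints,
    # `lane` is a UInt32 index identifying each thread's feature vector.
    k = UInt32(0)
    accum = Float(0)
    loop = Loop("MatVecRow", lambda: (k, accum))
    while loop(k < n):
        w = dr.gather(Float, weights, row * n + k)  # shared weight coefficient
        x = dr.gather(Float, inputs, lane * n + k)  # per-lane input coefficient
        accum = dr.fma(w, x, accum)
        k += 1
    return accum
```

Because the loop is recorded rather than unrolled, the traced kernel no longer grows with n, so trace and compilation times stay flat even for 64x64 or 128x128 matrices.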

BernardoCovas commented 1 year ago

Hi, understood. Instead of unrolling the operation, as happens when we trace the matrix multiply, I was thinking of a drjit-core JIT op that is somewhat flexible in its parameter size. I would not use a predefined matrix size in that sense; I would pre-declare that a matrix has a given size, such as "using Matrix256f = dr.cuda.ad.MatrixXX(256, 256)" or "using Vector256 = dr.cuda.ad.VectorXf(256)". The drjit-core kernel would inline a loop in the CUDA kernel that performs the matrix multiplication.

For neural algorithms, we can gather the weights and biases, since they are already evaluated (and likely being optimized by Adam), but we need to trace every single addition and multiplication of potentially unevaluated computations that are part of rendering itself (such as ray tracing a hash grid and predicting an RGB color). In a recorded loop, we avoid such traces, and the loop is inlined. Moreover, when we acquire new input parameters from a computation such as a multilevel hash grid, we don't have the evaluated parameters, as they will only eventually be computed for a single ray while the kernel is running.

The idea was to declare that a vector or matrix of fixed size exists, (possibly?) inline it, and allow a recorded loop operation to gather from and scatter to this fixed-size, thread-private buffer as part of an inlined loop, even though its values are unevaluated at the time of tracing.
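For illustration only, the proposal might look like the following sketch. Everything here is hypothetical: no MatrixXX/VectorXf factories or inlined fixed-size matrix ops exist in Dr.Jit, and `weights`/`features` are placeholder buffers:

```python
import drjit as dr

# Hypothetical pre-declared fixed-size types (these factories do NOT exist)
Matrix256f = dr.cuda.ad.MatrixXX(256, 256)
Vector256f = dr.cuda.ad.VectorXf(256)

W = Matrix256f(weights)   # evaluated parameters (e.g. being optimized by Adam)
x = Vector256f(features)  # unevaluated per-ray features, e.g. from a hash grid
y = W @ x                 # traced as one op; drjit-core would inline a loop
                          # over the coefficients in the generated CUDA kernel
```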

wjakob commented 1 year ago

This all sounds neat, but I daresay it is far beyond the scope of what we can support here or actually implement for you via a feature request. My recommendation would be that you hack on Dr.Jit yourself and share your results if they turn out nicely.

BernardoCovas commented 1 year ago

Alright, thank you.