I think having sparse data structure support is great. It would be useful to discuss how to embed such iteration constraints and transformations as a special kind of Block in TensorIR, so we can leverage the same infrastructure benefits we have so far (meta-schedule, imperative scheduling, and mix-and-match with dense schedules).
Consolidated to #466
Note: This RFC is more of a research idea brainstorm rather than discussion on current status of TIR development.
Halide and TVM decouple the compute and schedule descriptions, so that we can focus solely on optimizing schedules to accelerate workloads (Halide auto-scheduler, Ansor). TACO and Taichi, in contrast, decouple the compute and data format descriptions, and can generate different kernels for different input data formats (usually represented as a tree hierarchy; see the TACO paper).
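For readers less familiar with the first kind of decoupling, here is a minimal sketch using TVM's existing TE API; the operator and split factor are arbitrary examples, not anything prescribed by this RFC:

```python
import tvm
from tvm import te

n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")

# Compute description: *what* to compute.
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

# Schedule description: *how* to compute it, written separately.
s = te.create_schedule(C.op)
xo, xi = s[C].split(C.op.axis[0], factor=32)

print(tvm.lower(s, [A, B, C], simple_mode=True))
```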
There was an RFC discussing whether to add TACO's format declaration and sparse tensor data structures to TVM, but it has not been active for over a year. Without extra schedule primitive support, the benefit of introducing sparse tensors in TVM is not clear, since we only need to maintain several dense tensors: `(indptr, indices, val)` for the CSR format and `(indptr, indices, val, block_size)` for the BSR format. We can then define common operators such as SpMM with these dense tensors as inputs in TVM, and manually write schedules for them (e.g. FeatGraph); see the sketch below for what SpMM over these buffers looks like.

Recently, the TACO team published a paper (we refer to it as TACO-sche) on schedule primitive support for structured data, with a (very limited) auto-scheduler on top of it. The paper builds on TACO. The key idea is that we cannot directly rewrite array access information, because the transformations depend on the non-zero elements of the sparse matrix. Instead, they keep a log of the transformation history (called the provenance graph) and recompute indices, derive loop bounds, and insert iteration guards by inference on the provenance graph (this incurs some runtime overhead, but it is necessary for sparse data).
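As a concrete reminder that a CSR sparse tensor is just a handful of dense buffers, here is a plain NumPy reference of the SpMM mentioned above over `(indptr, indices, val)`; the names and shapes are illustrative, not a proposed API:

```python
import numpy as np

def csr_spmm(indptr, indices, val, B):
    """SpMM: multiply a CSR matrix A (m x k) by a dense matrix B (k x n).

    The sparse operand is just three dense arrays:
      indptr  : int array of shape (m + 1,)
      indices : int array of shape (nnz,), column ids of the non-zeros
      val     : float array of shape (nnz,), values of the non-zeros
    """
    m = indptr.shape[0] - 1
    n = B.shape[1]
    C = np.zeros((m, n), dtype=B.dtype)
    for i in range(m):
        # Non-zeros of row i live in positions indptr[i] .. indptr[i + 1].
        for p in range(indptr[i], indptr[i + 1]):
            C[i, :] += val[p] * B[indices[p], :]
    return C
```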
The schedule primitives supported in TACO-sche are very similar to Halide/TVM's, except for two: `pos` and `coord`. `pos` binds an itervar to a sparse structure (so the loop enumerates the positions of stored non-zeros), while `coord` unbinds an itervar from a sparse structure and makes it a free itervar that iterates over the coordinate space. During the codegen phase, extra statements (e.g. binary search for random access on sparse axes of sparse tensors) are inserted to derive indices and bounds after the transformation. A figure in the paper depicts the rules for loop bound propagation.

The experimental results reported in the TACO-sche paper are not very promising (both SpMM and SDDMM are slower than existing standard libraries). However, their scheduler is very limited (no use of scratchpad memory, no tensorization); moreover, the speed of sparse computation is highly data-dependent, and they did not utilize any information about the input data.
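To make the `pos`/`coord` distinction concrete, here is a rough Python sketch (not TACO's actual generated code) of computing one row of a CSR matrix-vector product in position space versus coordinate space; the coordinate-space loop needs a binary search into `indices` to test whether a coordinate is stored:

```python
import numpy as np

def row_dot_pos(indptr, indices, val, x, i):
    # Position space: the loop enumerates stored non-zeros of row i directly.
    s = 0.0
    for p in range(indptr[i], indptr[i + 1]):
        j = indices[p]                      # coordinate recovered from position
        s += val[p] * x[j]
    return s

def row_dot_coord(indptr, indices, val, x, i):
    # Coordinate space: the loop enumerates every column coordinate j, and a
    # binary search finds the position of j among row i's stored non-zeros.
    s = 0.0
    lo, hi = indptr[i], indptr[i + 1]
    for j in range(len(x)):
        p = lo + np.searchsorted(indices[lo:hi], j)
        if p < hi and indices[p] == j:      # j is actually stored in row i
            s += val[p] * x[j]
    return s
```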
I think the provenance graph is a good design, and we can expect much better performance if we exploit a larger schedule space (especially tensorization, though I'm not sure how much sparse algebra benefits from Tensor Cores; I need to run some experiments to verify this point). Note that all the TACO-sche compiler does is transform iteration graphs (analogous to the tensor IR in this project); it does not change the original storage format of each tensor.
A promising direction is to support a format declaration language in TVM (TIR), together with data-dependent auto-tuning and tensorization support. At the Relay level there is also room to optimize, because we can tune the storage format (e.g. the block size of the block-sparse format). Note that format transformation is costly (it should really be treated as an operator, like `view`), so we can conduct graph-level optimization to globally optimize an end-to-end workload involving sparse algebra; a small illustration follows below.

Potential applications are GNNs, transformers applied to sparse data structures (the sparsity patterns are usually fixed, and I expect auto-tuning can help us find the best schedules), and physics simulation (see Taichi and AutoTaichi, which support autodiff for differentiable rendering).
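As a sketch of treating format transformation as an explicit (and non-trivially costly) operator with tunable parameters, here is a SciPy example; the matrix sizes, density, and block size 16 are arbitrary choices for illustration:

```python
import numpy as np
import scipy.sparse as sp

# A random CSR matrix standing in for, e.g., a graph adjacency matrix.
A_csr = sp.random(4096, 4096, density=0.01, format="csr", dtype=np.float32)
B = np.random.rand(4096, 64).astype(np.float32)

# Format transformation as an operator: CSR -> BSR with a tunable block size.
# The conversion has a real cost, so a graph-level optimizer should decide
# where (and whether) to insert it, much like a layout-transform / `view` op.
A_bsr = A_csr.tobsr(blocksize=(16, 16))

C1 = A_csr @ B   # SpMM on the original format
C2 = A_bsr @ B   # SpMM on the transformed format; results should match
assert np.allclose(C1, C2, atol=1e-4)
```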
This RFC also does not cover codegen for recursive neural networks (e.g. Cortex); though interesting, I don't know yet how to unify the workloads of recursive models and sparse matrix operations.