tensor-compiler / taco

The Tensor Algebra Compiler (taco) computes sparse tensor expressions on CPUs and GPUs
http://tensor-compiler.org
Other
1.25k stars 188 forks source link

RFC: Tensor API #189

Open oicirtap opened 5 years ago

oicirtap commented 5 years ago

This issue presents a draft for the new API for Tensor objects in taco. This new API targets to address a number of issues with the current API, as well as integrate new features for increased functionality, simplicity and code portability. Think of this thread as a means for providing user feedback for the proposed new design. See the bottom of this post for links to other related github issues.

For the working API signature, refer to https://github.com/oicirtap/taco/blob/auto_tensor/include/taco/tensor_api.h

The Tensor API Overview

The new Tensor API targets to address the following major goals:

Here we can see a comparison between the current and new APIs. Current API:

// Create formats
Format csr({Dense,Sparse});
Format csf({Sparse,Sparse,Sparse});
Format  sv({Sparse});

// Create tensors
Tensor<double> A({2,3},   csr);
Tensor<double> B({2,3,4}, csf);
Tensor<double> c({4},     sv);

// Insert data into B and c
B.insert({0,0,0}, 1.0);
B.insert({1,2,0}, 2.0);
B.insert({1,2,1}, 3.0);
c.insert({0}, 4.0);
c.insert({1}, 5.0);

// Pack inserted data as described by the formats
B.pack();
c.pack();

// Form a tensor-vector multiplication expression
IndexVar i, j, k;
A(i,j) = B(i,j,k) * c(k);

// Compile the expression
A.compile();

// Assemble A's indices and numerically compute the result
A.assemble();
A.compute();

std::cout << A << std::endl;

New API:

// Create formats
Format csr({Dense,Sparse});
Format csf({Sparse,Sparse,Sparse});
Format  sv({Sparse});

// Create tensors
Tensor<double> A({2,3},   csr);
Tensor<double> B({2,3,4}, csf);
Tensor<double> c({4},     sv);

// Insert data into B and c
B(0,0,0) = 1.0;
B(1,2,0) = 2.0;
B(1,2,1) = 3.0;
c(0) = 4.0;
c(1) = 5.0;

// Form a tensor-vector multiplication expression
IndexVar i, j, k;
A(i,j) = B(i,j,k) * c(k);

std::cout << A << std::endl;

Support Classes and Structs

Coordinate

A small structure to represent a multidimensional coordinate tuple. Its purpose is to provide users with a default way in which to group coordinate components.

Component

A small structure to represent non-zero elements of a tensor. Users can populate tensors out of iterators of components (such as component lists). It is also used to represent non-zeros upon iteration over tensor objects.

The Tensor Class

A major change in the API is the removal of the TensorBase in favor for a unified use of the Tensor templated class as the standard for user representation of tensors. Tensor will be templated based on the tensor component type (type of the values stored within the tensor), and order (number of dimensions of the tensor).

Tensor Random Access

The new API will now support tensor value random access. In addition to the insert() methods, users will be able to call get() to read specific values from the tensor. In addition, the operator()() method will be overloaded to allow shorthand notation for insertion and access to tensor values. This can be seen in the example below.

Tensor<double> B({2,3,4}, csf);
B(0,0,0) = 1.0;
B(1,2,0) = 2.0;
B(1,3,1) = 3.0;

...

double n = B(0,0,0);
std::cout << n << std::endl;

Tensor Value Population

In the current API, tensors can be populated by inserting a set of values to a Tensor object with the insert() method, and then calling the pack() method, which consolidates all the inserted values into a single storage structure specified by the tensor format. In order to insert new values to the tensor after packing, the whole set of values needs to be inserted again (pack() does not take into account existing tensor values). The reason for this is that random insertion into an existing sparse data structure can be really expensive, and it is often faster to build the whole structure from the ground up.

This behavior however is not intuitive. The new API will abstract away the tensor reconstruction nuances by allowing users to insert and remove elements, and operate the tensor at any point, while preserving the preexisting tensor values.

Additionally, a new setFromComponents() method will allow users to first build a collection of Components, and then insert all of them into an empty tensor in a single operation.

Tensor Slices

A new feature that will be supported by the new API is the capability of operating on tensor slices. This includes inserting value slices to a tensor, extracting specific slices, and defining tensor operations that read only sections of a tensor.

Range IndexVars

In order to support tensor slices, a new type of IndexVar will be added. This new RangeIndexVar will keep track of the index range that will be iterated upon. RangeIndexVar objects will be derived from regular IndexVar objects via the operator(), and will reference the same index (example below).

Range IndexExpr

Tensor slice operations are created via range index expressions. In order to create valid expressions, the range of any given index variable must have the same size throughout the expression. Here are some examples of how this can be used:

// Extracting a tensor slice. The dimensions of A must be 10x10.
A(i,j) = B(i(0,10), j(3,13));

// Index variable ranges must have same size.
M(i(0,5), j(0,10)) = N(i(5,10), j(3,13));

// Operating on tensor slices
x(i) = Y(i, j(10,20)) * z(j(4,14));

Hiding Compiler Methods

In order to make taco syntax more intuitive as described above, the explicit use of the following set of methods will become optional:

We call these Compiler Methods.

In the current API, the use of these methods is necessary to populate tensors and perform tensor computations. On the other hand, they can lead to very unintuitive and confusing code syntax and semantics, yet the way in which these methods must be used is very predefined. Therefore, the new API will abstract away the necessary calls to these methods.

These methods will still be available so interested users have the option to control the flow and timing of tensor computations if desired.

Delayed Computation Execution

Since compiler methods are still necessary for the correct functionality of the tool, they must be called at some point. In this section we discuss how hiding compiler methods is achieved.

When tensor operation (such as a(i) = B(i,j)*c(j)) is declared, the computation of the expression doesn't occur right away. Instead users have the option of adding scheduling commands to the operation to optimize it (such as adding workspaces to the operation). Because of this, the API cannot compile and compute te expression right away. Instead, the new API will implement a new delayed computation pattern in which tensor operations will happen only when needed.

In this pattern there are two events that could trigger a computation:

Additionally, tensors will keep track of their compilation state, so repeated calls to compiler methods o not produce errors (i.e. Calling compute multiple times would only compute once and return the previously computed result in subsequent calls).

With these two principles, users can write tensor programs without having to worry, or even think about the calls to compiler methods happening behind the scenes.

Related issues

24 #50 #85 #90 #95 #145 #167 #168 #179 #180 #181

DJDDDM commented 5 years ago

This looks really good.

However I have a few minor things:

  1. RangedIndex in cases like A(i,j) = B(i,k(0,10)) C(j,k(0,10)) I would prefer to be able to RangedIndex k(0,1) A(i,j) = B(i,k) C(j,k) to do the same. In longer equations it would give a problem having to write the same range multiple time.

  2. Inserting sparsity or removing values With better inserts can we also have a method to undo inserting? like A(3,3) = sparse or A(3,3).remove()

Given the currently plant API changes it should be fairly easy to do

  1. formatted printing
  2. elementwise equality checks

Will the more advanced features like functional programming and blocked tensors come in the next theoretically enhanced version? (I have seen a paper from your group about different tensor formats)

Will better error reports already come with the new API? (Or I guess most of the errors that exists now will not exist any longer)

How is the status with transpositions? From the outside these seem to be just an API problem.

I guess, once the new API exists functionality like eigenvalue decomposition and sparsing out values will be written by users. At that stage I will probably be able to help. Then we would have a decoupling of API, Theory and Functionality and people could easily contribute to Functionality.

fredrikbk commented 5 years ago

We will work on better error reports for the new version for sure. We’ve been working hard to clean up the theory and think we’re getting there.

The new version will have a hand-written transpose method, like you suggested in an earlier issue. We are however working on a theory for generating transpose code, even together with other expressions, but that will come later (and only of course if it works out).

We’re hoping to get the blocking into the next version (early summer) and I have good hope we can do it. I don’t think it requires changes to the machinery.

Finally, we’ll have improved support for formats probably in the next version with a third level format called Singleton so we can generate coordinates at least. More formats will follow.

CharlesJQuarra commented 2 years ago

Regarding delayed computation: Modifying a dependency of another tensor might require sophisticated versioning semantics. Is In-place modification really necessary? If you want to avoid deep copies of tensors on localized updates, Copy-on-Write versioning could avoid storage overhead, and avoid semantic mutability

To further clarify what I mean here: COW versioning on a sparse storage is just layering sparse storages on top of each other. An entry can exist on any of the storages of the storage stack, but for any given index, only the topmost value is considered current, conceptually like an order-aware version of UnionFS.

And there is always the possibility of squashing many elements within a stack into a single storage, if no one is keeping a reference to the corresponding versions (every version would need a refcount)