microsoft / knossos-ksc

Compiler with automatic differentiation

Implement destination passing style (DPS) #390

Open awf opened 4 years ago

awf commented 4 years ago

See link at #364

Or, doing shapes. Principles:

(def foos (Vec (Vec (Tuple String Float))) ()
    (build 7 (i -> (build i (j -> (tuple "fred" (to_float (+ i j))))))))

(def shape$foos (Vec (Vec (Tuple Integer))) ()
    (shape (build 7 (i -> (build i (j -> (tuple "fred" (to_float (+ i j)))))))))

(rule ((a : Vec 't)) (shape a)
    (build (size a) (lam (i : Integer) (shape (index i a)))))

(rule ((body : 't)) (shape (build e (lam (i : Integer) body)))
    (build e (lam (i : Integer) (shape body))))
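For intuition, the recursive behaviour of these rules can be sketched in Python (a hypothetical model, not the ksc implementation; it assumes, as decided later in this thread, that the shape of a scalar is the empty tuple):

```python
def shape(x):
    # First rule above: the shape of a vector is the vector of its
    # elements' shapes (build/index spelled as a list comprehension).
    # Base case: the shape of a scalar is the empty tuple.
    if isinstance(x, list):
        return [shape(e) for e in x]
    return ()
```

So the jagged value built by foos has a shape that is itself a jagged structure of the same sizes, e.g. shape([[1.0], [2.0, 3.0]]) gives [[()], [(), ()]].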


Let's not do sizes, but use these to indicate optimization...

(def bytes$foo Integer ((s : S)) (bytes (foo s))) ;; has no effect on heap, as it returns a scalar


Stage 2. But expensive... try these rewrites:

(bytes (build n (i -> b))) -> (sumbuild n (i -> (bytes b)))
(bytes (g xs)) -> (bytes$g xs)
(sumbuild n (i -> const)) -> (mul n const) ;; already in
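As a sanity check, the first and last rewrites can be modeled in Python, treating bytes as a recursive sum over elements (scalar size is an assumed constant, not something the source fixes):

```python
SCALAR_BYTES = 8  # assumption: every scalar occupies 8 bytes

def bytes_of(x):
    # (bytes (build n ...)) -> (sumbuild n (bytes ...)): the bytes of
    # a vector is the sum of the bytes of its elements.
    if isinstance(x, list):
        return sum(bytes_of(e) for e in x)
    return SCALAR_BYTES

# (sumbuild n (i -> const)) -> (mul n const): summing a constant
# n times is just multiplication.
n, const = 7, SCALAR_BYTES
assert sum(const for _ in range(n)) == n * const
```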

dcrc2 commented 4 years ago

Current definition of shape:

The idea behind encoding sizeof(Type) in the shape is that we may be able to evaluate bytes given only the shape of an object, without even knowing its original type, i.e. (bytes e) = (bytesFromShape (shape e)).
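A minimal Python sketch of that property, under the assumption that a shape is either () (for a scalar) or a list of element shapes:

```python
SCALAR_BYTES = 8  # assumption: every scalar occupies 8 bytes

def shape(x):
    # Shape of a vector is the vector of element shapes; scalar -> ().
    return [shape(e) for e in x] if isinstance(x, list) else ()

def bytes_of(x):
    # Bytes computed from the value itself.
    return sum(bytes_of(e) for e in x) if isinstance(x, list) else SCALAR_BYTES

def bytes_from_shape(s):
    # Bytes computed from the shape alone, never touching the value.
    return SCALAR_BYTES if s == () else sum(bytes_from_shape(e) for e in s)
```

For any nested-list value e of scalars, bytes_of(e) == bytes_from_shape(shape(e)), which is the property being discussed.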

But I don't think it's clear yet whether we'll be able to maintain this property. Maybe we'll end up needing different versions of bytesFromShape depending on the type of e.

One slight complication: sizeof(tuple&lt;Ts...&gt;) is not necessarily equal to sum(sizeof(Ts)...), because of padding; indeed the precise value may depend on the C++ compiler/library implementation. If we want to avoid a dependency on knowledge of the C++ compiler, we will have to make quite conservative assumptions about padding, which may mean overallocating slightly.
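The padding point can be demonstrated with Python's ctypes, using a struct holding a char and a double as a stand-in for tuple&lt;char, double&gt;:

```python
import ctypes

class CharDouble(ctypes.Structure):
    # Fields total 1 + 8 = 9 bytes, but the double's alignment
    # requirement forces padding after the char on typical platforms.
    _fields_ = [("c", ctypes.c_char), ("d", ctypes.c_double)]

print(ctypes.sizeof(CharDouble))  # 16 on common 64-bit ABIs, not 9
```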

dcrc2 commented 4 years ago

How would we define the shape of a linear map?

Two difficulties:

(For the moment I'm ducking this issue by not defining shape$ for functions which involve linear maps.)

awf commented 4 years ago

Yep, ducking is correct at the moment.

dcrc2 commented 4 years ago

PR #411 contains the current specification of shape$ functions.

Changed from the original plan: the shape of a Float/Integer/Bool/String is now (tuple).

dcrc2 commented 4 years ago

So, there are some important differences here compared to the DPS paper.

The DPS paper makes two assumptions which the version here does not make:

Reasons for not wanting to make these assumptions:

But dropping these assumptions has some awkward consequences:

If we're using this for DPS, we seem to have a problem if evaluating a shape may allocate memory. If the idea is to evaluate shape$f before we evaluate f, won't we also need to evaluate shape$shape$f, shape$shape$shape$f, and so on? With our current definition of shape, this sequence doesn't terminate! (Maybe instead we should define the shape of a Vec T to be its size when T is scalar: then the sequence would eventually terminate in a scalar.)

A complication when thinking about this is that we don't actually intend to call shape$f as such: instead we're planning to inline it into bytes$f (to find out how much memory to allocate), which does return a scalar. But I don't think the problem goes away here: we're just pushing it down into the inlined expression. If there is a build involved somewhere, we're going to have to fall back to a non-DPS implementation at some point or we'll end up with an infinite recursion.

(Edited: actually, no, perhaps not infinite. But I think it may involve an arbitrarily long recursion, with one level for each build in the expression that you are evaluating?)

Use of constVec may help to some extent. In the usual case where the elements of a vector do all have the same shape (and in particular for Vec Scalar), the shape will be a constVec. We can potentially change Cgen so that constVec objects are treated specially and do not require memory from the allocator.
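A sketch of that special treatment (hypothetical Python names; the real change would be in Cgen's generated C++):

```python
class ConstVec:
    # A vector whose elements are all equal, stored as (size, element):
    # no per-element memory is requested from the allocator.
    def __init__(self, n, elem):
        self.n = n
        self.elem = elem

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        if not 0 <= i < self.n:
            raise IndexError(i)
        return self.elem
```

The shape of a Vec Float of length 100 would then be ConstVec(100, ()), taking O(1) rather than O(100) space.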

Even with special treatment for constVec, it seems there will always be examples where we don't want to use DPS. Ultimately, RLO can decide whether DPS is to be used. But for the moment, we'll presumably need some heuristic for when Cgen will use DPS. How about, DPS will be used iff evaluating the shape of the expression does not involve allocating memory? (This is something that Cgen can easily determine, and we can build in the assumption that a constVec does not involve allocation.)

awf commented 4 years ago

We can certainly say we won't use DPS to call shape$ -- just use the allocator. We can release all the memory after we compute size/cost, even if we didn't inline the shape call.

So I think we can have a whole bunch of heuristics about when to use DPS -- maybe even did the shape function optimize to trivial? Etc etc.

awf commented 4 years ago

Oh, and yes, jagged tensors are an important case (think of sentences going through BERT).

dcrc2 commented 4 years ago

We can certainly say we won't use DPS to call shape$ -- just use the allocator.

Oh, yes, I'd missed that! It's pointless using DPS to call shape$: we don't care where the result is stored as we're going to throw it away anyway!

I think the question I should have been asking is, if the implementation of shape$f involves calling a function f2, then should that call use DPS? I expect the answer should be, we ignore the fact that we're implementing a shape function here, and just decide whether to use DPS in the same way that we do for ordinary functions.

But if we always use DPS, including in the implementation of shape functions, then in theory we could end up with an exponential increase in runtime: in the worst case, shape$f2 is approximately as expensive as f2, so a DPS call to f involves

i.e. about four times as expensive as our current implementation. And further, if f2 calls f3, which calls f4, and so on, then in the worst case there will be 2^n calls to fn or shape$fn.
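The doubling can be modeled directly: if, in the worst case, every call to f_k or shape$f_k makes one call to shape$f_{k+1} and one to f_{k+1}, then the number of calls doubles at each level (a toy Python model, not the compiler's behaviour):

```python
def count_calls(k, n, counter):
    # Count every call to f_k-or-shape$f_k. Worst case assumed here:
    # each such call makes one call to shape$f_{k+1} and one to f_{k+1}.
    counter[k] = counter.get(k, 0) + 1
    if k < n:
        count_calls(k + 1, n, counter)  # the shape$f_{k+1} call
        count_calls(k + 1, n, counter)  # the f_{k+1} call itself

counter = {}
count_calls(1, 5, counter)
print(counter)  # {1: 1, 2: 2, 3: 4, 4: 8, 5: 16}
```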

We're hoping that shape$fm depends only on shape$f{m+1} and not on f{m+1} itself, but because we're allowing shapes to depend on runtime values, that can't be guaranteed.

I think this is why I'm a bit scared to do without any heuristic for when to use DPS. (The other part is, since we're generating shape$f by repeatedly inlining any calls to shape$f2 and so on, there's a similar potential for exponential increase in code size.)

awf commented 4 years ago

One option is to use DPS rarely, perhaps even under a directive like $inline?

awf commented 4 years ago

So just to note: we currently have jagged matrices in GMM:

(def mul$R$VecR (Vec Float) ((r : Float) (a : Vec Float))
    (build (size a) (lam (i : Integer) (mul r (index i a)))))

(def mul$R$VecVecR (Vec (Vec Float)) ((r : Float) (a : Vec (Vec Float)))
    (build (size a) (lam (i : Integer) (mul$R$VecR r (index i a)))))

So the correct shape functions for these are:

(def shape$mul$R$VecR (Vec (Tuple)) ((r : Float) (a : Vec Float))
    (constVec (size a) (tuple)))

(def shape$mul$R$VecVecR (Vec (Vec (Tuple))) ((r : Float) (a : Vec (Vec Float)))
    (build (size a) (lam (i : Integer) (constVec (size (index i a)) (tuple)))))

The latter does indeed build a vector of sizes, and if it were used to e.g. prepare a 100x100 tensor of zeros, it would incur an additional 100x1 memory cost to do so. This will of course go away for non-jagged tensors after #274, but it's worth noticing that for large jagged tensors, it's often a fairly negligible cost.

And the corresponding size functions will be

(def size$mul$R$VecR Integer ((r : Float) (a : Vec Float))
    (mul (size a) (sizeof$Float)))

(def size$mul$R$VecVecR Integer ((r : Float) (a : Vec (Vec Float)))
    (sumbuild (size a) (lam (i : Integer) (mul (size (index i a)) (sizeof$Float)))))

which of course avoid the allocation.
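As a concrete check, here is the jagged case in Python, with nested lists standing in for Vec (Vec Float) and sizeof$Float assumed to be 8:

```python
SIZEOF_FLOAT = 8  # assumption: Float occupies 8 bytes

def mul_r_vecvecr(r, a):
    # mul$R$VecVecR: scale every element of a jagged matrix.
    return [[r * x for x in row] for row in a]

def size_mul_r_vecvecr(r, a):
    # size$mul$R$VecVecR: sumbuild over rows of size(row) * sizeof(Float);
    # no intermediate vector of sizes is allocated.
    return sum(len(row) * SIZEOF_FLOAT for row in a)

def bytes_of(x):
    # Bytes of the actual result, for comparison.
    return sum(bytes_of(e) for e in x) if isinstance(x, list) else SIZEOF_FLOAT

a = [[1.0], [2.0, 3.0], []]  # a jagged matrix
assert size_mul_r_vecvecr(0.5, a) == bytes_of(mul_r_vecvecr(0.5, a)) == 24
```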