microsoft / knossos-ksc

Compiler with automatic differentiation

Implement destination passing style (DPS) #390

Open awf opened 4 years ago

awf commented 4 years ago

See link at #364

Or, doing shapes. Principles:

(def foos (Vec (Vec (Tuple String Float))) ()
    (build 7 (i -> (build i (j -> (tuple "fred" (to_float (+ i j))))))))

(def shape$foos (Vec (Vec (Tuple Integer))) ()
    (shape (build 7 (i -> (build i (j -> (tuple "fred" (to_float (+ i j)))))))))

(rule ((a : Vec 't)) (shape a)
    (build (size a) (lam (i : Integer) (shape (index i a)))))

(rule ((body : 't)) (shape (build e (lam (i : Integer) body)))
    (build e (lam (i : Integer) (shape body))))
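For intuition, the recursive behaviour of these rules can be sketched in Python (a hypothetical model, not the ksc implementation; it assumes, as decided later in this thread, that the shape of a scalar is the empty tuple):

```python
def shape(x):
    # First rule above: the shape of a vector is the vector of its
    # elements' shapes (build/index spelled as a list comprehension).
    # Base case: the shape of a scalar is the empty tuple.
    if isinstance(x, list):
        return [shape(e) for e in x]
    return ()
```

So the jagged value built by foos has a shape that is itself a jagged structure of the same sizes, e.g. shape([[1.0], [2.0, 3.0]]) gives [[()], [(), ()]].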


Let's not do sizes, but use these to indicate optimization...

(def bytes$foo Integer ((s : S)) (bytes (foo s))) ;; has no effect on heap, as it returns a scalar


Stage 2. But expensive... try these rewrites:

(bytes (build n (i -> b))) -> (sumbuild n (i -> (bytes b)))
(bytes (g xs)) -> (bytes$g xs)
(sumbuild n (i -> const)) -> (mul n const) ;; already in
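As a sanity check, the first and last rewrites can be modeled in Python, treating bytes as a recursive sum over elements (scalar size is an assumed constant, not something the source fixes):

```python
SCALAR_BYTES = 8  # assumption: every scalar occupies 8 bytes

def bytes_of(x):
    # (bytes (build n ...)) -> (sumbuild n (bytes ...)): the bytes of
    # a vector is the sum of the bytes of its elements.
    if isinstance(x, list):
        return sum(bytes_of(e) for e in x)
    return SCALAR_BYTES

# (sumbuild n (i -> const)) -> (mul n const): summing a constant
# n times is just multiplication.
n, const = 7, SCALAR_BYTES
assert sum(const for _ in range(n)) == n * const
```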

dcrc2 commented 4 years ago

Current definition of shape:

The idea behind encoding sizeof(Type) in the shape is that we may be able to evaluate bytes given only the shape of an object, without even knowing its original type, i.e. (bytes e) = (bytesFromShape (shape e)).
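A minimal Python sketch of that property, under the assumption that a shape is either () (for a scalar) or a list of element shapes:

```python
SCALAR_BYTES = 8  # assumption: every scalar occupies 8 bytes

def shape(x):
    # Shape of a vector is the vector of element shapes; scalar -> ().
    return [shape(e) for e in x] if isinstance(x, list) else ()

def bytes_of(x):
    # Bytes computed from the value itself.
    return sum(bytes_of(e) for e in x) if isinstance(x, list) else SCALAR_BYTES

def bytes_from_shape(s):
    # Bytes computed from the shape alone, never touching the value.
    return SCALAR_BYTES if s == () else sum(bytes_from_shape(e) for e in s)
```

For any nested-list value e of scalars, bytes_of(e) == bytes_from_shape(shape(e)), which is the property being discussed.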

But I don't think it's clear yet whether we'll be able to maintain this property. Maybe we'll end up needing different versions of bytesFromShape depending on the type of e.

One slight complication: sizeof(tuple&lt;Ts...&gt;) is not necessarily equal to sum(sizeof(Ts)...), because of padding; indeed the precise value may depend on the C++ compiler/library implementation. If we want to avoid a dependency on knowledge of the C++ compiler, we will have to make quite conservative assumptions about padding, which may mean overallocating slightly.
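The padding point can be demonstrated with Python's ctypes, using a struct holding a char and a double as a stand-in for tuple&lt;char, double&gt;:

```python
import ctypes

class CharDouble(ctypes.Structure):
    # Fields total 1 + 8 = 9 bytes, but the double's alignment
    # requirement forces padding after the char on typical platforms.
    _fields_ = [("c", ctypes.c_char), ("d", ctypes.c_double)]

print(ctypes.sizeof(CharDouble))  # 16 on common 64-bit ABIs, not 9
```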

dcrc2 commented 4 years ago

How would we define the shape of a linear map?

Two difficulties:

(For the moment I'm ducking this issue by not defining shape$ for functions which involve linear maps.)

awf commented 4 years ago

Yep, ducking is correct at the moment.

dcrc2 commented 4 years ago

PR #411 contains the current specification of shape$ functions.

Changed from the original plan: the shape of a Float/Integer/Bool/String is now (tuple).

dcrc2 commented 4 years ago

So, there are some important differences here compared to the DPS paper.

The DPS paper makes two assumptions which the version here does not make:

Reasons for not wanting to make these assumptions:

But dropping these assumptions has some awkward consequences:

If we're using this for DPS, we seem to have a problem if evaluating a shape may allocate memory. If the idea is to evaluate shape$f before we evaluate f, won't we also need to evaluate shape$shape$f, shape$shape$shape$f, and so on? With our current definition of shape, this sequence doesn't terminate! (Maybe instead we should define the shape of a Vec T to be its size when T is scalar: then the sequence would eventually terminate in a scalar.)

A complication when thinking about this is that we don't actually intend to call shape$f as such: instead we're planning to inline it into bytes$f (to find out how much memory to allocate), which does return a scalar. But I don't think the problem goes away here: we're just pushing it down into the inlined expression. If there is a build involved somewhere, we're going to have to fall back to a non-DPS implementation at some point or we'll end up with an infinite recursion.

(Edited: actually, no, perhaps not infinite. But I think it may involve an arbitrarily long recursion, with one level for each build in the expression that you are evaluating?)

Use of constVec may help to some extent. In the usual case where the elements of a vector do all have the same shape (and in particular for Vec Scalar), the shape will be a constVec. We can potentially change Cgen so that constVec objects are treated specially and do not require memory from the allocator.
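A sketch of that special treatment (hypothetical Python names; the real change would be in Cgen's generated C++):

```python
class ConstVec:
    # A vector whose elements are all equal, stored as (size, element):
    # no per-element memory is requested from the allocator.
    def __init__(self, n, elem):
        self.n = n
        self.elem = elem

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        if not 0 <= i < self.n:
            raise IndexError(i)
        return self.elem
```

The shape of a Vec Float of length 100 would then be ConstVec(100, ()), taking O(1) rather than O(100) space.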

Even with special treatment for constVec, it seems there will always be examples where we don't want to use DPS. Ultimately, RLO can decide whether DPS is to be used. But for the moment, we'll presumably need some heuristic for when Cgen will use DPS. How about, DPS will be used iff evaluating the shape of the expression does not involve allocating memory? (This is something that Cgen can easily determine, and we can build in the assumption that a constVec does not involve allocation.)

awf commented 4 years ago

We can certainly say we won't use DPS to call shape$ -- just use the allocator. We can release all the memory after we compute size/cost, even if we didn't inline the shape call.

So I think we can have a whole bunch of heuristics about when to use DPS -- maybe even did the shape function optimize to trivial? Etc etc.

awf commented 4 years ago

Oh, and yes, jagged tensors are an important case (think of sentences going through BERT).

dcrc2 commented 4 years ago

We can certainly say we won't use DPS to call shape$ -- just use the allocator.

Oh, yes, I'd missed that! It's pointless using DPS to call shape$: we don't care where the result is stored as we're going to throw it away anyway!

I think the question I should have been asking is, if the implementation of shape$f involves calling a function f2, then should that call use DPS? I expect the answer should be, we ignore the fact that we're implementing a shape function here, and just decide whether to use DPS in the same way that we do for ordinary functions.

But if we always use DPS, including in the implementation of shape functions, then in theory we could end up with an exponential increase in runtime: in the worst case, shape$f2 is approximately as expensive as f2, so a DPS call to f involves

i.e. about four times as expensive as our current implementation. And further, if f2 calls f3, which calls f4, and so on, then in the worst case there will be 2^n calls to fn or shape$fn.
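The doubling can be modeled directly: if, in the worst case, every call to f_k or shape$f_k makes one call to shape$f_{k+1} and one to f_{k+1}, then the number of calls doubles at each level (a toy Python model, not the compiler's behaviour):

```python
def count_calls(k, n, counter):
    # Count every call to f_k-or-shape$f_k. Worst case assumed here:
    # each such call makes one call to shape$f_{k+1} and one to f_{k+1}.
    counter[k] = counter.get(k, 0) + 1
    if k < n:
        count_calls(k + 1, n, counter)  # the shape$f_{k+1} call
        count_calls(k + 1, n, counter)  # the f_{k+1} call itself

counter = {}
count_calls(1, 5, counter)
print(counter)  # {1: 1, 2: 2, 3: 4, 4: 8, 5: 16}
```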

We're hoping that shape$fm depends only on shape$f{m+1} and not on f{m+1} itself, but because we're allowing shapes to depend on runtime values, that can't be guaranteed.

I think this is why I'm a bit scared to do without any heuristic for when to use DPS. (The other part is, since we're generating shape$f by repeatedly inlining any calls to shape$f2 and so on, there's a similar potential for exponential increase in code size.)

awf commented 4 years ago

One option is to use DPS rarely, perhaps even under a directive like $inline?

awf commented 4 years ago

So just to note: we currently have jagged matrices in GMM:

(def mul$R$VecR (Vec Float) ((r : Float) (a : Vec Float))
    (build (size a) (lam (i : Integer) (mul r (index i a)))))

(def mul$R$VecVecR (Vec (Vec Float)) ((r : Float) (a : Vec (Vec Float)))
    (build (size a) (lam (i : Integer) (mul$R$VecR r (index i a)))))

So the correct shape functions for these are:

(def shape$mul$R$VecR (Vec (Tuple)) ((r : Float) (a : Vec Float))
    (constVec (size a) (tuple)))

(def shape$mul$R$VecVecR (Vec (Vec (Tuple))) ((r : Float) (a : Vec (Vec Float)))
    (build (size a) (lam (i : Integer) (constVec (size (index i a)) (tuple)))))

The latter does indeed build a vector of sizes, and if it were used to e.g. prepare a 100x100 tensor of zeros, it would incur an additional 100x1 memory cost to do so. This will of course go away for non-jagged tensors after #274, but it's worth noticing that for large jagged tensors, it's often a fairly negligible cost.

And the corresponding size functions will be

(def size$mul$R$VecR Integer ((r : Float) (a : Vec Float))
    (mul (size a) (sizeof$Float)))

(def size$mul$R$VecVecR Integer ((r : Float) (a : Vec (Vec Float)))
    (sumbuild (size a) (lam (i : Integer) (mul (size (index i a)) (sizeof$Float)))))

which of course avoid the allocation.
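As a concrete check, here is the jagged case in Python, with nested lists standing in for Vec (Vec Float) and sizeof$Float assumed to be 8:

```python
SIZEOF_FLOAT = 8  # assumption: Float occupies 8 bytes

def mul_r_vecvecr(r, a):
    # mul$R$VecVecR: scale every element of a jagged matrix.
    return [[r * x for x in row] for row in a]

def size_mul_r_vecvecr(r, a):
    # size$mul$R$VecVecR: sumbuild over rows of size(row) * sizeof(Float);
    # no intermediate vector of sizes is allocated.
    return sum(len(row) * SIZEOF_FLOAT for row in a)

def bytes_of(x):
    # Bytes of the actual result, for comparison.
    return sum(bytes_of(e) for e in x) if isinstance(x, list) else SIZEOF_FLOAT

a = [[1.0], [2.0, 3.0], []]  # a jagged matrix
assert size_mul_r_vecvecr(0.5, a) == bytes_of(mul_r_vecvecr(0.5, a)) == 24
```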