microsoft / knossos-ksc

Compiler with automatic differentiation

Implement copydown #364

Open awf opened 4 years ago

awf commented 4 years ago

Currently KSC's C++ backend uses a mark/release bump allocator. There is a global heap pointer, with implementations of alloc and free as follows:

char* heap; // Initialized in main() to point to some large chunk of memory.
void* alloc(int size) {
  void* ret = heap;
  heap += size;
  return ret;
}
void free(void*) {
  // Do nothing: memory is reclaimed in bulk by resetting the heap pointer.
}

It's super fast, obviously, but, just as obviously, it runs out of memory almost immediately when you have lots of large Tensors flying around. However, one small change makes it surprisingly effective: specializing scalar returns. I'll describe that, and then a few more ideas, viz. "copydown" and "destination passing style".

Scalar returns

Suppose we have a function called, say, loss, which does lots of computation, lots of allocation, and then returns a double.

double val = loss(some_stuff);

Because KS is referentially transparent, none of the allocations that happened in loss can possibly be needed again, so it's OK to just do this:

char* tmp = heap;  // Remember the heap pointer
double val = loss(some_stuff);
heap = tmp;  // Restore it.  [Don't worry about destructors: we don't have resource allocation in KS other than memory.]

Voila! This is actually pretty decent, assuming we occasionally observe these scalar-only returns. And in fact, it's already done in CGen, using some macros $MRK, $REL:

$MRK(tmp);
double val = loss(some_stuff);
$REL(tmp);

Copydown

But we may have a long-running loop where such a scalar return doesn't happen. E.g.

for(int i = 0; i < 100000; ++i) {
   void* tmp = heap;
   vec<double> val = compute(some_stuff);  // This vec<double> is stored somewhere in the heap, not necessarily the bottom or top
   // heap = tmp; // So we can't restore the heap pointer here
}

But note that the size of val may be much smaller than the difference heap - tmp after the computation -- the function may have made lots of allocs in order to compute val. So here's how we use copydown. The value of val is somewhere in the heap. If we could copy it down to just after tmp, then we could re-point the heap to just after val. If it's just a block of memory, a simple memmove (which, unlike memcpy, tolerates overlapping source and destination) will work:

for(int i = 0; i < 100000; ++i) {
  char* tmp = heap;
  vec<double> val = compute(some_stuff);  // This vec<double> is stored somewhere in the heap, not necessarily the bottom or top
  // if "val" were one contiguous blob, memmove handles the overlap well, even if val is 40 bytes, starting at tmp + 13
  memmove(tmp, &val, val.size_in_bytes());
  heap = tmp + val.size_in_bytes(); // Yay, much slower growth.
}

But there's a bit of fiddling needed as val is not one blob but a tree-like structure. Hence writing this issue.

Destination passing style

DPS is even better, sometimes. It's described here: DPS

simonpj commented 4 years ago

Yes to all of this. But

  • It's hard to know whether copydown is a win. I think DPS might be a better investment

  • copydown relies on val being freshly allocated. If it turns out to be an already-allocated thing, no need to copy!

  • For nested structures, copydown needs to recursively walk the structure.

dcrc2 commented 4 years ago

My main question is what to do about this kind of function, where a vector is returned which aliases a function argument:

(def choose (Vec Float) (( x : (Vec Float) ) ( y : (Vec Float) ))
     (if (first_is_better x y) x y))
(def slice (Vec (Vec Float))
        (( v : (Vec (Vec Float)))
         ( start : Integer )
         ( n : Integer ))
    (build n (lam (i : Integer)
        (index (add i start) v))))

How are such functions handled using DPS? Can we avoid introducing copies of vectors, where our current implementation involves no copying?

awf commented 4 years ago

Yes to all of this. But

  • It's hard to know whether copydown is a win. I think DPS might be a better investment

Yes, it might even feel necessary to specify in the source whether a value should be copydowned, which does open a can of worms. The optimizer should decide, depending on stuff.

  • copydown relies on val being freshly allocated. If it turns out to be an already-allocated thing, no need to copy!

I don't see this one -- do you mean it was passed in as per DC's question?

  • For nested structures, copydown needs to recursively walk the structure.

Yes, and it needs to do so carefully, to ensure it never clobbers existing data. A list A->B->C which lives just above the mark, but in the order A C B, may need to copy twice:

//Heap: | Blah |
//             ^
tmp = heap;
val = f();
//Heap: | Blah | A C .. B .. burble ...|
//                                     ^
val1 = deep_copy(val);  // copy up: a fresh, contiguous A B C at the top
//Heap: | Blah | A C .. B .. burble ...| A B C |
//                                             ^
heap = tmp;
//Heap: | Blah | A C .. B .. burble ...| A B C |
//             ^
// Now reading from beyond the top of the heap... Take care with threading
val = deep_copy(val1); // could be a more efficient memmove but super fiddly
//Heap: | Blah | A B C |
//                     ^
awf commented 4 years ago

My main question is what to do about this kind of function, where a vector is returned which aliases a function argument:

(def choose (Vec Float) (( x : (Vec Float) ) ( y : (Vec Float) ))
     (if (first_is_better x y) x y))
(def slice (Vec (Vec Float))
        (( v : (Vec (Vec Float)))
         ( start : Integer )
         ( n : Integer ))
    (build n (lam (i : Integer)
        (index (add i start) v))))

How are such functions handled using DPS? Can we avoid introducing copies of vectors, where our current implementation involves no copying?

[WRONG SEE BELOW: So the short answer is we should be copying today -- are we not?] I know there's a lot of passing of vec without copying, and of course all the inputs are in the caller's heap, so returning aliases feels OK, but I feel there's something here I used to think was fine, and now might not be...

Could you make a nasty example that breaks badly today?

dcrc2 commented 4 years ago

So the short answer is we should be copying today -- are we not?

No, we aren't. I'd be interested to hear the longer answer :)

Could you make a nasty example that breaks badly today?

I can't think of a way to break it. The only mutation that takes place is in sumbuild, where a deep copy is made first. So the only possible problem seems to be if we were to release the memory for the arguments too early. But we're being very conservative about when to release memory at the moment, so I suspect there's no way to make this break.

simonpj commented 4 years ago

I don't see this one -- do you mean it was passed in as per DC's question?

Yes, exactly. It was passed in -- or was a sub-component of something passed in

awf commented 4 years ago

I can't think of a way to break it. The only mutation that takes place is in sumbuild, where a deep copy is made first. So the only possible problem seems to be if we were to release the memory for the arguments too early. But we're being very conservative about when to release memory at the moment, so I suspect there's no way to make this break.

Ah, so that is the long answer -- I retract my "short answer" :)

dcrc2 commented 4 years ago

It seems clear that DPS is the way to go when the size can be computed efficiently. I've been wondering about a couple of ways of dealing with the situation where precomputing the size is not feasible:

  1. Size prediction: Guess that the value returned by a function call has (at most) the same size as it did the last time. If the guess turns out to be wrong, resort to a more expensive approach (copydown, or just use malloc?). If a function is only called once, the expense probably doesn't matter; if it's called many times then the prediction will likely be correct...

  2. Two bump-allocator buffers: Split the allocated memory into buffer A and buffer B. Similar to DPS, each function gains an extra parameter for the destination, but here the destination is a buffer rather than just an address. If a function is asked to place its result in buffer A, then it can use buffer B for local allocation (in which case it should clean it up before returning). Similarly, in the implementation of a build/sumbuild, if the result of the build/sumbuild is needed on buffer B, then computations local to each iteration can be put on buffer A and the memory reclaimed after each iteration.

awf commented 4 years ago
  1. Size prediction: Guess that the value returned by a function call has (at most) the same size as it did the last time. If the guess turns out to be wrong, resort to a more expensive approach (copydown, or just use malloc?). If a function is only called once, the expense probably doesn't matter; if it's called many times then the prediction will likely be correct...

I really don't like the statefulness of recording previous calls - it makes analysis of runtime and memory use very difficult. The concrete example I would have in mind here is filter - imagine this code

def f(xs):
    return [x for x in xs if x > 0]

So, it is of course possible to compute the size in roughly the same time as the function takes to run - maximum 2x speed overhead. With DPS, we could either pay the nearly 2x, or we could just pay in excess memory usage. The tradeoff depends on the branch probability (which I believe should be specified in the IR, even if it's determined empirically, but that's a separate discussion).

awf commented 4 years ago

Two bump-allocator buffers: Split the allocated memory into buffer A and buffer B. Similar to DPS, each function gains an extra parameter for the destination, but here the destination is a buffer rather than just an address. If a function is asked to place its result in buffer A, then it can use buffer B for local allocation (in which case it should clean it up before returning). Similarly, in the implementation of a build/sumbuild, if the result of the build/sumbuild is needed on buffer B, then computations local to each iteration can be put on buffer A and the memory reclaimed after each iteration.

Interesting. This does seem to work even if f calls g calls h, because h's effect on buffer A should be unobservable. Does it work for recursion (assuming the buffer flip is runtime-encoded, which I am OK with :))?

awf commented 4 years ago

Note DPS is not always a win, and "OTP" says it must therefore be expressible in KS syntax, so..

(rule "dps" 
   ((f : Lambda S T)
    (s : S))
   (f s)
   (let (space (T_construct_uninitialized (f$size s)))
        (f$dps space s)))
toelli-msft commented 4 years ago

What does OTP mean?

awf commented 4 years ago

What does OTP mean?

400