ut-parla / parla-experimental


PArray's CPU buffer memory leak #137

Open nicelhc13 opened 1 year ago


A CPU-side memory leak (I haven't seen a GPU memory leak yet) was uncovered while I was testing PArray::evict().

Below is the pseudocode:

import numpy as np

# Preallocate 20 batches of 512 float32 arrays (~8 MB each) on the CPU.
data_list = []
for i in range(20):
  subdata_list = []
  for j in range(512):
    data = np.zeros([1000000, 2], dtype=np.float32)
    subdata_list.append(data)
  data_list.append(subdata_list)

# Cycle through the preallocated batches; num_iters is the total
# number of iterations of the task graph.
for i in range(num_iters):
  data_idx = i % 20
  execute_task_graph_with_eager_data(data_list[data_idx])

In the above code, around iteration 10 or 11, the eviction manager starts PArray eviction. If we print the memory consumption percentage reported by psutil before and after evict(), like this:

print("before:", psutil.virtual_memory()[2])
evictable_parray.evict()
print("after:", psutil.virtual_memory()[2])

the output shows that consumption after evict() is slightly higher than before it. If you keep running this program, around iteration 15 it crashes with an out-of-memory error on the CPU side; this is not a GPU-side OOM.
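As a side note, a more direct way to watch this than the virtual-memory percentage is the absolute RSS of the process; here is a minimal sketch with psutil (evictable_parray stands in for the same object as above):

import psutil

proc = psutil.Process()  # the current (Parla) process

rss_before = proc.memory_info().rss  # resident set size, in bytes
evictable_parray.evict()
rss_after = proc.memory_info().rss

# Across iterations, a steadily positive delta here is the CPU-side leak.
print("delta (MiB):", (rss_after - rss_before) / 2**20)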

Below is what I have found.

  1. evict() itself is not the problem. evict() calls cp.asnumpy() to move a CuPy array back to a NumPy array; in this case the destination NumPy array is sometimes None, so a new array is allocated, which is why memory consumption increases (see the sketch after this list).
  2. However, item 1 is not the main problem. As the pseudocode shows, the data is preallocated on the CPU side, so regardless of whether the data was moved to the GPU and back, the max RSS should not exceed the preallocated data size.
  3. In practice, the later max RSS exceeds the initial max RSS. This is a true leak.
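Regarding item 1: if evict() kept the original CPU buffer alive, the device-to-host copy could reuse it via the out= argument of cp.asnumpy() and no new host memory would be allocated. A minimal sketch, with hypothetical cpu_buf/gpu_buf handles:

import cupy as cp
import numpy as np

cpu_buf = np.zeros([1000000, 2], dtype=np.float32)  # preallocated host buffer
gpu_buf = cp.asarray(cpu_buf)                       # device copy

# ... GPU tasks update gpu_buf ...

# Copy back into the existing host buffer: no fresh numpy allocation.
# When the destination is None (i.e., the buffer was dropped), cp.asnumpy
# allocates a new array instead, matching the growth described in item 1.
cp.asnumpy(gpu_buf, out=cpu_buf)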

I tried the following:

  1. Replaced cp.asnumpy() with ndarray.get(), but it didn't resolve the issue.
  2. Disabled CuPy's memory pool and pinned memory pool (see the snippet after this list). It also didn't work.
  3. Tried to reproduce the error with toy examples to see whether the leak is on the NumPy or CuPy side, but failed to reproduce it.
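For item 2, disabling the allocators means roughly the following, using CuPy's standard API for turning off its memory pools:

import cupy as cp

# Route device allocations straight to cudaMalloc and pinned-host
# allocations straight to cudaHostAlloc, bypassing CuPy's caching pools.
cp.cuda.set_allocator(None)
cp.cuda.set_pinned_memory_allocator(None)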

My current hypothesis is that the problem is on our side, most likely in the PArray runtime: the Parla runtime never touches PArray::_buffer directly, only PArray itself does. PArray might not be releasing the invalidated NumPy array after a GPU device updates the data.
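To illustrate the suspected failure mode, here is a purely hypothetical sketch (the buffer dict is a stand-in, not the real PArray internals): if invalidation only flips a coherence flag and never drops the reference held in _buffer, the old host array stays reachable and the garbage collector can never reclaim it.

import gc
import numpy as np

# Hypothetical stand-in for PArray::_buffer: one slot per device.
buffer = {"cpu": np.zeros([1000000, 2], dtype=np.float32), "gpu:0": None}

stale = buffer["cpu"]

# Suspected bug: invalidation marks the CPU copy invalid but never does
#   buffer["cpu"] = None
# so the ~8 MB host array above stays reachable after every GPU update.

# gc.get_referrers() can confirm which container still holds the array:
for ref in gc.get_referrers(stale):
    if ref is buffer:
        print("the buffer dict still references the invalidated array")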