rainwoodman / sharedmem

A different flavor of multiprocessing in Python
GNU General Public License v3.0

Data Copying #12

Closed · ianwhale closed this issue 6 years ago

ianwhale commented 6 years ago

I'm writing a feature selection program and I'd like to keep the large data matrix in shared memory and only read from it. All I need to do is select some columns from the matrix and then do some work on them. I'm fine with the subset of the matrix being copied, but the whole matrix should stay put.

Here's a toy example; data.npy is a matrix of ints, roughly 7000 x 50000.

import sharedmem
import numpy as np

# Load the full matrix and place it in shared memory.
d = np.load("data.npy")
d = sharedmem.copy(d)

# Generate some random index lists (1000 column indices each).
index_list = [np.random.randint(0, 50000, 1000) for _ in range(10)]

with sharedmem.MapReduce(np=4) as pool:
    def work(indices):
        subset = d[:, indices]  # Select the 1000 columns specified by the index list.
        return 0

    pool.map(work, index_list)
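
As a quick sanity check on the indexing itself (a sketch, not part of the original program; it assumes the int32 dtype implied by the 1.4 GB figure), fancy indexing returns a fresh private array that holds only the selected columns:

import numpy as np
import sharedmem

# Stand-in for data.npy: 7000 x 50000 int32 is ~1.4 GB.
d = sharedmem.copy(np.zeros((7000, 50000), dtype=np.int32))
indices = np.random.randint(0, 50000, 1000)

subset = d[:, indices]              # fancy indexing always allocates a new array
print(np.shares_memory(d, subset))  # False: the subset is a private copy
print(subset.nbytes / 1e6)          # ~28 MB, not the full 1.4 GB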

The matrix takes up about 1.4 GB in memory. From my top output, it looks as though the matrix is being copied into every worker process. This might just come down to something with reference counters that is unavoidable when I do the subsetting, but I want to be sure. Let me know if something is unclear.

rainwoodman commented 6 years ago

I remember that on Linux, RSS counts shared memory segments the process has touched. There is probably a way to subtract out the shared segments. See, e.g.

https://unix.stackexchange.com/questions/254752/determine-the-actual-memory-usage-of-several-processes-that-share-a-large-memory

The true memory usage is RSS - SHR in this case -- do you also see a large SHR in your test?
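
If you want to see the split per worker rather than eyeballing top, something along these lines should work (a rough sketch; it assumes Linux and the third-party psutil package, neither of which sharedmem itself requires):

import os
import numpy as np
import psutil
import sharedmem

# Stand-in for the real matrix: ~1.4 GB of int32 in shared memory.
d = sharedmem.copy(np.zeros((7000, 50000), dtype=np.int32))
index_list = [np.random.randint(0, 50000, 1000) for _ in range(10)]

with sharedmem.MapReduce(np=4) as pool:
    def work(indices):
        subset = d[:, indices]  # private ~28 MB copy per call
        mem = psutil.Process(os.getpid()).memory_info()
        # On Linux psutil exposes a 'shared' field; RSS minus shared is a rough
        # estimate of the memory this worker actually owns privately.
        return os.getpid(), mem.rss, mem.shared

    for pid, rss, shared in pool.map(work, index_list):
        print("pid %d: rss %.0f MB, shared %.0f MB, private ~%.0f MB"
              % (pid, rss / 1e6, shared / 1e6, (rss - shared) / 1e6))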

ianwhale commented 6 years ago

Ah, my apologies. That Stack Exchange question describes the same situation as mine. The processes are indeed all using the shared memory: http://i.imgur.com/SDbiGBI.png

Thanks for your time and the information!