sdiehl / numpush

Shared Memory Numpy (Deprecated, see https://github.com/ContinuumIO/blaze)
MIT License

blosc.so is not created during setup process #1

Open gerigk opened 12 years ago

gerigk commented 12 years ago

The Makefile within /include/blosc isn't executed as part of setup.py, so I received a "blosc.so missing" error. Running make within that folder and then running setup.py results in a successful installation.
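Concretely, the steps that worked for me were roughly the following (run from the repository root; `python setup.py install` is my reading of the install step):

```
cd include/blosc
make
cd ../..
python setup.py install
```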

sdiehl commented 12 years ago

You are correct, the build does fail in this case.

Right now this project is more of a grab-bag of one-off modules and functions than a coherent standalone library.

sdiehl commented 12 years ago

Right now my energy is being directed toward internal ContinuumIO projects which hopefully should make this library unnecessary in the future.

gerigk commented 12 years ago

I'm looking forward to seeing the projects :)

Do you maybe have a hint on how to share pandas DataFrames containing object columns among processes? I found numpush while searching for this, but it doesn't seem to work with object columns either.


sdiehl commented 12 years ago

In the case where you have a single numpy array that makes up a homogeneous Pandas DataFrame (i.e. no mixed dtypes), you can mmap the bytes of the array in anonymous memory (pass in -1 instead of a fileno to mmap() with the MAP_SHARED flag) and they will reference the same memory across fork() operations. You can then operate on them across processes, and mmap will take care of msync() operations so that all forks see the same data.

The undocumented multiprocessing.heap module basically does this, and that's what we use here in Numpush: https://github.com/sdiehl/numpush/blob/master/numpush/shmem.py#L15
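A minimal sketch of the idea (the shape and dtype here are just illustrative, nothing Numpush-specific):

```python
import mmap
import os
import numpy as np

# Illustrative shape/dtype; any fixed-size dtype works the same way.
shape, dtype = (1000,), np.float64
nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize

# fileno of -1 requests anonymous memory; MAP_SHARED makes writes
# visible to every process that inherits the mapping across fork().
# (Unix-only: relies on fork() semantics.)
buf = mmap.mmap(-1, nbytes, flags=mmap.MAP_SHARED)

# View the shared bytes as a numpy array without copying.
arr = np.frombuffer(buf, dtype=dtype).reshape(shape)
arr[:] = 0.0

pid = os.fork()
if pid == 0:
    # Child: this write lands in the shared mapping, not a private copy.
    arr[0] = 42.0
    os._exit(0)
else:
    os.waitpid(pid, 0)
    print(arr[0])  # prints 42.0 -- the parent sees the child's write
```

The important detail is that np.frombuffer gives a view, not a copy, so both processes are reading and writing the very same bytes.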

The caveat is that most real-world Pandas DataFrames are heterogeneous blocks, and Wes has done some subtle things to crank every bit of performance out of them, so this approach doesn't generalize.

Basically we need better data structures that support data parallelism natively. Travis has talked about this a bit in his presentations: http://www.slideshare.net/pycontw/largescale-arrayoriented-computing-with-python

TL;DR: We need better data structures that can support this sort of data parallelism natively, instead of piles of hacks like Numpush.