gerigk opened this issue 12 years ago
You are correct; the build does fail in this case.
Right now this project is more of a grab-bag of one-off modules and functions than a coherent standalone library.
Right now my energy is directed toward internal ContinuumIO projects, which should hopefully make this library unnecessary in the future.
I'm looking forward to seeing the projects :)
Do you have a hint on how to share pandas DataFrames containing object columns among processes? I found numpush while searching for this, but it doesn't seem to work with object columns either.
In the case where you have a single numpy array that makes up a homogeneous Pandas DataFrame (i.e. no mixed dtypes), you can mmap the bytes of the array in anonymous memory (pass in -1 instead of a fileno to `mmap()`, with the `MAP_SHARED` flag) and they will reference the same memory across `fork()` operations. You can then operate on them across processes, and mmap will take care of `msync()` operations so that all forks see the same data.
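For concreteness, here is a minimal sketch of that approach (not numpush's actual code; the shape, dtype, and the child's write are illustrative assumptions, and `fork()` makes it Unix-only):

```python
import mmap
import os

import numpy as np

# Hypothetical shape/dtype for illustration.
shape, dtype = (4,), np.float64
nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize

# fileno=-1 asks mmap for anonymous memory; MAP_SHARED (the default on
# Unix) makes the pages shared across fork() rather than copy-on-write.
buf = mmap.mmap(-1, nbytes, flags=mmap.MAP_SHARED)
arr = np.frombuffer(buf, dtype=dtype).reshape(shape)
arr[:] = 0.0

pid = os.fork()  # Unix-only
if pid == 0:
    arr[0] = 42.0  # the child's write lands in the shared mapping
    os._exit(0)

os.waitpid(pid, 0)
print(arr[0])  # 42.0 -- the parent sees the child's write
```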
The undocumented `multiprocessing.heap` module basically does this, and that's what we use here in Numpush: https://github.com/sdiehl/numpush/blob/master/numpush/shmem.py#L15
The caveat to this is that most real-world Pandas DataFrames are heterogeneous blocks, and Wes has done some subtle things to crank every bit of performance out of them, so this approach doesn't generalize.
Basically, we need better data structures that support data parallelism natively. Travis has talked about this a bit in his presentations: http://www.slideshare.net/pycontw/largescale-arrayoriented-computing-with-python
TL;DR: We need better data structures that can support this sort of data parallelism natively, instead of piles of hacks like Numpush.
The Makefile within /include/blosc isn't executed by setup.py, and I received a "blosc.so missing" error. Running make within that folder and then running setup.py results in a successful installation.
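For anyone hitting the same error, the workaround looks roughly like this (paths are assumptions; adjust them to your checkout):

```sh
# Build the bundled blosc manually, then install as usual.
cd include/blosc   # relative to the numpush checkout
make
cd ../..
python setup.py install
```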