Open giuse opened 6 years ago
You can see it is not as easy as defining the returned data structure in a closure and sending it to all children (you will need to run gem install parallel first):
require 'parallel'
require 'numo/narray'
ary = Numo::Int32.zeros(5)
Parallel.map(0...ary.size) { |i| puts i; ary[i] = i } # => asynchronously prints numbers between 0 and 4
p ary # => Numo::Int32#shape=[5] [0, 0, 0, 0, 0]
The parent's address space is entirely duplicated in each child. Each child has (and accesses) its own distinct copy of ary. A way is needed for a child to point back into the parent's space for this optimization to work.
Parallel array computing is an interesting topic but a distant target. Python's Dask would be a good model.
Dask and similar methods aimed at distributed computing (e.g. ruby-spark) should already work well with the current marshaling. For local parallelization though, accessing the underlying C pointer would constitute a huge improvement.
Is there any chance of seeing direct access to the C pointer (both get and set) in the near future? I would gladly contribute a patch if you could point me towards how you would do it.
I do not fully understand the relationship between parallelization and C pointer access. At the Ruby level, it is hard to allow C pointer access because of Ruby's memory management. I think it is not impossible, but it needs quite a careful design.
Your example seems invalid because the two processes do not share memory. Even after the child process writes to its memory, the memory in the parent process does not change. cf. https://stackoverflow.com/questions/26534613/about-pointers-after-fork
Relationship between parallelization and C pointers: sometimes you want to compute complex objects in child processes to parallelize the execution. The result then needs to go back to the parent process. In my case, I run (slow) Atari simulators in parallel in child processes, each controlled by a neural network; at the end of the simulation, some of the images generated (represented as Numo::NArrays) need to be passed back to the parent process. With marshaling, this implies encoding each pixel matrix into a string and sending the string to the parent process, which then needs to decode the string back into a data matrix. As the images are large, this is highly inefficient if parent and child are on the same machine. Using shared memory between processes instead, the child could write the images directly into a piece of memory allocated by (and accessible to) the parent, and avoid the expensive marshaling process entirely. I hope this makes sense to you.
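The round trip described above can be sketched with the standard library alone. This is only an illustration under assumptions: a small nested Array stands in for a Numo::NArray image, and a plain pipe stands in for whatever channel the parallel gem uses internally.

```ruby
# Stand-in for a small one-byte-per-pixel image (assumption: nested Arrays
# instead of Numo::NArray, so that the sketch needs no gems).
width  = 4
height = 3

reader, writer = IO.pipe
reader.binmode
writer.binmode

pid = fork do
  reader.close
  image = Array.new(height) { |y| Array.new(width) { |x| (x + y) % 256 } }
  Marshal.dump(image, writer) # child: encode the object into bytes and send them
  writer.close
  exit!(0)                    # leave without running inherited at_exit handlers
end

writer.close
result = Marshal.load(reader) # parent: decode the bytes back into an object
reader.close
Process.wait(pid)

p result
```

Every pixel is serialized in the child, copied through the pipe, and rebuilt in the parent, which is the cost that shared memory would avoid.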
Sharing memory between parent and child: you are correct, that was the goal of my example. You are also correct about the reason: I forgot about virtual memory management. The answer is that there is a whole set of C primitives that allow memory sharing between processes.
https://stackoverflow.com/questions/13274786/how-to-share-memory-between-process-fork
Maybe exposing naked pointers is not even needed: if a NArray#new_shared call allocates shared memory, the (OS) kernel should already recognize it and give the child write access to it, rather than the usual "copy on write". No further changes would be needed: no separate class, the methods of NArray should see no difference, and even the marshaling should stay the same; the only change is that the children can write to parent-allocated data. What do you think?
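The kernel behaviour this relies on can be tried without Numo. The following is a rough sketch under explicit assumptions: it calls mmap through Ruby's Fiddle library, hard-codes the Linux values of the mmap flags (other platforms use different numbers), and works on raw bytes. NArray#new_shared itself remains hypothetical and is not implemented here.

```ruby
require 'fiddle'

# Assumption: Linux flag values. Other platforms differ
# (macOS, for instance, uses 0x1000 for MAP_ANONYMOUS).
prot_read_write = 0x1 | 0x2   # PROT_READ | PROT_WRITE
map_shared_anon = 0x01 | 0x20 # MAP_SHARED | MAP_ANONYMOUS
size = 5

# void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset)
mmap = Fiddle::Function.new(
  Fiddle::Handle::DEFAULT['mmap'],
  [Fiddle::TYPE_VOIDP, Fiddle::TYPE_SIZE_T, Fiddle::TYPE_INT,
   Fiddle::TYPE_INT, Fiddle::TYPE_INT, Fiddle::TYPE_LONG],
  Fiddle::TYPE_VOIDP
)

# The parent allocates the region before forking, so the child inherits the mapping.
region = mmap.call(nil, size, prot_read_write, map_shared_anon, -1, 0)
max_address = 2**(8 * Fiddle::SIZEOF_VOIDP) - 1 # MAP_FAILED is (void *) -1
raise 'mmap failed' if region.null? || [-1, max_address].include?(region.to_i)

pid = fork do
  # The child writes straight into the shared region: no marshaling, no pipe.
  region[0, size] = [0, 1, 2, 3, 4].pack('C*')
  exit!(0)
end
Process.wait(pid)

# The parent reads what the child wrote.
bytes = region[0, size].unpack('C*')
p bytes
```

Because the mapping is MAP_SHARED, the child's write is not subject to copy-on-write and should be visible to the parent after the child exits. Wrapping such a region in an NArray would still require support inside Numo itself.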
Numo::Int8.new(5).seq.marshal_dump
=> [1, [5], 0, "\x00\x01\x02\x03\x04"]
Yet the binary representation is still returned in a String object.
dump = Numo::Int8.new(5).seq.marshal_dump # => [1, [5], 0, "\x00\x01\x02\x03\x04"]
dump.last.class # => String
It is this sequence of characters which would ultimately be passed through the network, OS pipe, or any other communication device using the marshaling.
Interestingly, Numo::Int8 uses 8 bits per number, for a total of 5 bytes for the array in the example, while the binary encoding String of its marshal dump uses 8 bits per character (or more, depending on the encoding), for a total of 15 bytes. All of these are rough estimates of course, not including object wrapping and such.
I imagine the following minimal test would be enough for verification.
Please take this with a grain of salt though: I do not have the Ruby-C competence to do that off the top of my head, or I would have included it myself.
"\xnn" is a hexadecimal notation representing a 1-byte character.
dump = Numo::Int8.new(5).seq.marshal_dump
=> [1, [5], 0, "\x00\x01\x02\x03\x04"]
dump.last.size
=> 5
"\x30\x61"
=> "0a"
"\x30\x61".size
=> 2
So the dumped string is a copy of the binary data, not an encoding of it.
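The same point can be checked without Numo. In this small sketch, Array#pack stands in for the dump (an assumption made only so that the snippet runs with plain Ruby); it confirms that each value occupies exactly one byte in the resulting String.

```ruby
values = [0, 1, 2, 3, 4]
packed = values.pack('c*')       # one signed 8-bit integer per element

p packed.bytesize                # => 5
p packed.bytes                   # => [0, 1, 2, 3, 4]
p "\x30\x61" == "0a"             # => true: \xnn is notation for a single byte
p packed.unpack('c*') == values  # => true: the bytes round-trip unchanged
```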
You are right, I don't know how I could forget the \xnn representation. Thank you for correcting me!
This is important because I am working with one-byte-per-pixel images, so by using Numo::UInt8 I should automatically have the smallest possible representation being sent by marshaling.
I hope you will find the time and the will to also investigate the second point of my previous comment.
Your constant support and quick improvements are greatly appreciated, I am a happy Numo user :)
I found this article to be relevant: https://blog.rebased.pl/2017/12/27/writing-c-and-sharing-memory.html
In modern machine learning, parallel computation is a necessity. In Ruby this currently implies using methods based on #fork rather than #thread, since the goal is to distribute the computation over multiple cores, which requires avoiding the GIL.
A typical example is the gem parallel: it launches a number of workers, sends chunks of the input to each in turn as they become free, then collects the return values from each. By "collects" I mean that the data must be passed from the worker to the original process: only the latter will continue the computation, while the workers will terminate after job completion.
This passage of data is traditionally done by marshaling. In standard marshaling (as currently implemented by Numo) the data of the matrix is passed in some string form (usually binary). This method is designed for data dumping and network communication; it is highly inefficient if the workers and the original process both reside on the same machine.
What would make a breakthrough for machine learning performance in Numo would be a marshaling technique that allows it to retrieve the result of other processes' computations by passing C data pointers.
The major challenge here is that at child termination its allocated memory is freed. In order for this special marshaling technique to be successful, the children must receive a pointer to a memory area allocated by the parent, then build a Numo array out of it, allowing it to store the results transparently.
Example: 2 children processes are used to compute a 3x3 matrix each:
[0,true,true] and [1,true,true]
[3,3] using the memory pointers coming from the parent
The process might even be wrapped in a Numo::Parallel module for extra convenience (the whole gem parallel is itself a few hundred lines of code).