Optimize marshaling in parallel computation by allowing constructing Numo arrays from pre-allocated data pointers, such as coming from views.

giuse commented 6 years ago

In modern machine learning, parallel computation is a necessity. In Ruby this currently implies using methods based on #fork rather than #thread, since the goal is distributing the computation on multiple cores and it is thus needed to avoid the GIL.

A typical example is the gem parallel: it launches a number of workers, send chunks of input to compute to each in turn as they are found free, then collects the return values from each. By "collects" I mean that the data must be passed from the worker to the original process: only the latter will continue the computation, while the workers will terminate after job completion.

This passage of data is traditionally done by marshaling. In standard marshaling (as currently implemented by Numo) the data of the matrix is passed in some string form (usually binary). This method is designed for data dumping and network communication; it is highly inefficient if the workers and the original process both reside on the same machine.

What would make a breakthrough for machine learning performance in Numo would be a marshaling technique that allows it to retrieve the result of other processes' computations by passing C data pointers.

The major challenge here is that at child termination its allocated memory is freed. In order for this special marshaling technique to be successful, the children must receive a pointer to a memory area allocated by the parent, then build a Numo array out of it, allowing it to store the results transparently.

Example: 2 children processes are used to compute a 3x3 matrix each:

The parent allocates a Numo array with shape [2,3,3]
Each child is passed the pointer to one of the elements of such array, i.e. a reference to the memory pointed by the views [0,true,true] and [1,true,true]
The child builds a Numo array with shape [3,3] using the memory pointers coming from the parent
The child writes data on its array, transparently, and terminates: no backwards marshaling is required
The parent waits for all children to terminate, then reads their results on its original 3x2x2 array

The process might even be wrapped in a Numo::Parallel module for extra convenience (the whole gem parallel is itself a few hundred lines of code).

giuse commented 6 years ago

You can see it is not as easy as defining the returning data structure as a closure and send it to all children (you will need to $ gem install parallel first):

require 'parallel'
require 'numo/narray'
ary = Numo::Int32.zeros(5)
Parallel.map(0...ary.size) { |i| puts i; ary[i] = i } # => asynchronously prints numbers between 0 and 4
p ary # => Numo::Int32#shape=[5] [0, 0, 0, 0, 0]

The stack of the parent is entirely duplicated in each child. Each child has (and accesses) its own distinct copy of ary. A way is needed for a child to point back into the parent's space for this optimization to work.

masa16 commented 6 years ago

Parallel array computing is an interesting topic but distant target. Python's Dask will be a good model.

giuse commented 6 years ago

Dask and similar methods aimed at distributed computing (e.g. ruby-spark) should already work well with the current marshaling. For local parallelization though, accessing the underlying C pointer would constitute a huge improvement.

Is there any chance to see direct access to the C pointer (both get and set) in the near future? I would gladly contribute a patch if you could point me towards how would you do it.

masa16 commented 6 years ago

I do not well understand the relationship between parallelization and C pointer access. In Ruby level, it is hard to allow C pointer access due to Ruby's memory management. I think it is not impossible but needs a quite careful design.

Your example seems invalid because two processes do not share the memory. Even after the child process writes back to the memory, it does not change the memory in the parent process. c.f., https://stackoverflow.com/questions/26534613/about-pointers-after-fork

giuse commented 6 years ago

Relationship between parallelization and C pointers: sometimes you want to compute complex objects in child processes to parallelize the execution. Then the result need to go back to the parent process. In my case, I run (slow) Atari simulators in parallel in children processes, each controlled by a neural network; at the end of the simulation, some of the images generated (represented as Numo::NArrays) need to be passed back to the parent process. With marshaling, this implies encoding each pixel matrix into a string, send the string to the parent process, then the parent need to decode the string again into a data matrix. As the images are large, this is highly inefficient if parent and child are on the same machine. Using shared memory between processes instead, the child could write directly the images in a piece of memory accessible (allocated) by the parent, and avoid entirely the expensive marshaling process. I hope this makes sense to you.
Sharing memory between parent and child: you are correct, that was the goal of my example. You are also correct on the reason: I forgot about virtual memory management. The answer to that, is that there is a whole set of C primitives allowing memory sharing between processes. https://stackoverflow.com/questions/13274786/how-to-share-memory-between-process-fork Maybe exposing naked pointers is not even needed: if a NArray#new_shared call allocates shared memory, the (OS) kernel should already recognize it and allow write access to it to the child, rather than the usual "copy on write". No further changes needed: no need for a separate class, the methods of NArray should find no difference, even the marshaling should stays the same; only change is that the children can write on parent-allocated data. What do you think?

masa16 commented 6 years ago

NArray#marshal_dump does not encode elements to string form, but just copies binary data.

Numo::Int8.new(5).seq.marshal_dump
=> [1, [5], 0, "\x00\x01\x02\x03\x04"]

It is interesting to use shared memory for numo-narray. But I do not know whether Ruby can handle shared memory under Ruby's GC system. It may be possible using one of mmap-related modules.

giuse commented 6 years ago

But yet the binary representation is returned in a String object.
```
dump = Numo::Int8.new(5).seq.marshal_dump # => [1, [5], 0, "\x00\x01\x02\x03\x04"]
dump.last.class # => String
```
It is this sequence of characters which would ultimately be passed through the network, OS pipe, or any other communication device using the marshaling. Interestingly, Numo::Int8 uses 8 bits per number, for a total of 5 bytes for the array in example, while its marshal dump binary encoding String uses 8 bits per character (or more depending on encoding), for a total of 15 bytes. All those are rough estimates of course, not including object wrapping and such.
I imagine the following minimal test would be enough for verification.
- Find the C function allocating the memory for a new NArray, copy-paste the code, rename the function
- Edit the allocation in new function to use the shared memory allocation command correspondent to the normal allocation currently used
- Edit the reference that makes the C function available in Ruby such that the new function is available
- Run my example from my first comment to this issue, edited to call the shared allocation initializer
Please take this with a grain of salt though, I have not the Ruby-C competence to do that on top of my head or I would have included it myself.

masa16 commented 6 years ago

"\xnn" is a hexadecimal notation representing 1-byte character.

dump = Numo::Int8.new(5).seq.marshal_dump
=> [1, [5], 0, "\x00\x01\x02\x03\x04"]
dump.last.size
=> 5

"\x30\x61"
=> "0a"
"\x30\x61".size
=> 2

So dumped string is a copy of binary data, not encoded.

giuse commented 6 years ago

You are right, I don't know how I could forget the \xnn representation. Thank you for correcting me! This is important because I am working with one-byte-per-pixel images, so by using Numo::UInt8 I should automatically have the smallest representation possible being sent by marshaling.

I hope you will find time and will to investigate also the second point of my previous comment.

Your constant support and quick improvement is greatly appreciated, I am a happy Numo user :)

giuse commented 6 years ago

I found this article to be relevant: https://blog.rebased.pl/2017/12/27/writing-c-and-sharing-memory.html

ruby-numo / numo-narray

Optimize marshaling in parallel computation by allowing constructing Numo arrays from pre-allocated data pointers, such as coming from views. #84