ruby-numo / numo-narray

Ruby/Numo::NArray - New NArray class library
http://ruby-numo.github.io/narray/
BSD 3-Clause "New" or "Revised" License
418 stars 42 forks source link

Optimize marshaling in parallel computation by allowing constructing Numo arrays from pre-allocated data pointers, such as coming from views. #84

Open giuse opened 6 years ago

giuse commented 6 years ago

In modern machine learning, parallel computation is a necessity. In Ruby this currently implies using methods based on #fork rather than #thread, since the goal is distributing the computation on multiple cores and it is thus needed to avoid the GIL.

A typical example is the gem parallel: it launches a number of workers, send chunks of input to compute to each in turn as they are found free, then collects the return values from each. By "collects" I mean that the data must be passed from the worker to the original process: only the latter will continue the computation, while the workers will terminate after job completion.

This passage of data is traditionally done by marshaling. In standard marshaling (as currently implemented by Numo) the data of the matrix is passed in some string form (usually binary). This method is designed for data dumping and network communication; it is highly inefficient if the workers and the original process both reside on the same machine.

What would make a breakthrough for machine learning performance in Numo would be a marshaling technique that allows it to retrieve the result of other processes' computations by passing C data pointers.

The major challenge here is that at child termination its allocated memory is freed. In order for this special marshaling technique to be successful, the children must receive a pointer to a memory area allocated by the parent, then build a Numo array out of it, allowing it to store the results transparently.

Example: 2 children processes are used to compute a 3x3 matrix each:

  1. The parent allocates a Numo array with shape [2,3,3]
  2. Each child is passed the pointer to one of the elements of such array, i.e. a reference to the memory pointed by the views [0,true,true] and [1,true,true]
  3. The child builds a Numo array with shape [3,3] using the memory pointers coming from the parent
  4. The child writes data on its array, transparently, and terminates: no backwards marshaling is required
  5. The parent waits for all children to terminate, then reads their results on its original 3x2x2 array

The process might even be wrapped in a Numo::Parallel module for extra convenience (the whole gem parallel is itself a few hundred lines of code).

giuse commented 6 years ago

You can see it is not as easy as defining the returning data structure as a closure and send it to all children (you will need to $ gem install parallel first):

require 'parallel'
require 'numo/narray'
ary = Numo::Int32.zeros(5)
Parallel.map(0...ary.size) { |i| puts i; ary[i] = i } # => asynchronously prints numbers between 0 and 4
p ary # => Numo::Int32#shape=[5] [0, 0, 0, 0, 0]

The stack of the parent is entirely duplicated in each child. Each child has (and accesses) its own distinct copy of ary. A way is needed for a child to point back into the parent's space for this optimization to work.

masa16 commented 6 years ago

Parallel array computing is an interesting topic but distant target. Python's Dask will be a good model.

giuse commented 6 years ago

Dask and similar methods aimed at distributed computing (e.g. ruby-spark) should already work well with the current marshaling. For local parallelization though, accessing the underlying C pointer would constitute a huge improvement.

Is there any chance to see direct access to the C pointer (both get and set) in the near future? I would gladly contribute a patch if you could point me towards how would you do it.

masa16 commented 6 years ago

I do not well understand the relationship between parallelization and C pointer access. In Ruby level, it is hard to allow C pointer access due to Ruby's memory management. I think it is not impossible but needs a quite careful design.

Your example seems invalid because two processes do not share the memory. Even after the child process writes back to the memory, it does not change the memory in the parent process. c.f., https://stackoverflow.com/questions/26534613/about-pointers-after-fork

giuse commented 6 years ago
masa16 commented 6 years ago
Numo::Int8.new(5).seq.marshal_dump
=> [1, [5], 0, "\x00\x01\x02\x03\x04"]
giuse commented 6 years ago
masa16 commented 6 years ago

"\xnn" is a hexadecimal notation representing 1-byte character.

dump = Numo::Int8.new(5).seq.marshal_dump
=> [1, [5], 0, "\x00\x01\x02\x03\x04"]
dump.last.size
=> 5

"\x30\x61"
=> "0a"
"\x30\x61".size
=> 2

So dumped string is a copy of binary data, not encoded.

giuse commented 6 years ago

You are right, I don't know how I could forget the \xnn representation. Thank you for correcting me! This is important because I am working with one-byte-per-pixel images, so by using Numo::UInt8 I should automatically have the smallest representation possible being sent by marshaling.

I hope you will find time and will to investigate also the second point of my previous comment.

Your constant support and quick improvement is greatly appreciated, I am a happy Numo user :)

giuse commented 6 years ago

I found this article to be relevant: https://blog.rebased.pl/2017/12/27/writing-c-and-sharing-memory.html