UniqueFool opened 8 years ago
As far as I understand, you are (at least) half-way there, i.e. referring to Specification.js and its validateSubspecs() function. It all seems a little monolithic currently, so I am not yet clear if/how to reuse that code from another thread/process (it just occurred to me that clustering via fork/exec may be simpler if there is any global state that must not be tampered with).
> it all seems a little monolithic currently
Yeah I admit I need to refactor this some more.
> how much work it would be to get a list of all "independent" specifications - i.e. those not used anywhere else
This shouldn't be too difficult. Maybe to start I can come up with a function that returns this and we can think about the parallelization some more. I wonder if it might make more sense for the parallelization part to be handled outside of inductive.js, maybe as a plugin library or something. Not sure though.
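Roughly, I'm imagining something like this (just a sketch - the subSpecs field is an assumption about how the dependency information could be exposed, not the actual internals):

```js
// Sketch: collect specs that no other spec depends on.
// Assumes each spec lists its direct dependencies in spec.subSpecs.
function getIndependentSpecs(allSpecs) {
  var usedElsewhere = new Set()
  allSpecs.forEach(function(spec) {
    (spec.subSpecs || []).forEach(function(sub) {
      usedElsewhere.add(sub)
    })
  })
  return allSpecs.filter(function(spec) {
    return !usedElsewhere.has(spec)
  })
}
```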
That sounds about right - there is no need for it to be handled by inductive.js itself, as long as the underlying code provides APIs for accessing the corresponding information, for example:
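(the accessor names below are made up, just to illustrate the kind of API I mean)

```js
// Hypothetical accessors - names invented purely for illustration:
var allSpecs = inductive.getSpecifications()    // every registered spec
var roots = inductive.getIndependentSpecs()     // specs no other spec depends on
var deps = inductive.getSubSpecs(someSpec)      // direct dependencies of one spec
```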
And then the Solver API would need to support handling specs in a read-only fashion, so that it won't matter whether the Solver runs in the same thread or in another thread/process.
Besides, most parallelization-related JavaScript modules work in terms of JSON I/O for passing data between workers. Thus, if the Solver could internally be told to use/return JSON, that would simplify using it with multi-processing, including standalone modules built specifically for clustering: https://www.sitepoint.com/how-to-create-a-node-js-cluster-for-speeding-up-your-apps/
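To illustrate what I mean, a minimal sketch using node's built-in cluster module - solveSpecFromJSON is purely hypothetical, standing in for whatever JSON-based entry point the Solver might expose:

```js
var cluster = require('cluster')

// Hypothetical stand-in for a Solver entry point that accepts and
// returns plain, JSON-serializable data.
function solveSpecFromJSON(spec) {
  return { code: 'function ' + spec.name + '() { /* ... */ }' }
}

if (cluster.isMaster) {
  var worker = cluster.fork()
  // Everything crossing the process boundary is plain JSON.
  worker.send({ spec: { name: 'square' } })
  worker.on('message', function(result) {
    console.log('solved', result.name, 'in', result.duration, 'ms')
    worker.kill()
  })
} else {
  process.on('message', function(msg) {
    var started = Date.now()
    var solution = solveSpecFromJSON(msg.spec)
    process.send({
      name: msg.spec.name,
      code: solution.code,
      duration: Date.now() - started
    })
  })
}
```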
I started to set up something along these lines on the 'parallel' branch, if you want to take a look.
Thank you for the update and the pointer - I have just looked at the corresponding branch/commit, and I think it's looking very promising already. I still need to look at the specifics of some of the building blocks you are using there, i.e. the DeferredApiCall stuff.
https://github.com/omphalos/inductive.js/commit/d498f00517555d02b8d1c859d96ec02e69034900
The `DeferredApiCall` is used to translate the DSL syntactic sugar into OO-calls.
For example:
```js
declare('square',
  given(3, shouldReturn(9)),
  use('*'))
```
This effectively gets translated by `DeferredApiCall` into:
```js
var square = declare('square')
var scenario = square.given(3)
var expectation = scenario.shouldReturn(9)
square.use('Number+')
```
> I started to set up something along these lines on the 'parallel' branch, if you want to take a look.
I was thinking that for the retrieval of subSpecs, it would actually be useful to return a hash/object with spec-specific information, including the time it took to solve that spec.
The main idea here is that I was running testSolves.i.js and I think I can use that as a testbed: come up with a few heavily nested specs, time the execution of the solver, and compare that against using a thread pool to solve nested/independent specs separately in parallel. For instance, testSolves.i.js is currently taking roughly 8 seconds for me, so that would be a suitable test case to see whether there is any merit to parallelizing the solver for sub-specs.
@UniqueFool I added solve steps to testParallel.i.js and included duration metadata in the solution. To set up the solve step so that it is easily callable, I had to move a little code around.
Also, you had asked about JSON data for serialization. Right now there is `spec.solution.code`, which holds a string that can be sent across process boundaries. It contains the raw code for the solution found by the solver. You can see examples by running the tests in the branch. I think this should be sufficient if you are only parallelizing top-level specs. If you want to parallelize specs lower in the dependency tree, that will take more work, but it's probably doable with some thought. Do you think this pattern, as it stands, is going to be sufficient for your needs?
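For concreteness, shipping that string across a process boundary could look roughly like this (a sketch only - solveWorker.js is a hypothetical script that would run the solve step and post back `spec.solution.code`):

```js
// Parent side: fork one child per top-level spec.
var fork = require('child_process').fork

var child = fork('./solveWorker.js') // hypothetical worker script
child.send({ specName: 'square' })
child.on('message', function(msg) {
  // msg.code is just spec.solution.code: a plain string, JSON-safe.
  console.log(msg.specName, '=>', msg.code)
  child.kill()
})
```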
I think that's looking very good already - we probably need to hook this up to some "tasking" scheme and actually let it solve a few specs to see how well it works.
Referring to multi-threading, you mentioned the idea of using a plugin/separate module for that - is there anything you'd prefer, or anything you would not like to see used at all?
For example, one option may be using the async module: https://github.com/caolan/async
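Something along these lines, perhaps (a sketch; solveSpec is a placeholder for however a single spec would be handed off to a worker):

```js
var async = require('async')
var os = require('os')

// Placeholder: in practice this would hand the spec to a worker
// process/thread and invoke the callback once it has been solved.
function solveSpec(spec, callback) {
  setImmediate(callback)
}

// Solve independent specs with bounded parallelism (one job per core).
function solveAll(specs, done) {
  async.eachLimit(specs, os.cpus().length, solveSpec, done)
}
```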
> is there anything you'd prefer, or anything you would not like to see used at all?
I would say go ahead and use whatever you like.
As a side note, I was thinking about how caching would need to work with parallelization. Right now the cache is stored in a file, which is read and written from a single process. A straightforward solution would be to override the default cache file to avoid races; this can be done by overriding `globalOptions.cacheFileName`. Alternatively, I could work on synchronizing reads/writes to the cache, but that's a little more work and I'm not sure it's necessary. What do you think?
Yeah, the caching thing also occurred to me - it would apply especially in the fork/exec or distributed setup. I was thinking of using some kind of md5/sha hash of the spec to be solved as the filename (optionally requiring specs to have an "identifier" string to be parallelizable), so that racing would not be an issue - unless two workers were solving the same spec concurrently under the same identifier, at which point something else is obviously wrong ;-)
That way, inductive.js could use some unique attribute of the spec to come up with a temporary filename for the spec to be solved.
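Roughly like this (a sketch - the identifier field is the assumption mentioned above, and `globalOptions.cacheFileName` is the knob you pointed out):

```js
var crypto = require('crypto')

// Derive a per-spec cache filename from the spec's unique identifier,
// so that concurrent workers never touch the same cache file.
function cacheFileFor(spec) {
  var hash = crypto.createHash('md5')
    .update(spec.identifier)
    .digest('hex')
  return '.inductive-cache-' + hash + '.json'
}

// e.g. in a worker, before solving:
// globalOptions.cacheFileName = cacheFileFor(spec)
```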
But maybe you are right and that's a bit too much work for now - I guess it makes sense to gather some experience with the parallel workflow and then revisit this.
> If you want to parallelize specs lower in the dependency tree, that will take more work, but it's probably doable with some thought.
Not saying I really need this, just wanting to understand how difficult it would be - could you provide a few pointers so that I can take a look and see what's missing? I.e. is the dependency resolution available as nested vectors that I could solve in a forEach fashion, handling each dependency before converging? The background is that I am thinking in terms of a thread pool with workers, to which I would need to assign "jobs" - but those obviously need some ordering, or things get messy quickly.
(However, don't bother if it's not exactly straightforward.)
testParallel.i.js has the dependency tree, which I think you would have to iterate over to determine what needs to be solved.
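Something like the following might work for deriving a solve order (a sketch; the subSpecs field is an assumption about how the tree is exposed). A post-order walk lists every spec after its dependencies, so a worker never picks up a spec before its sub-specs are solved:

```js
// Post-order walk of the dependency tree: dependencies first, spec last.
function solveOrder(spec, seen, order) {
  seen = seen || new Set()
  order = order || []
  if (seen.has(spec)) return order
  seen.add(spec)
  ;(spec.subSpecs || []).forEach(function(sub) {
    solveOrder(sub, seen, order)
  })
  order.push(spec)
  return order
}
```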
The problem is sharing state between workers. Each spec has a `solution`. Ideally it would be possible to share this solution between specs, but the `solution` object doesn't lend itself to serialization as it contains prototype references and circular structures.
That said, it's probably possible to implement serialization functionality on the important parts of the solution object. To do this I'd have to implement serialize/deserialize functionality on all the `Expression` and `Type` prototypes in the library. That's a good chunk of work, to be honest.
OK, thanks for clarifying - don't bother then, I was just asking to get a better understanding of things.
Referring to the BFS talks, I would like to look at the solver and specifically investigate how much work it would be to get a list of all "independent" specifications - i.e. those not used anywhere else - and then hand those off to the solver separately, i.e. using an array of specifications and a forEach call to invoke the solver as a callback.
Once that works, I would like to investigate doing that in parallel using a pool of worker threads.
The final step would be doing this recursively for specifications that depend on other specifications, i.e. investigating whether there is any global solver state whose access would need to be serialized - or, preferably, turning everything into a map/reduce-style workflow that can be processed in a functional-programming style, or using some kind of message passing to run the solver asynchronously.
Pseudo code:
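(rough sketch only - the names here are placeholders, not actual inductive.js APIs)

```js
// Phase 1: solve all independent specs, serially at first ...
var independent = getIndependentSpecs(allSpecs)
independent.forEach(function(spec) { solve(spec) })

// ... then in parallel via a worker pool:
// pool.map(independent, solve)

// Phase 2: recurse over dependent specs, solving each spec's
// sub-specs before the spec itself (map/reduce style).
function solveRecursive(spec) {
  spec.subSpecs.forEach(solveRecursive)
  solve(spec)
}
```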