New execution logic: module level sequential execution and group all

stephenslab / dsc

Repo for Dynamic Statistical Comparisons project

https://stephenslab.github.io/dsc-wiki

MIT License

New execution logic: module level sequential execution and group all #185

Open gaow opened 5 years ago

gaow commented 5 years ago

@willwerscheid was wondering about a scenario where, for a large data set, we load the data once and subsample from it many times to simulate smaller data sets. This boils down to some sequential execution logic, e.g.:

simulate:
   seed: R{1:10}
   data: '/path/to/data'
   ...

and we execute this module sequentially rather than in parallel (or parallelize it within a single R session), so that we load the data only once.

The biggest challenge is that we would then have to move the for loop to the module script level (language specific) rather than running it at the DSC level. This is a fundamental change that the existing code cannot easily be adapted to. But I can see the appeal of the request, so we need to think about how best to do it.

pcarbo commented 5 years ago

Why is it important to load the data only once?

pcarbo commented 5 years ago

@willwerscheid I think if loading a data set multiple times is a big issue, you should consider: (1) finding a way to make the data loading run faster (e.g., by saving it in an efficient format), or (2) having a single module that creates all the data subsets in one go, so that the subsets can then be loaded in a separate module that is replicated many times.

This seems to me more a question of how best to design your DSC, and I think it can be accomplished with the existing DSC features.
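A minimal Python sketch of option (2), assuming the subsets are passed between modules as files: one function writes every subset in a single pass over the full data, and a second function, which could run in parallel once per subset, loads just one of them. The names `make_subsets` and `load_subset`, the pickle format, and the placeholder data are assumptions for illustration, not part of DSC.

```python
import pickle
import random
import tempfile
from pathlib import Path


def make_subsets(data, seeds, n_sub, out_dir):
    """Run once: write one subsample file per seed and return their paths."""
    out_dir = Path(out_dir)
    paths = {}
    for seed in seeds:
        rng = random.Random(seed)
        path = out_dir / f"subset_{seed}.pkl"
        with open(path, "wb") as fh:
            pickle.dump(rng.sample(data, n_sub), fh)
        paths[seed] = path
    return paths


def load_subset(path):
    """Run many times: each replicate loads only its own small subset."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


full_data = list(range(1000))  # placeholder for the large data set
out_dir = tempfile.mkdtemp()   # scratch directory for the subset files
paths = make_subsets(full_data, range(1, 11), 50, out_dir)
first = load_subset(paths[1])
```

With this split, only the first step ever touches the large data set; the replicated step reads small files, so it can stay within the usual parallel execution model.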

gaow commented 5 years ago

@pcarbo I agree with your assessment, although it is not completely impossible to address this at a higher DSC level. I'm thinking of addressing things like that in DSC 2.0, along with the map-reduce notion that in the end all results flow to one node. A third thing worth doing is to allow for multiple outcomes per module instance; that is the best way to address the issue of benchmarking with command line tools in, say, bash.

In any case, points 2 and 3 are not relevant to @willwerscheid's initial question, but they are related in that they are exceptions or extensions to the parallel execution paradigm. So I'd like to keep this ticket open as a reminder to myself when I re-evaluate and design some of the execution logic down the road.