gvegayon / parallel

PARALLEL: Stata module for parallel computing
https://rawgit.com/gvegayon/parallel/master/ado/parallel.html
MIT License

Run multiple commands in parallel without splitting data #67

Closed khawandc closed 6 years ago

khawandc commented 6 years ago

I want to run a few distinct commands in parallel, each on all observations of a single data set, but I do not know how to disable parallel's splitting of the data set.

Expected behavior and actual behavior

clear
set obs 1000000
parallel setclusters 3

forvalues v = 1(1)3 {
    gen x`v' = rnormal()
    gen y`v' = x`v' + rnormal()
}

program define testprog
    forvalues v = 1(1)3 {
        reg y`v' x`v'
        di e(N)
        predict fv`v'
    }
end

parallel, prog(testprog): testprog

What I want is to run each command (reg y1 x1, reg y2 x2, etc.) in parallel on the full data set of 1,000,000 observations and then have each one create a fitted-value variable. Instead, parallel runs three tasks, and each regression is run on only 333,333 observations.

Any easy workarounds?

Thanks for your time. Love the code/idea.

bquistorff commented 6 years ago

parallel is fairly observation-based rather than column-based: it splits the work across observations, not across variables. You should still be able to accomplish what you want by doing the following:

  1. Make the dataset in memory a list of identifiers, one for each variable you want to generate. In each parallel child, store the chunk of identifiers you receive, then load the real dataset, loop over the work to do, and generate the corresponding subset of variables. This will look like this example.
  2. Add an outputopts option to the parallel call so that each child has a name for where to store its output.
  3. After parallel is done, load the original dataset and bring in (likely via merge) all the generated variables from the ancillary files.

Hope that helps.
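The three steps above can be sketched roughly as follows. This is a minimal illustration and not the maintainers' reference solution. The file names (full_data.dta, fv_chunk_*.dta), the id and task variables, and the program name fitchunk are all hypothetical. The outputopts syntax is not shown in this thread, so the sketch instead names each child's output file after the first identifier in its chunk. It also assumes that the children resolve relative paths against the same working directory as the parent session; if that does not hold in your setup, use absolute paths.

```stata
// Sketch only: file names, variable names, and fitchunk are hypothetical.
clear
set obs 1000000
gen long id = _n                     // key used later by merge
forvalues v = 1/3 {
    gen x`v' = rnormal()
    gen y`v' = x`v' + rnormal()
}
save "full_data.dta", replace

capture program drop fitchunk
program define fitchunk
    // In a child, the data in memory is that child's chunk of the task list.
    levelsof task, local(tasks)
    local first : word 1 of `tasks'
    preserve                         // keep the chunk so it can be handed back
    use "full_data.dta", clear       // every child sees all observations
    foreach v of local tasks {
        regress y`v' x`v'
        predict fv`v'
    }
    keep id fv*
    save "fv_chunk_`first'.dta", replace
    restore                          // leave only the small chunk in memory
end

// Step 1: the dataset that parallel splits is just the list of tasks.
clear
set obs 3
gen task = _n

parallel setclusters 3
parallel, prog(fitchunk): fitchunk

// Step 3: reload the full data and merge in each child's fitted values.
use "full_data.dta", clear
local files : dir "." files "fv_chunk_*.dta"
foreach f of local files {
    merge 1:1 id using "`f'", nogenerate
}
```

Because parallel splits only the three-row task list, each regression runs on all 1,000,000 observations. The final loop merges every file matching fv_chunk_*.dta, so delete any leftover files from earlier runs before starting.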