gvegayon / parallel

PARALLEL: Stata module for parallel computing
https://rawgit.com/gvegayon/parallel/master/ado/parallel.html
MIT License
118 stars 26 forks source link

Error cutting data when using by() #46

Closed bquistorff closed 8 years ago

bquistorff commented 8 years ago

The following produces an error on both current (master) and old (ssc) versions

. sysuse auto, clear
. sort rep78
. parallel setclusters 6
. parallel, by(rep78): egen max_rep78 = max(rep78)
Insufficient number of groups:
Can not divide the dataset into -6- clusters.

Even though there are 6 groups of rep78, the code in parallel_divide_index.mata can't correctly assign the groups to clusters. The code tries to make the cluster workloads about even in size, but it can't deal with the fact that the first two groups are smaller than what the average workload should be (_N/$PLL_CLUSTERS) and it can't correct itself after assigning two groups to the first cluster.

This dividing task is a general version of the classic "partition problem" which is hard to solve (NP-hard). Therefore I don't think we should try for optimality, but for simplicity and reasonableness. I propose we replace the by-splitting code with the following solution.

egen _`parallelid'grp = group(`xtstructure'), missing //identify groups. xtstructure holds by-vars
preserve
keep _`parallelid'grp
contract _`parallelid'grp //determine their sizes
sort _freq
gen _`parallelid'cut = mod(_n, ${PLL_CLUSTERS}) + 1 //assign them in a round-robin way
tempfile grp_to_cut_map
qui save `grp_to_cut_map'
restore
qui merge m:1 _`parallelid'grp using `grp_to_cut_map', keepusing(_`parallelid'cut) nogenerate
drop _`parallelid'grp

Does this seem like it will work? I've coded it up and it passes the tests fine.

gvegayon commented 8 years ago

Looks like a better solution. I only see it at https://github.com/gvegayon/parallel/blob/7c9914c094687dac78cb0a4beaad8e51a6a81831/ado/parallel.ado#L187 so I think you can go for it.

Thanks!

bquistorff commented 8 years ago

Done in 97f711c.