The following produces an error on both current (master) and old (ssc) versions
. sysuse auto, clear
. sort rep78
. parallel setclusters 6
. parallel, by(rep78): egen max_rep78 = max(rep78)
Insufficient number of groups:
Can not divide the dataset into -6- clusters.
Even though there are 6 groups of rep78, the code in parallel_divide_index.mata can't correctly assign the groups to clusters. The code tries to make the cluster workloads about even in size, but it can't deal with the fact that the first two groups are smaller than what the average workload should be (_N/$PLL_CLUSTERS) and it can't correct itself after assigning two groups to the first cluster.
This dividing task is a general version of the classic "partition problem" which is hard to solve (NP-hard). Therefore I don't think we should try for optimality, but for simplicity and reasonableness. I propose we replace the by-splitting code with the following solution.
egen _`parallelid'grp = group(`xtstructure'), missing //identify groups. xtstructure holds by-vars
preserve
keep _`parallelid'grp
contract _`parallelid'grp //determine their sizes
sort _freq
gen _`parallelid'cut = mod(_n, ${PLL_CLUSTERS}) + 1 //assign them in a round-robin way
tempfile grp_to_cut_map
qui save `grp_to_cut_map'
restore
qui merge m:1 _`parallelid'grp using `grp_to_cut_map', keepusing(_`parallelid'cut) nogenerate
drop _`parallelid'grp
Does this seem like it will work? I've coded it up and it passes the tests fine.
The following produces an error on both current (master) and old (ssc) versions
Even though there are 6 groups of
rep78
, the code inparallel_divide_index.mata
can't correctly assign the groups to clusters. The code tries to make the cluster workloads about even in size, but it can't deal with the fact that the first two groups are smaller than what the average workload should be (_N/$PLL_CLUSTERS
) and it can't correct itself after assigning two groups to the first cluster.This dividing task is a general version of the classic "partition problem" which is hard to solve (NP-hard). Therefore I don't think we should try for optimality, but for simplicity and reasonableness. I propose we replace the by-splitting code with the following solution.
Does this seem like it will work? I've coded it up and it passes the tests fine.