gvegayon / parallel

PARALLEL: Stata module for parallel computing
https://rawgit.com/gvegayon/parallel/master/ado/parallel.html
MIT License

parallel do with large dataset #42

Closed alnahedh closed 8 years ago

alnahedh commented 8 years ago

I am running a large number of rolling panel regressions (unbalanced panel) on a large dataset and collecting the RMSEs from them, and this package speeds that up enormously. I wrote the loop in a do-file so that I can call it with parallel do loop.do.

parallel runs with no errors, but the number of RMSE point estimates it returns depends on the number of clusters I set. Running the loop sequentially, without this package, always returns the same number. Under parallel, the fewer clusters I use, the more point estimates I get back, and with 2 clusters the count almost matches the sequential run. When I restrict the sample from 400,000 obs to 2,000 obs, the discrepancy disappears. Right now I am running 48 clusters on the server.

Any clues as to why this is happening? I am using the latest version of parallel available here, along with Stata 13 MP. I've tried this on several operating systems, school servers, etc., with similar results.

My loop.do file has the following form

levelsof lvls, local(lvls)
foreach l of local lvls {
    forvalues t = 1952(1)2015 {
        capture xtreg y x1 x2 if lvls == `l' & inlist(yr,`t',`t'-1,`t'-2), fe
        capture replace z1 = `e(rmse)' if lvls == `l' & yr==`t'
    }
    forvalues t = 1954(1)2015 {
        capture xtreg y x3 x4 if lvls == `l' & inlist(yr,`t',`t'-1,`t'-2,`t'-3,`t'-4), fe
        capture replace z2 = `e(rmse)' if lvls == `l' & yr==`t'
    }
}

Kindly let me know if I've missed anything from my problem explanation. Thanks!

alnahedh commented 8 years ago

I think I figured out the cause, but unfortunately not the solution. Calling parallel to execute the loop splits the data into N separate .dta files (N = number of clusters), each a subset of the original data. In a panel setting, this means the rolling regressions in each cluster see different data points, and hence produce different estimates, than when they run sequentially on the original data. Is there a way to make parallel send the entire dataset to each cluster rather than creating subsets?

gvegayon commented 8 years ago

Just save the original .dta file on your computer and then load it inside your do-file. You can also use parallel's internal macros to save the data from each cluster, e.g.:

// This line is new
use "pathtomyfile.dta"

levelsof lvls, local(lvls)
foreach l of local lvls {
    forvalues t = 1952(1)2015 {
        capture xtreg y x1 x2 if lvls == `l' & inlist(yr,`t',`t'-1,`t'-2), fe
        capture replace z1 = `e(rmse)' if lvls == `l' & yr==`t'
    }
    forvalues t = 1954(1)2015 {
        capture xtreg y x3 x4 if lvls == `l' & inlist(yr,`t',`t'-1,`t'-2,`t'-3,`t'-4), fe
        capture replace z2 = `e(rmse)' if lvls == `l' & yr==`t'
    }
}

// This line is also new: This will store files 1 through NCLUSTERS
save file_number_$pll_instance`'.dta, replace
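
A possible variation on the do-file above, offered as an untested sketch rather than a confirmed recipe: if every cluster loads the full file and then loops over every level, each cluster repeats the whole job. One way around that might be to record which levels are in the chunk that parallel hands to the child process before loading the full panel, and then loop only over those levels. The local name mylvls is hypothetical, "pathtomyfile.dta" is the same placeholder path as above, and the sketch assumes the saved file already carries the xtset settings and the z1 variable.

```stata
// Sketch only. On entry, the child process holds just its own chunk of the data.

// 1. Record the levels this chunk covers (mylvls is a hypothetical name).
levelsof lvls, local(mylvls)

// 2. Replace the chunk with the full panel so each regression sees all rows.
//    "pathtomyfile.dta" is a placeholder, not a real path.
use "pathtomyfile.dta", clear

// 3. Run the rolling regressions only for this chunk's levels.
foreach l of local mylvls {
    forvalues t = 1952(1)2015 {
        capture xtreg y x1 x2 if lvls == `l' & inlist(yr, `t', `t'-1, `t'-2), fe
        // Only store the RMSE when the regression actually ran,
        // so a failed fit cannot reuse an earlier e(rmse).
        if _rc == 0 {
            replace z1 = e(rmse) if lvls == `l' & yr == `t'
        }
    }
    // The z2 loop (1954 onward, five-year window) would follow the same pattern.
}

// 4. One output file per child process, numbered by parallel's instance macro.
save file_number_${pll_instance}.dta, replace
```

A level whose rows straddle two chunks would be processed by both clusters under this sketch; the estimates should agree because both see the full panel, but the per-cluster files then overlap and need care when combined. Separately, the parallel help file documents a by(varlist) option meant to keep panels from being split across clusters; whether parallel do loop.do, by(lvls) resolves this case on the version in use is not verified here.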

Hope it helps