mcmalone opened this issue 6 years ago
My guess is that your speculation is exactly correct -- with a large number of cores, some chunks do not contain all levels of that factor. You can check by running something like
clusterEvalQ(cls, levels(x$f))
There really is no good solution to that. You could try rerunning distribsplit() with scramble=T, or even reallocating some rows by hand. But in any case, you will be getting large standard errors for that level, since it is too rare to get a good estimate.
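Something along these lines might do it (just a sketch; it assumes the distributed data frame is named nat and the factor is agelvl, as in the calm() call below):

library(partools)
# Re-split nat across the cluster with the row order scrambled, so that rare
# levels of agelvl are more likely to land in every chunk.
distribsplit(cls, "nat", scramble = TRUE)
# Then confirm that every worker sees every value of agelvl:
clusterEvalQ(cls, sort(unique(nat$agelvl)))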
Ok, well I'm thinking 20 cores should be okay with the full sample (nothing special about the number 20, so I can bump it down later -- just want the regressions to run fast enough).
Thanks!
Hello,
There don't seem to be many threads/questions regarding partools, and I couldn't find this issue addressed in the vignette. I am using the partools package to run linear regressions in parallel, using the calm() function.
I'm using 20 cores on a 64GB node.
I receive errors when I run the calm() function, and I've isolated the problem to a single variable: agelvl. In the chunks, agelvl is stored as a character vector because of its named levels, so I wrap it in factor() in the formula.
Here's the code:
lpmvbac2 <- calm(cls, 'vbac ~ factor(agelvl), data=nat[nat$prec==1,]')$tht
Here's the error:

When I run the above code on my local machine (although using 3 cores instead of 20), I can't reproduce the error. This suggests the problem occurs in the chunking, specifically that a given level of agelvl is missing from one or more chunks.
However, here's a summary of agelvl in the unchunked data:
It seems unlikely to me that, after splitting into 20 chunks, any one of those chunks would be missing any of these levels. I even checked each of the 20 chunks individually, and I don't see any levels missing:
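(A check along these lines can be done with something like the following -- a sketch, assuming the chunked data frame is named nat on each worker, as in the code above:)

# Tabulate agelvl separately on every worker; a level absent from the table
# on any one chunk would explain a chunking-related failure.
clusterEvalQ(cls, table(nat$agelvl))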
Interestingly, when I split the data into 3 chunks and use 3 cores on the cluster instead of 20, it runs, just as it does on my local machine. I've also tested with 10 cores (error) and 5 cores (no error).
So, why does this problem occur when using 20 cores but not 3?
Also, in case this helps, all this testing has been done using a 5% sample. I've also done some testing with a 10% sample (with 20 cores I get the error, but with 10, no error). This leads me to conclude that the absolute number of observations in any given level matters -- so 20 cores may work with the full dataset. But why? (Unless my conclusion is wrong.)
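(If it helps narrow this down, the smallest per-chunk count for each level can be pulled out directly -- again only a sketch, assuming nat and agelvl as above:)

# Collect per-chunk counts of agelvl and align them on the union of level names;
# a minimum of 0 for some level means at least one chunk is missing it entirely.
counts <- clusterEvalQ(cls, table(nat$agelvl))
lvls <- sort(unique(unlist(lapply(counts, names))))
mat <- t(sapply(counts, function(ct) {
  v <- as.integer(ct[lvls])
  v[is.na(v)] <- 0L
  v
}))
colnames(mat) <- lvls
apply(mat, 2, min)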
Thank you.