gvegayon / parallel

PARALLEL: Stata module for parallel computing
https://rawgit.com/gvegayon/parallel/master/ado/parallel.html
MIT License

Memory error issues when working with large dataset #41

Closed RobertoLiebscher closed 8 years ago

RobertoLiebscher commented 8 years ago

Hi there,

I am working with Stata SE 13.1 on a large dataset (15 GB) on a 64-bit machine with 32 GB of RAM. When I tried to run parallel over the four cores of my computer, I received an error message:

--------------------------------------------------------------------------------
Parallel Computing with Stata (by GVY)
Clusters   : 4
pll_id     : 2uctinfg15
Running at : C:\Users\wwa594\Documents\Promotion\Projects\Dualholding\Work\Do-Files
Randtype   : datetime
Waiting for the clusters to finish...
cluster 0004 Exited with error -198- while setting memory (view log)...
cluster 0001 Exited with error -198- while setting memory (view log)...
cluster 0002 Exited with error -198- while setting memory (view log)...
cluster 0003 Exited with error -198- while setting memory (view log)...
--------------------------------------------------------------------------------
Enter -parallel printlog #- to checkout logfiles.
--------------------------------------------------------------------------------
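
For reference, the per-cluster logs the output points to can be inspected from the parent instance, and the memory limits that batch-mode child instances will inherit can be checked there as well. A minimal sketch of that inspection (the 8g value is only an illustrative assumption, not a confirmed fix for error 198):

// print the log of the first child cluster (the command suggested in the output above)
parallel printlog 1

// show the memory settings that batch-mode child instances would inherit
query memory

// optionally raise the permanent limit so future (child) sessions pick it up
// (8g is an arbitrary illustrative value, not a confirmed fix)
set max_memory 8g, permanently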

The code I tried looks like this:

encode manager, gen(managerdum) // Convert string into numeric variable

capture program drop myloop
program define myloop

// walk through each lead and find the amount with this lead if the lead appears in one of the lead* variables
forvalues i = 1/12 {
    gen leadamt`i' = .
    levelsof lead`i', local(leads)  
    foreach l of local leads {
        bysort managerdum obsq: egen hlpvar = total(principalUSD) if (lead1 == "`l'" | lead2 == "`l'" | lead3 == "`l'" | lead4 == "`l'" | lead5 == "`l'" | ///
         lead6 == "`l'" | lead7 == "`l'" | lead8 == "`l'" | lead9 == "`l'" | lead10 == "`l'" | lead11 == "`l'" | lead12 == "`l'") & obsdate == mindate
        bysort managerdum obsq: egen totamt = min(hlpvar)
        replace leadamt`i' = totamt if lead`i' == "`l'" & leadamt`i' == .
        drop totamt hlpvar
    }
}
end

capture parallel clean
cd "C:\Users\wwa594\Documents\"
parallel setclusters 4
sort managerdum
parallel, by(managerdum) programs(myloop): myloop

I discussed this issue on Statalist before, but the problem remains unsolved: http://www.statalist.org/forums/forum/general-stata-discussion/general/1352049-parallel-computing-with-stata-se-13-1-and-parallel-package-error-198-while-setting-memory

I am unable to replicate the error message with a publicly available dataset, but expanding the auto dataset and trying a simple task on it indicates that the size of the dataset might be the cause of the problem:

. sysuse auto, clear
(1978 Automobile Data)

. tab rep78, miss

     Repair |
Record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.70        2.70
          2 |          8       10.81       13.51
          3 |         30       40.54       54.05
          4 |         18       24.32       78.38
          5 |         11       14.86       93.24
          . |          5        6.76      100.00
------------+-----------------------------------
      Total |         74      100.00

. expand 2000000
(147999926 observations created)

.
. capture program drop myloop

. program define myloop
  1.   levelsof rep78, local(reps)
  2.         foreach r of local reps {
  3.                         sum mpg if rep78==`r'    
  4.                         gen newvar`r' = r(sum)
  5.   }
  6. end

.
. cd "C:\Users\wwa594\Documents"
C:\Users\wwa594\Documents

. sort rep78

. parallel setclusters 4
N Clusters: 4
Stata dir:  C:\Program Files (x86)\Stata13/StataSE-64.exe

. parallel, by(rep78) programs(myloop): myloop
--------------------------------------------------------------------------------
Exporting the following program(s): myloop

myloop:
  1.   levelsof rep78, local(reps)
  2.         foreach r of local reps {
  3.                         sum mpg if rep78==`r'
  4.                         gen newvar`r' = r(sum)
  5.   }
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Parallel Computing with Stata (by GVY)
Clusters   : 4
pll_id     : 1mdx8agr10
Running at : C:\Users\wwa594\Documents
Randtype   : datetime
Waiting for the clusters to finish...
  0
cluster 0001 has exited without error...
  0
cluster 0004 has exited without error...
  -3621
cluster 0003 has exited without error...
  -3621
cluster 0002 has exited without error...
--------------------------------------------------------------------------------
Enter -parallel printlog #- to checkout logfiles.
--------------------------------------------------------------------------------

. parallel clean

. tab rep78, miss

     Repair |
Record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |  4,000,000        4.55        4.55
          2 | 16,000,000       18.18       22.73
          4 | 36,000,000       40.91       63.64
          5 | 22,000,000       25.00       88.64
          . | 10,000,000       11.36      100.00
------------+-----------------------------------
      Total | 88,000,000      100.00

rep78 now has only four categories instead of the five in the original dataset, and the resulting dataset is 60 million observations short: 74 × 2 million = 148 million expected, versus 88 million returned. The shortfall corresponds exactly to the dropped category 3 (30 × 2 million = 60 million).

Do you have an idea what might have gone wrong here?

gvegayon commented 8 years ago

Were you able to solve this?

RobertoLiebscher commented 8 years ago

Dear George,

Thanks for asking and many thanks for uploading this nice package.

I still get an error message with my 15 GB sample, though after downsizing it to 3 GB it works fine. Here is the Statalist thread on this: http://www.statalist.org/forums/forum/general-stata-discussion/general/1352049-parallel-computing-with-stata-se-13-1-and-parallel-package-error-198-while-setting-memory

Best regards, Roberto


gvegayon commented 8 years ago

It is odd. Maybe there's something happening with Mata underneath when you use big data... although, in my experience, that hasn't been a problem. Perhaps it is the fact that you are using Windows? Again, on Linux machines this hasn't been a problem, as I've used parallel with datasets of around ~20 GB or more (if I recall correctly). Anyway, I'm glad you worked it out, and thanks for the question!


RobertoLiebscher commented 8 years ago

Dear George,

It seems to me that these problems arise if the dataset exceeds a certain size. For example, the following code expands the dataset to a size of roughly 12 GB. This results in the discussed memory error, although my computer has 32 GB of memory. If I write expand 1000000 instead, the error does not occur.

Best regards, Roberto

expand 2000000
encode manager, gen(managerdum)

// Find principal amount with same lead
capture program drop myloop
program define myloop
forvalues i = 1/12 {
    gen leadamt`i' = .
    levelsof lead`i', local(leads)
    foreach l of local leads {
        bysort managerdum obsq: egen hlpvar = total(principalUSD) if (lead1 == "`l'" | lead2 == "`l'" | lead3 == "`l'" | lead4 == "`l'" | lead5 == "`l'" | ///
         lead6 == "`l'" | lead7 == "`l'" | lead8 == "`l'" | lead9 == "`l'" | lead10 == "`l'" | lead11 == "`l'" | lead12 == "`l'") & obsdate == mindate
        bysort managerdum obsq: egen totamt = min(hlpvar)
        replace leadamt`i' = totamt if lead`i' == "`l'" & leadamt`i' == .
        drop totamt hlpvar
    }
}
end

capture parallel clean
cd "C:\Users\wwa594\Documents\"
parallel setclusters 8
sort managerdum
parallel, by(managerdum) programs(myloop): myloop
parallel clean


gvegayon commented 8 years ago

This output catches my attention:

Parallel Computing with Stata (by GVY)
Clusters   : 4
pll_id     : 1mdx8agr10
Running at : C:\Users\wwa594\Documents
Randtype   : datetime
Waiting for the clusters to finish...
  0
cluster 0001 has exited without error...
  0
cluster 0004 has exited without error...
  -3621
cluster 0003 has exited without error...
  -3621
cluster 0002 has exited without error...

Ping to @bquistorff: might this have something to do with the new implementation of the parallel_run command? Since it is Windows, it might be the case that it closes the sessions before saving the file (see https://github.com/gvegayon/parallel/blob/c6bb22fbcad84f5901dc6d1b904da86f3a017af5/ado/parallel_write_do.mata#L275-L277), losing the data.

bquistorff commented 8 years ago

I'll check into this. Might be a work or

bquistorff commented 8 years ago

Using StataSE-64 v13, I was not able to reproduce the bug from your posted setups (with the auto dataset or the dataex-type setup); I had less RAM, so I could only expand to a tenth of the size. We have fixed some bugs since then, so you can try the latest code.

With your real data it seems like the bug happens reliably once the data is too large. To me the likeliest explanation is a lack of RAM, in which case an alternative approach is needed. With parallel, the parent instance of Stata loads the whole dataset into memory, and the child processes together hold the same amount again. The child processes also process the whole dataset, so they need additional working memory (though I'm not sure how much). So even though you have more RAM than the size of your dataset, you can quickly run out of RAM in practice. I think you should consider processing your data in chunks serially when there is an issue doing it in parallel; a rough sketch is below.
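
A minimal sketch of what serial chunked processing could look like for the myloop example above. The data file name, the block size of 50 managerdum groups, and the tempfile bookkeeping are illustrative assumptions, not a tested recipe:

// hedged sketch: run myloop serially over blocks of managerdum groups,
// keeping only one chunk in memory at a time (assumed helper code, not part of -parallel-)
use "fulldata.dta", clear                 // assumed file name
egen mgrgroup = group(managerdum)         // consecutive group ids 1..G
summarize mgrgroup, meanonly
local G = r(max)
local blocksize = 50                      // illustrative block size
tempfile results
local first = 1
forvalues lo = 1(`blocksize')`G' {
    local hi = min(`lo' + `blocksize' - 1, `G')
    use "fulldata.dta", clear
    egen mgrgroup = group(managerdum)
    keep if inrange(mgrgroup, `lo', `hi') // only this chunk stays in memory
    myloop                                // the program defined earlier in the thread
    if `first' {
        save `results'
        local first = 0
    }
    else {
        append using `results'
        save `results', replace
    }
}
use `results', clear                      // combined results

Whether this ends up faster or slower than four parallel child instances depends on how much of the slowdown came from swapping, but it caps peak memory at roughly one chunk plus overhead.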

I'll mark this as closed for now, but feel free to re-open if we can find a way to reproduce the bug.