gvegayon / parallel

PARALLEL: Stata module for parallel computing
https://rawgit.com/gvegayon/parallel/master/ado/parallel.html
MIT License

Memory error issues when working with large dataset #41

Closed RobertoLiebscher closed 8 years ago

RobertoLiebscher commented 8 years ago

Hi there,

I am working with Stata SE 13.1 on a large dataset (15 GB) on a 64-bit machine with 32 GB of RAM. When I tried to run parallel over the four cores of my computer, I received an error message:

--------------------------------------------------------------------------------
Parallel Computing with Stata (by GVY)
Clusters   : 4
pll_id     : 2uctinfg15
Running at : C:\Users\wwa594\Documents\Promotion\Projects\Dualholding\Work\Do-Files
Randtype   : datetime
Waiting for the clusters to finish...
cluster 0004 Exited with error -198- while setting memory (view log)...
cluster 0001 Exited with error -198- while setting memory (view log)...
cluster 0002 Exited with error -198- while setting memory (view log)...
cluster 0003 Exited with error -198- while setting memory (view log)...
--------------------------------------------------------------------------------
Enter -parallel printlog #- to checkout logfiles.
--------------------------------------------------------------------------------
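
For reference, the per-cluster logs the output points to can be inspected from the parent instance, and the memory limits that batch-mode child instances will inherit can be checked there as well. A minimal sketch of that inspection (the 8g value is only an illustrative assumption, not a confirmed fix for error 198):

// print the log of the first child cluster (the command suggested in the output above)
parallel printlog 1

// show the memory settings that batch-mode child instances would inherit
query memory

// optionally raise the permanent limit so future (child) sessions pick it up
// (8g is an arbitrary illustrative value, not a confirmed fix)
set max_memory 8g, permanently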

The code I tried looks like this:

encode manager, gen(managerdum) // Convert string into numeric variable

capture program drop myloop
program define myloop

// walk through each lead and find the amount with this lead if the lead appears in one of the lead* variables
forvalues i = 1/12 {
    gen leadamt`i' = .
    levelsof lead`i', local(leads)  
    foreach l of local leads {
        bysort managerdum obsq: egen hlpvar = total(principalUSD) if (lead1 == "`l'" | lead2 == "`l'" | lead3 == "`l'" | lead4 == "`l'" | lead5 == "`l'" | ///
         lead6 == "`l'" | lead7 == "`l'" | lead8 == "`l'" | lead9 == "`l'" | lead10 == "`l'" | lead11 == "`l'" | lead12 == "`l'") & obsdate == mindate
        bysort managerdum obsq: egen totamt = min(hlpvar)
        replace leadamt`i' = totamt if lead`i' == "`l'" & leadamt`i' == .
        drop totamt hlpvar
    }
}
end

capture parallel clean
cd "C:\Users\wwa594\Documents\"
parallel setclusters 4
sort managerdum
parallel, by(managerdum) programs(myloop): myloop

I discussed this issue on Statalist before, but the problem remains unsolved: http://www.statalist.org/forums/forum/general-stata-discussion/general/1352049-parallel-computing-with-stata-se-13-1-and-parallel-package-error-198-while-setting-memory

I am unable to replicate the error message with a publicly available dataset, but expanding the auto dataset and trying a simple task on it indicates that the size of the dataset might be the cause of the problem:

. sysuse auto, clear
(1978 Automobile Data)

. tab rep78, miss

     Repair |
Record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.70        2.70
          2 |          8       10.81       13.51
          3 |         30       40.54       54.05
          4 |         18       24.32       78.38
          5 |         11       14.86       93.24
          . |          5        6.76      100.00
------------+-----------------------------------
      Total |         74      100.00

. expand 2000000
(147999926 observations created)

.
. capture program drop myloop

. program define myloop
  1.   levelsof rep78, local(reps)
  2.         foreach r of local reps {
  3.                         sum mpg if rep78==`r'    
  4.                         gen newvar`r' = r(sum)
  5.   }
  6. end

.
. cd "C:\Users\wwa594\Documents"
C:\Users\wwa594\Documents

. sort rep78

. parallel setclusters 4
N Clusters: 4
Stata dir:  C:\Program Files (x86)\Stata13/StataSE-64.exe

. parallel, by(rep78) programs(myloop): myloop
--------------------------------------------------------------------------------
Exporting the following program(s): myloop

myloop:
  1.   levelsof rep78, local(reps)
  2.         foreach r of local reps {
  3.                         sum mpg if rep78==`r'
  4.                         gen newvar`r' = r(sum)
  5.   }
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Parallel Computing with Stata (by GVY)
Clusters   : 4
pll_id     : 1mdx8agr10
Running at : C:\Users\wwa594\Documents
Randtype   : datetime
Waiting for the clusters to finish...
  0
cluster 0001 has exited without error...
  0
cluster 0004 has exited without error...
  -3621
cluster 0003 has exited without error...
  -3621
cluster 0002 has exited without error...
--------------------------------------------------------------------------------
Enter -parallel printlog #- to checkout logfiles.
--------------------------------------------------------------------------------

. parallel clean

. tab rep78, miss

     Repair |
Record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |  4,000,000        4.55        4.55
          2 | 16,000,000       18.18       22.73
          4 | 36,000,000       40.91       63.64
          5 | 22,000,000       25.00       88.64
          . | 10,000,000       11.36      100.00
------------+-----------------------------------
      Total | 88,000,000      100.00

rep78 now has only four categories instead of the five in the original dataset, and the resulting dataset is 60 million observations short: 74 × 2 million = 148 million expected, versus 88 million returned. The shortfall corresponds exactly to the dropped category 3 (30 × 2 million = 60 million).

Do you have an idea what might have gone wrong here?

gvegayon commented 8 years ago

Were you able to solve this?

RobertoLiebscher commented 8 years ago

Dear George,

Thanks for asking and many thanks for uploading this nice package.

I still get an error message with my 15 GB sample, though after downsizing it to 3 GB it works fine. Here is the Statalist thread on this: http://www.statalist.org/forums/forum/general-stata-discussion/general/1352049-parallel-computing-with-stata-se-13-1-and-parallel-package-error-198-while-setting-memory

Best regards, Roberto


gvegayon commented 8 years ago

It is odd. Maybe there's something happening with Mata underneath when you use big data... although, in my experience, that hasn't been a problem. Perhaps it is the fact that you are using Windows? Again, on Linux machines this hasn't been a problem, as I've used parallel with datasets of around ~20 GB or more (if I recall correctly). Anyway, I'm glad you worked it out, and thanks for the question!


RobertoLiebscher commented 8 years ago

Dear George,

It seems to me that these problems arise if the dataset exceeds a certain size. For example, the following code expands the dataset to a size of roughly 12 GB. This results in the discussed memory error, although my computer has 32 GB of memory. If I write expand 1000000 instead, the error does not occur.

Best regards, Roberto

expand 2000000
encode manager, gen(managerdum)

// Find principal amount with same lead
capture program drop myloop
program define myloop
forvalues i = 1/12 {
    gen leadamt`i' = .
    levelsof lead`i', local(leads)
    foreach l of local leads {
        bysort managerdum obsq: egen hlpvar = total(principalUSD) if (lead1 == "`l'" | lead2 == "`l'" | lead3 == "`l'" | lead4 == "`l'" | lead5 == "`l'" | ///
         lead6 == "`l'" | lead7 == "`l'" | lead8 == "`l'" | lead9 == "`l'" | lead10 == "`l'" | lead11 == "`l'" | lead12 == "`l'") & obsdate == mindate
        bysort managerdum obsq: egen totamt = min(hlpvar)
        replace leadamt`i' = totamt if lead`i' == "`l'" & leadamt`i' == .
        drop totamt hlpvar
    }
}
end

capture parallel clean
cd "C:\Users\wwa594\Documents\"
parallel setclusters 8
sort managerdum
parallel, by(managerdum) programs(myloop): myloop
parallel clean


gvegayon commented 8 years ago

This output catches my attention:

Parallel Computing with Stata (by GVY)
Clusters   : 4
pll_id     : 1mdx8agr10
Running at : C:\Users\wwa594\Documents
Randtype   : datetime
Waiting for the clusters to finish...
  0
cluster 0001 has exited without error...
  0
cluster 0004 has exited without error...
  -3621
cluster 0003 has exited without error...
  -3621
cluster 0002 has exited without error...

Ping to @bquistorff: might this have something to do with the new implementation of the parallel_run command? Since it is Windows, it might be the case that it closes the sessions before saving the file (see https://github.com/gvegayon/parallel/blob/c6bb22fbcad84f5901dc6d1b904da86f3a017af5/ado/parallel_write_do.mata#L275-L277), losing the data.

bquistorff commented 8 years ago

I'll check into this. Might be a work or

bquistorff commented 8 years ago

Using StataSE-64 v13, I was not able to reproduce the bug from your posted setups (with the auto dataset or the dataex-type setup); I had less RAM, so I could only expand to a tenth of the size. We have fixed some bugs since then, so you can try the latest code.

With your real data it seems like the bug happens reliably once the data is too large. To me the likeliest explanation is a lack of RAM, in which case an alternative approach is needed. With parallel, the parent instance of Stata loads the whole dataset into memory, and the child processes together hold the same amount again. The child processes also process the whole dataset, so they need additional working memory (though I'm not sure how much). So even though you have more RAM than the size of your dataset, you can quickly run out of RAM in practice. I think you should consider processing your data in chunks serially when there is an issue doing it in parallel; a rough sketch is below.
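
A minimal sketch of what serial chunked processing could look like for the myloop example above. The data file name, the block size of 50 managerdum groups, and the tempfile bookkeeping are illustrative assumptions, not a tested recipe:

// hedged sketch: run myloop serially over blocks of managerdum groups,
// keeping only one chunk in memory at a time (assumed helper code, not part of -parallel-)
use "fulldata.dta", clear                 // assumed file name
egen mgrgroup = group(managerdum)         // consecutive group ids 1..G
summarize mgrgroup, meanonly
local G = r(max)
local blocksize = 50                      // illustrative block size
tempfile results
local first = 1
forvalues lo = 1(`blocksize')`G' {
    local hi = min(`lo' + `blocksize' - 1, `G')
    use "fulldata.dta", clear
    egen mgrgroup = group(managerdum)
    keep if inrange(mgrgroup, `lo', `hi') // only this chunk stays in memory
    myloop                                // the program defined earlier in the thread
    if `first' {
        save `results'
        local first = 0
    }
    else {
        append using `results'
        save `results', replace
    }
}
use `results', clear                      // combined results

Whether this ends up faster or slower than four parallel child instances depends on how much of the slowdown came from swapping, but it caps peak memory at roughly one chunk plus overhead.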

I'll mark this as closed for now, but feel free to re-open if we can find a way to reproduce the bug.