h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0
6.92k stars 2k forks source link

Intermittent: RollupStats Distributed Exception - Missing Chunk #14709

Closed exalate-issue-sync[bot] closed 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Parsing MNIST test in GZIP on 3 nodes in runit_deeplearning_stacked_autoencoder_large.R

On driver node:

07-22 11:17:47.537 172.17.2.164:45000 12693 FJ-0-59 INFO: Parse result for test.hex_2 (10000 rows): onExCompletion for water.fvec.RollupStats$1@32807f5 water.DException$DistributedException: from /172.17.2.164:45000; by class water.fvec.RollupStats$ComputeRollupsTask; class water.DException$DistributedException: from /172.17.2.164:45002; by class water.fvec.RollupStats$Roll; class water.DException$DistributedException: from /172.17.2.164:45004; by class water.fvec.RollupStats$Roll; class java.lang.AssertionError: Missing chunk $05ffba01000014000000$nfs://home2/0xdiag/bigdata/laptop/mnist/test.csv.gz at water.fvec.Vec.chunkIdx(Vec.java:757) at water.fvec.Vec.chunkForChunkIdx(Vec.java:820) at water.MRTask.compute2(MRTask.java:624) at water.MRTask.compute2(MRTask.java:599) at water.MRTask.compute2(MRTask.java:599) at water.MRTask.compute2(MRTask.java:599) at water.H2O$H2OCountedCompleter.compute(H2O.java:947) at jsr166y.CountedCompleter.exec(CountedCompleter.java:429) at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263) at jsr166y.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:914) at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:979) at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477) at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

On remote:

07-22 11:17:47.972 172.17.2.164:45004 12695 FJ-2-5 ERRR: Error: Missing chunk 20 for $04ffba010000ffffffff$nfs://home2/0xdiag/bigdata/laptop/mnist/test.csv.gz java.lang.Error at water.fvec.Vec.checkMissing(Vec.java:763) at water.fvec.Vec.chunkIdx(Vec.java:757) at water.fvec.Vec.chunkForChunkIdx(Vec.java:820) at water.MRTask.compute2(MRTask.java:624) at water.H2O$H2OCountedCompleter.compute(H2O.java:947) at jsr166y.CountedCompleter.exec(CountedCompleter.java:429) at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263) at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974) at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477) at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

exalate-issue-sync[bot] commented 1 year ago

Arno Candel commented: Main parse creates 1 output chunk, but the remote node asks for chunks >> 1. Looks like a race condition and mismatched _espc.

exalate-issue-sync[bot] commented 1 year ago

Arno Candel commented: Instrumented arno_jenkins with lots of debug info.

exalate-issue-sync[bot] commented 1 year ago

Arno Candel commented: http://mr-0xa1:8080/view/arno/job/arno_runit_medium_large/

exalate-issue-sync[bot] commented 1 year ago

Arno Candel commented: http://mr-0xa1:8080/view/arno/job/arno_runit_medium_large/178/artifact/h2o-r/tests/results/

07-22 17:17:17.213 172.17.2.164:45004 676 main INFO: ----- H2O started ----- 07-22 17:17:17.269 172.17.2.164:45004 676 main INFO: Build git branch: (no branch) 07-22 17:17:17.270 172.17.2.164:45004 676 main INFO: Build git hash: 60b09502e0763a9278284dbe50eeb75fe14d9935 07-22 17:17:17.270 172.17.2.164:45004 676 main INFO: Build git describe: jenkins-master-3088-70-g60b0950 07-22 17:17:17.270 172.17.2.164:45004 676 main INFO: Build project version: 3.1.0.99999 07-22 17:17:17.270 172.17.2.164:45004 676 main INFO: Built by: 'jenkins' 07-22 17:17:17.270 172.17.2.164:45004 676 main INFO: Built on: '2015-07-22 17:13:54' 07-22 17:17:17.270 172.17.2.164:45004 676 main INFO: Java availableProcessors: 32 07-22 17:17:17.271 172.17.2.164:45004 676 main INFO: Java heap totalMemory: 1.92 GB 07-22 17:17:17.271 172.17.2.164:45004 676 main INFO: Java heap maxMemory: 13.33 GB 07-22 17:17:17.271 172.17.2.164:45004 676 main INFO: Java version: Java 1.7.0_80 (from Oracle Corporation) 07-22 17:17:17.271 172.17.2.164:45004 676 main INFO: JVM launch parameters: [-Xmx15g, -ea] 07-22 17:17:17.271 172.17.2.164:45004 676 main INFO: OS version: Linux 3.13.0-55-generic (amd64) 07-22 17:17:17.271 172.17.2.164:45004 676 main INFO: Machine physical memory: 251.89 GB 07-22 17:17:17.272 172.17.2.164:45004 676 main INFO: X-h2o-cluster-id: 1437610636313 07-22 17:17:17.272 172.17.2.164:45004 676 main INFO: User name: 'jenkins' 07-22 17:17:17.272 172.17.2.164:45004 676 main INFO: Opted out of sending usage metrics. 07-22 17:17:17.272 172.17.2.164:45004 676 main INFO: Possible IP Address: eth0 (eth0), fe80:0:0:0:225:90ff:fec7:2604%2 07-22 17:17:17.272 172.17.2.164:45004 676 main INFO: Possible IP Address: eth0 (eth0), 172.17.2.164 07-22 17:17:17.272 172.17.2.164:45004 676 main INFO: Possible IP Address: lo (lo), 0:0:0:0:0:0:0:1%1 07-22 17:17:17.272 172.17.2.164:45004 676 main INFO: Possible IP Address: lo (lo), 127.0.0.1 07-22 17:17:17.273 172.17.2.164:45004 676 main INFO: Internal communication uses port: 45005 07-22 17:17:17.273 172.17.2.164:45004 676 main INFO: Listening for HTTP and REST traffic on http://172.17.2.164:45004/ 07-22 17:17:17.273 172.17.2.164:45004 676 main INFO: H2O cloud name: 'H2O_runit_jenkins_4457519' on /172.17.2.164:45004, discovery address /230.33.176.207:58913 07-22 17:17:17.274 172.17.2.164:45004 676 main INFO: If you have trouble connecting, try SSH tunneling from your local machine (e.g., via port 55555): 07-22 17:17:17.274 172.17.2.164:45004 676 main INFO: 1. Open a terminal and run 'ssh -L 55555:localhost:45004 jenkins@172.17.2.164' 07-22 17:17:17.274 172.17.2.164:45004 676 main INFO: 2. Point your browser to http://localhost:55555 07-22 17:17:17.558 172.17.2.164:45004 676 main INFO: Log dir: '/tmp/h2o-jenkins/h2ologs' 07-22 17:17:17.558 172.17.2.164:45004 676 main INFO: Cur dir: '/home4/jenkins/slave_dir_from_mr-0xa1/workspace/arno_runit_medium_large/h2o-r/tests/results' 07-22 17:17:17.586 172.17.2.164:45004 676 main INFO: HDFS subsystem successfully initialized 07-22 17:17:17.587 172.17.2.164:45004 676 main INFO: S3 subsystem successfully initialized 07-22 17:17:17.587 172.17.2.164:45004 676 main INFO: Flow dir: '/home4/jenkins/h2oflows' 07-22 17:17:17.601 172.17.2.164:45004 676 main INFO: Cloud of size 1 formed [/172.17.2.164:45004] 07-22 17:17:17.602 172.17.2.164:45004 676 main INFO: Registered 0 extensions in: 647mS 07-22 17:17:18.018 172.17.2.164:45004 676 main INFO: Registered REST APIs in: 416mS 07-22 17:17:18.484 172.17.2.164:45004 676 main INFO: Registered: 170 schemas in: 466mS 07-22 17:17:20.634 172.17.2.164:45004 676 FJ-126-15 INFO: Cloud of size 3 formed [/172.17.2.164:45000, /172.17.2.164:45002, /172.17.2.164:45004] 07-22 17:17:42.291 172.17.2.164:45004 676 FJ-123-15 INFO: Locking cloud to new members, because water.api.RemoveAllHandler$1 07-22 17:17:44.501 172.17.2.164:45004 676 FJ-1-15 INFO: MFPT map nfs://home2/0xdiag/bigdata/laptop/milsongs/milsongs-train.csv.gz 07-22 17:18:06.324 172.17.2.164:45004 676 FJ-1-13 INFO: MFPT map nfs://home2/0xdiag/bigdata/laptop/mnist/train.csv.gz 07-22 17:18:41.404 172.17.2.164:45004 676 FJ-1-5 INFO: MFPT map nfs://home2/0xdiag/bigdata/laptop/citibike-nyc/2013-12.csv 07-22 17:18:41.404 172.17.2.164:45004 676 FJ-1-75 INFO: MFPT map nfs://home2/0xdiag/bigdata/laptop/citibike-nyc/2014-06.csv 07-22 17:18:41.404 172.17.2.164:45004 676 FJ-1-133 INFO: MFPT map nfs://home2/0xdiag/bigdata/laptop/citibike-nyc/2013-09.csv 07-22 17:19:00.218 172.17.2.164:45004 676 FJ-1-111 INFO: MFPT map nfs://home2/0xdiag/bigdata/laptop/mnist/train.csv.gz 07-22 17:20:44.587 172.17.2.164:45004 676 FJ-1-211 INFO: MFPT map nfs://home2/0xdiag/smalldata/prostate/prostate.csv 07-22 17:21:11.805 172.17.2.164:45004 676 FJ-1-225 INFO: MFPT map /home2/0xdiag/smalldata/arcene/shuffle_test_version/arcene.csv_1 07-22 17:21:44.754 172.17.2.164:45004 676 FJ-1-191 INFO: MFPT map /home2/0xdiag/smalldata/prostate/prostate.csv.zip_1 07-22 17:22:05.617 172.17.2.164:45004 676 FJ-1-221 INFO: MFPT map /home2/0xdiag/bigdata/laptop/mnist/test.csv.gz_3 07-22 17:22:45.327 172.17.2.164:45004 676 FJ-1-185 INFO: MFPT map nfs://home2/0xdiag/bigdata/laptop/mnist/train.csv.gz java.lang.Error at water.fvec.Vec.checkMissing(Vec.java:763) at water.fvec.Vec.chunkIdx(Vec.java:757) at water.fvec.Vec.chunkForChunkIdx(Vec.java:821) at water.MRTask.compute2(MRTask.java:627) at water.MRTask.compute2(MRTask.java:602) at water.MRTask.compute2(MRTask.java:602) at water.MRTask.compute2(MRTask.java:602) at water.MRTask.compute2(MRTask.java:602) at water.H2O$H2OCountedCompleter.compute(H2O.java:947) at jsr166y.CountedCompleter.exec(CountedCompleter.java:429) at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263) at jsr166y.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:914) at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:979) at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477) at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104) java.lang.Error at water.fvec.Vec.checkMissing(Vec.java:763) at water.fvec.Vec.chunkIdx(Vec.java:757) 07-22 17:22:55.913 172.17.2.164:45004 676 FJ-2-61 ERRR: Error: Missing chunk 130 for $04ff8c000000ffffffff$nfs://home2/0xdiag/bigdata/laptop/mnist/test.csv.gz

at water.MRTask.compute2(MRTask.java:627)
at water.H2O$H2OCountedCompleter.compute(H2O.java:947)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:429)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

07-22 17:22:55.913 172.17.2.164:45004 676 FJ-2-51 ERRR: Error: Missing chunk 183 for $04ff8c000000ffffffff$nfs://home2/0xdiag/bigdata/laptop/mnist/test.csv.gz at water.fvec.Vec.checkMissing(Vec.java:763) at water.fvec.Vec.chunkIdx(Vec.java:757) at water.fvec.Vec.chunkForChunkIdx(Vec.java:821) at water.MRTask.compute2(MRTask.java:627) at water.MRTask.compute2(MRTask.java:602) at water.MRTask.compute2(MRTask.java:602) at water.H2O$H2OCountedCompleter.compute(H2O.java:947) at jsr166y.CountedCompleter.exec(CountedCompleter.java:429) at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263) at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974) at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477) at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104) 07-22 17:22:55.914 172.17.2.164:45004 676 FJ-2-19 ERRR: Error: Missing chunk 185 for $04ff8c000000ffffffff$nfs://home2/0xdiag/bigdata/laptop/mnist/test.csv.gz 07-22 17:22:55.914 172.17.2.164:45004 676 FJ-2-61 ERRR: espc: [0, 287, 574, 861, 1183, 1470, 1756, 2042, 2328, 2650, 2938, 3225, 3547, 3834, 4121, 4407, 4729, 5015, 5303, 5627, 5913, 6199, 6485, 6808, 7095, 7381, 7669, 7991, 8278, 8565, 8887, 9175, 9461, 10000, 10288, 10574, 10860, 11184, 11470, 11757, 12045, 12368, 12656, 12942, 13265, 13552, 13838, 14125, 14411, 14734, 15021, 15308, 15596, 15882, 16207, 16494, 16781, 17068, 17390, 17677, 17964, 18252, 18575, 18863, 19150, 19437, 19759, 20048, 20334, 20622, 20945, 21233, 21520, 21807, 22129, 22416, 22703, 22989, 23276, 23600, 23887, 24174, 24497, 24783, 25071, 25357, 25643, 25966, 26252, 26540, 26828, 27150, 27436, 27723, 28010, 28296, 28620, 28907, 29194, 29517, 29804, 30090, 30379, 30667, 30991, 31277, 31565, 31852, 32139, 32462, 32750, 33037, 33325, 33647, 33933, 34221, 34507, 34795, 35117, 35405, 35691, 36015, 36303, 36590, 36876, 37164, 37451, 37773, 38060, 38347, 38634, 38956, 39245, 39533, 39858, 40144, 40431, 40718, 41040, 41328, 41615, 41903, 42226, 42513, 42801, 43124, 43411, 43698, 43986, 44308, 44596, 44883, 45171, 45457, 45780, 46067, 46392, 46679, 46968, 47253, 47540, 47863, 48150, 48437, 48724, 49047, 49335, 49621, 49908, 50196, 50482, 50806, 51092, 51379, 51668, 51991, 52280, 52602, 52891, 53179, 53467, 53789, 54076, 54362, 54650, 54972, 55259, 55545, 55833, 56156, 56443, 56729, 57016, 57339, 57627, 57913, 58200, 58523, 58811, 59098, 59421, 59708, 59995, 60281, 60568, 60890, 61175, 61462, 61749, 62034, 62356, 62644, 62931, 63254, 63541, 63829, 64116, 64438, 64725, 65012, 65300, 65623, 65911, 66197, 66483, 66805, 67093, 67380, 67668, 67991, 68277, 68565, 68853, 69140, 69463, 70000] at water.fvec.Vec.checkMissing(Vec.java:763)

exalate-issue-sync[bot] commented 1 year ago

Arno Candel commented: not clear why the remote _espc is so long (and has 70k rows for a dataset with 10k rows), maybe has to do that it first parsed 60k rows, then 10k rows and the bytes get appended instead of written to a different key?

exalate-issue-sync[bot] commented 1 year ago

Brandon Hill commented: Fixed in commit: 045397200f0024a00371f3b0e25a7c59994f8e5c [0453972]

The cache for TaskGetKey was not clearing items out in a timely fashion. This was causing a race when vector names were being reused by different files (same filename, different size). This fix removes the key from the cache after an Ack is received. Manual tests show that the cache is now correctly empty after work is fully done.

DinukaH2O commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-1745 Assignee: Brandon Hill Reporter: Arno Candel State: Resolved Fix Version: N/A Attachments: N/A Development PRs: N/A