h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Error importing 270k column sparse data file into H2O #6610

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

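{noformat}
# Install dependencies (sparsity is installed from GitHub)
install.packages("text2vec")
library(devtools)
install_github("felixr/sparsity")

library(text2vec)
library(sparsity)
library(h2o)
h2o.init()

# Build a hashed document-term matrix from the first N movie reviews
data("movie_review")
N = 5000
tokens = movie_review$review[1:N] %>% tolower %>% word_tokenizer
it = itoken(tokens, progressbar = TRUE)
dtm = create_dtm(it, hash_vectorizer())

# Write the sparse matrix in SVMLight format
write.svmlight(dtm, labelVector = movie_review$sentiment, file = "dtm.svmlight")

# Importing the file into H2O fails
hf <- h2o.importFile("dtm.svmlight")
{noformat}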

This gives the following error:

{noformat}DistributedException from localhost/127.0.0.1:54321: 'Missing chunk 2 for vector $04ffffff0300ffffffff$nfs://home/ledell/dtm.svmlight; Vec info: is in DKV; home=localhost/127.0.0.1:54321; self=localhost/127.0.0.1:54321', caused by java.lang.IllegalStateException: Missing chunk 2 for vector $04ffffff0300ffffffff$nfs://home/ledell/dtm.svmlight; Vec info: is in DKV; home=localhost/127.0.0.1:54321; self=localhost/127.0.0.1:54321

DistributedException from localhost/127.0.0.1:54321: 'Missing chunk 2 for vector $04ffffff0300ffffffff$nfs://home/ledell/dtm.svmlight; Vec info: is in DKV; home=localhost/127.0.0.1:54321; self=localhost/127.0.0.1:54321', caused by java.lang.IllegalStateException: Missing chunk 2 for vector $04ffffff0300ffffffff$nfs://home/ledell/dtm.svmlight; Vec info: is in DKV; home=localhost/127.0.0.1:54321; self=localhost/127.0.0.1:54321
    at water.MRTask.getResult(MRTask.java:660)
    at water.MRTask.getResult(MRTask.java:670)
    at water.MRTask.doAll(MRTask.java:530)
    at water.MRTask.doAll(MRTask.java:412)
    at water.MRTask.doAll(MRTask.java:397)
    at water.parser.SyntheticColumnGenerator.finalize(SyntheticColumnGenerator.java:18)
    at water.parser.ParseDataset.parseAllKeys(ParseDataset.java:362)
    at water.parser.ParseDataset.access$000(ParseDataset.java:26)
    at water.parser.ParseDataset$ParserFJTask.compute2(ParseDataset.java:203)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1658)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:976)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Caused by: java.lang.IllegalStateException: Missing chunk 2 for vector $04ffffff0300ffffffff$nfs://home/ledell/dtm.svmlight; Vec info: is in DKV; home=localhost/127.0.0.1:54321; self=localhost/127.0.0.1:54321
    at water.fvec.Vec.chunkIdx(Vec.java:1062)
    at water.fvec.Vec.chunkForChunkIdx(Vec.java:1129)
    at water.util.FrameUtils.extractChunks(FrameUtils.java:1284)
    at water.MRTask.compute2(MRTask.java:798)
    at water.H2O$H2OCountedCompleter.compute1(H2O.java:1661)
    at water.parser.SyntheticColumnGenerator$SyntheticColumnGeneratorTask$Icer.compute1(SyntheticColumnGenerator$SyntheticColumnGeneratorTask$Icer.java)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1657)
    ... 5 more

Error: DistributedException from localhost/127.0.0.1:54321: 'Missing chunk 2 for vector $04ffffff0300ffffffff$nfs://home/ledell/dtm.svmlight; Vec info: is in DKV; home=localhost/127.0.0.1:54321; self=localhost/127.0.0.1:54321', caused by java.lang.IllegalStateException: Missing chunk 2 for vector $04ffffff0300ffffffff$nfs://home/ledell/dtm.svmlight; Vec info: is in DKV; home=localhost/127.0.0.1:54321; self=localhost/127.0.0.1:54321{noformat}
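
Since the title refers to a file with roughly 270k columns, it may help to confirm how wide the written file actually is before importing it. The following is a minimal sketch in base R, not a fix for the error: `svmlight_dims` is a hypothetical helper (it is not part of H2O, text2vec, or sparsity), and the two-line sample file is made up for illustration.

```r
# Hypothetical helper (not part of H2O, text2vec, or sparsity): scan an
# SVMLight file and report the row count and the largest feature index.
svmlight_dims <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(trimws(lines))]            # skip blank lines
  max_index <- 0L
  for (ln in lines) {
    ln <- sub("#.*$", "", ln)                      # drop trailing comments
    fields <- strsplit(trimws(ln), "[[:space:]]+")[[1]]
    pairs <- fields[-1]                            # first field is the label
    pairs <- pairs[grepl("^[0-9]+:", pairs)]       # keep index:value pairs only
    if (length(pairs) > 0) {
      idx <- as.integer(sub(":.*$", "", pairs))
      max_index <- max(max_index, idx)
    }
  }
  c(rows = length(lines), max_feature_index = max_index)
}

# Self-contained demonstration on a made-up temporary file.
tmp <- tempfile(fileext = ".svmlight")
writeLines(c("1 1:0.5 7:1", "0 3:2 262144:1"), tmp)
dims <- svmlight_dims(tmp)
print(dims)
```

Running `svmlight_dims("dtm.svmlight")` on the file from the reproduction would show whether the width H2O has to parse matches the column count mentioned in the title.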

h2o-ops commented 1 year ago

JIRA Issue Details

Jira Issue: PUBDEV-8800
Assignee: New H2O Bugs
Reporter: Erin LeDell
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A

wendycwong commented 1 year ago

This could be related to this GitHub issue: https://github.com/h2oai/h2o-3/issues/6527