h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

`h2o.importFolder` should ignore `_SUCCESS` and `.crc` files by default #16409

Open hutch3232 opened 1 month ago

hutch3232 commented 1 month ago

Is your feature request related to a problem? Please describe.

When you export data (e.g., Parquet files) from Spark, the output directory generally contains a `_SUCCESS` file and `.crc` (checksum) files. h2o does not handle these files well. Like Spark, h2o should ignore them when importing a directory. Otherwise this error is thrown:

water.exceptions.H2OIllegalArgumentException
 [1] "water.exceptions.H2OIllegalArgumentException: Column separator mismatch. One file seems to use \",\" and the other uses \"|\"."
 [2] "    water.api.ParseSetupHandler.guessSetup(ParseSetupHandler.java:50)"                                                         
 [3] "    sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)"                                                               
 [4] "    sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)"                                             
 [5] "    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)"                                     
 [6] "    java.lang.reflect.Method.invoke(Method.java:498)"                                                                         
 [7] "    water.api.Handler.handle(Handler.java:60)"                                                                                 
 [8] "    water.api.RequestServer.serve(RequestServer.java:472)"                                                                     
 [9] "    water.api.RequestServer.doGeneric(RequestServer.java:303)"                                                                 
[10] "    water.api.RequestServer.doPost(RequestServer.java:227)"                                                                   
[11] "    javax.servlet.http.HttpServlet.service(HttpServlet.java:707)"                                                             
[12] "    javax.servlet.http.HttpServlet.service(HttpServlet.java:790)"                                                             
[13] "    org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)"                                                   
[14] "    org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:554)"                                               
[15] "    org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)"                                         
[16] "    org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)"                                       
[17] "    org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)"                                         
[18] "    org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:505)"                                                 
[19] "    org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)"                                         
[20] "    org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)"                                         
[21] "    org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)"                                             
[22] "    org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)"                                     
[23] "    org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)"                                           
[24] "    water.webserver.jetty9.Jetty9ServerAdapter$LoginHandler.handle(Jetty9ServerAdapter.java:130)"                             
[25] "    org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)"                                     
[26] "    org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)"                                           
[27] "    org.eclipse.jetty.server.Server.handle(Server.java:516)"                                                                   
[28] "    org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)"                                               
[29] "    org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)"                                                       
[30] "    org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)"                                                         
[31] "    org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)"                                               
[32] "    org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)"
[33] "    org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)"
[34] "    org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)"
[35] "    org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)"                                   
[36] "    org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)"                                 
[37] "    org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)"                                 
[38] "    org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)"                                       
[39] "    org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409)"                 
[40] "    org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)"                                         
[41] "    org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)"                                     
[42] "    java.lang.Thread.run(Thread.java:748)"                                                                                     

Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page,  :

ERROR MESSAGE:

Column separator mismatch. One file seems to use "," and the other uses "|".

In addition: Warning message:
In append(newtodo, x) :
 Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page,  :

ERROR MESSAGE:

Column separator mismatch. One file seems to use "," and the other uses "|".

Describe the solution you'd like

`h2o.importFolder(...)` runs without error even if the directory contains a `_SUCCESS` file or `.crc` files. Note that even `h2o.exportFile` writes `.crc` files by default when `format = "parquet"`.

Describe alternatives you've considered

One option is to use the `pattern` argument to select only the data files, but that's somewhat of a nuisance.
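For anyone trying the `pattern` workaround in the meantime, here is a minimal sketch of how a candidate regular expression could be checked before passing it to h2o. It is plain Python using only `re`, and it does not call h2o. The file names and the `select_files` helper are hypothetical, and it assumes the backend applies the pattern as an ordinary regular expression to each file name, which I have not verified.

```python
import re

# Hypothetical listing of a Spark output directory (names are illustrative only).
files = [
    "_SUCCESS",
    "._SUCCESS.crc",
    "part-00000-abc.snappy.parquet",
    ".part-00000-abc.snappy.parquet.crc",
    "part-00001-abc.snappy.parquet",
    ".part-00001-abc.snappy.parquet.crc",
]

# Candidate expression: keep only names that end in ".parquet".
PATTERN = r".*\.parquet$"


def select_files(names, pattern):
    """Return the names that the regular expression matches in full."""
    regex = re.compile(pattern)
    return [name for name in names if regex.fullmatch(name)]


selected = select_files(files, PATTERN)
print(selected)
```

If the assumption holds, the same expression could be passed from R as `h2o.importFolder(path, pattern = ".*\\.parquet$")`. Whether h2o matches against the full path or just the file name is worth checking, though an expression anchored only at the end should behave the same in both cases.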