h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

ARFF File Import: import fails if header is too large (can happen with large datasets with categoricals) #12771

Closed exalate-issue-sync[bot] closed 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Trying to load this dataset [https://www.openml.org/d/41147] in H2O-3 fails every time. It appears that the header is >4MB due to the enumeration of categoricals in the header, and our parser is limited to reading only the first 4MB (cf. {{ByteVec.getFirstBytes()}}) to guess the file format (and parse the header). In addition, the error is quite hard to track down, as the underlying cause is not printed anywhere in the logs; only the following appears later:
{noformat}
water.exceptions.H2OIllegalArgumentException: Cannot determine file type. for nfs://Users/seb/repos/h2o/openml/albert.arff
    at water.api.ParseSetupHandler.guessSetup(ParseSetupHandler.java:46)
    at sun.reflect.GeneratedMethodAccessor105.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at water.api.Handler.handle(Handler.java:63)
    at water.api.RequestServer.serve(RequestServer.java:451)
    at water.api.RequestServer.doGeneric(RequestServer.java:296)
    at water.api.RequestServer.doPost(RequestServer.java:222)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:755)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
{noformat}

exalate-issue-sync[bot] commented 1 year ago

Sebastien Poirier commented: Root cause:
{code:java}
// ParseSetup.java, L56
bits = ZipUtil.getFirstUnzippedBytesChecked(bv);
{code}
will always unzip only the first 4MB; then
{code:java}
static ParseSetup guessSetup(byte[] bits, byte sep, boolean singleQuotes, String[] columnNames, String[][] naStrings) {
{code}
has access only to those bytes to parse the header.
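
The failure mode described above can be reproduced outside H2O-3 with a minimal sketch. Everything in it is an assumption for illustration: `ArffPrefixProbe`, `buildArff`, `firstBytes`, and `headerEndsWithin` are hypothetical names and not H2O-3 APIs, and a 1 KB limit stands in for the 4MB one. It shows only that once a nominal attribute enumerates enough categories, the `@data` marker falls outside a fixed-size prefix, so a parser that sees only that prefix cannot recognize the end of the header.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, not part of H2O-3: it only illustrates the failure mode.
final class ArffPrefixProbe {

    /** Returns true if some line in the given bytes starts with "@data" (case-insensitive). */
    static boolean headerEndsWithin(byte[] prefix) {
        String text = new String(prefix, StandardCharsets.ISO_8859_1);
        for (String line : text.split("\n")) {
            if (line.trim().toLowerCase(Locale.ROOT).startsWith("@data")) {
                return true;
            }
        }
        return false;
    }

    /** Builds a tiny ARFF file whose single nominal attribute enumerates `levels` categories. */
    static byte[] buildArff(int levels) {
        StringBuilder sb = new StringBuilder("@relation demo\n@attribute c {");
        for (int i = 0; i < levels; i++) {
            if (i > 0) sb.append(',');
            sb.append("level_").append(i);
        }
        sb.append("}\n@data\nlevel_0\n");
        return sb.toString().getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Mimics a parser that is handed only the first `limit` bytes of the file. */
    static byte[] firstBytes(byte[] file, int limit) {
        return Arrays.copyOf(file, Math.min(limit, file.length));
    }
}
```

With 3 categories the whole header fits in the prefix and the marker is found; with 1000 categories the attribute line alone exceeds 1 KB and the marker is never seen, which is analogous to the >4MB header of the dataset in this issue.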

exalate-issue-sync[bot] commented 1 year ago

Sebastien Poirier commented: also cf. [https://0xdata.atlassian.net/browse/PUBDEV-5790] to see if we can find a common fix.

exalate-issue-sync[bot] commented 1 year ago

Sebastien Poirier commented: Fix ready for review. Large zip files are not handled in this fix: I have a local branch for that, but it requires much broader changes. Instead, a detailed error message is returned to the user when parsing fails on the second (or a later) chunk of a zip archive containing an ARFF file.
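
For readers who want a picture of the general idea, here is a sketch of reading an ARFF header line by line until the `@data` marker, with a descriptive error when the header cannot be completed. This is not the code from the linked PRs: `ArffHeaderReader`, `readHeader`, and `maxHeaderChars` are hypothetical names, and the error messages are invented for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not the H2O-3 implementation.
final class ArffHeaderReader {

    /**
     * Collects header lines (skipping blank lines and '%' comments) until a line
     * starting with "@data" is reached. Fails with an explicit message if the
     * header exceeds maxHeaderChars or the input ends before "@data".
     */
    static List<String> readHeader(Reader in, long maxHeaderChars) {
        BufferedReader reader = new BufferedReader(in);
        List<String> header = new ArrayList<>();
        long consumed = 0;
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // The limit is checked after each line is read (+1 for the line terminator).
                consumed += line.length() + 1;
                if (consumed > maxHeaderChars) {
                    throw new IllegalArgumentException(
                        "ARFF header exceeds " + maxHeaderChars + " characters before reaching @data");
                }
                String trimmed = line.trim();
                if (trimmed.toLowerCase(Locale.ROOT).startsWith("@data")) {
                    return header;
                }
                if (!trimmed.isEmpty() && !trimmed.startsWith("%")) {
                    header.add(trimmed);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        throw new IllegalArgumentException(
            "ARFF header is incomplete: reached end of input without finding @data");
    }
}
```

The point of the explicit exceptions is the one raised in this issue: a message naming the header as the problem is easier to act on than a generic "Cannot determine file type."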

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-5919
Assignee: Sebastien Poirier
Reporter: Sebastien Poirier
State: Closed
Fix Version: 3.22.0.1
Attachments: N/A
Development PRs: Available

Linked PRs from JIRA

https://github.com/h2oai/h2o-3/pull/2862
https://github.com/h2oai/h2o-3/pull/2930