h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Parse: Headers not detected for ORC file formats #11600

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

H2O fails to detect the column names of ORC files when parsing.

In Hive, you can describe an ORC-backed table:
{code:sql}
hive> describe orcair;
OK
year string
month string
dayofmonth string
dayofweek string
deptime string
crsdeptime string
arrtime string
crsarrtime string
uniquecarrier string
flightnum string
tailnum string
actualelapsedtime string
crselapsedtime string
airtime string
arrdelay string
depdelay string
origin string
dest string
distance string
taxiin string
taxiout string
cancelled string
cancellationcode string
diverted string
carrierdelay string
weatherdelay string
nasdelay string
securitydelay string
lateaircraftdelay string
Time taken: 0.398 seconds, Fetched: 29 row(s)
{code}

However, if you import the dataset into H2O, the output column names are "_col0", "_col1", "_col2", etc.
{code:r}
library(h2o)
ip = "mr-0xd9"
port = 63313
h2o.init(ip = ip, port = port)
air.hex = h2o.importFile("hdfs://mr-0xd6.0xdata.loc:8020/apps/hive/warehouse/orcair")
names(air.hex)

 [1] "_col0"  "_col1"  "_col2"  "_col3"  "_col4"  "_col5"  "_col6"  "_col7"  "_col8"  "_col9"  "_col10" "_col11" "_col12" "_col13" "_col14" "_col15"
[17] "_col16" "_col17" "_col18" "_col19" "_col20" "_col21" "_col22" "_col23" "_col24" "_col25" "_col26" "_col27" "_col28"
{code}
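Until header detection works for ORC, one possible stopgap is to rename the placeholder columns positionally from the Hive schema. The following is a minimal Python sketch of that idea, not part of H2O: `hive_column_names` and `rename_map` are hypothetical helpers, the `describe` text is an abbreviated sample, and the approach assumes the parsed frame keeps the same column order as the Hive table.

```python
def hive_column_names(describe_output):
    """Collect column names from the text printed by `describe <table>`.

    Assumes each schema line is "<name> <type>"; the "OK" status token,
    the prompt line, and the "Time taken" footer are ignored.
    """
    names = []
    for line in describe_output.splitlines():
        parts = line.split()
        # Tolerate "OK" being fused onto the first schema line.
        if parts and parts[0] == "OK":
            parts = parts[1:]
        if len(parts) == 2:
            names.append(parts[0])
    return names


def rename_map(placeholders, hive_names):
    """Map "_col0", "_col1", ... to the Hive names by position.

    Raises ValueError if the counts differ or a placeholder is not the
    expected "_col<i>", since a positional rename would then be unsafe.
    """
    if len(placeholders) != len(hive_names):
        raise ValueError("column count mismatch")
    mapping = {}
    for i, (old, new) in enumerate(zip(placeholders, hive_names)):
        if old != "_col{}".format(i):
            raise ValueError("unexpected placeholder name: " + old)
        mapping[old] = new
    return mapping


if __name__ == "__main__":
    # Abbreviated sample of the describe output, for illustration only.
    sample = "OK\nyear string\nmonth string\ndayofmonth string"
    print(rename_map(["_col0", "_col1", "_col2"], hive_column_names(sample)))
```

In the R session from the report, the recovered names could then be assigned to the frame (for example via `names(air.hex) <- ...`), provided the column order really does match the Hive schema.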

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-4721
Assignee: Tomas Nykodym
Reporter: Amy Wang
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A