H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
H2O fails to detect the column names of ORC files when parsing.
In Hive, describing the ORC-backed table shows the actual column names:
{code:sql}
hive> describe orcair;
OK
year string
month string
dayofmonth string
dayofweek string
deptime string
crsdeptime string
arrtime string
crsarrtime string
uniquecarrier string
flightnum string
tailnum string
actualelapsedtime string
crselapsedtime string
airtime string
arrdelay string
depdelay string
origin string
dest string
distance string
taxiin string
taxiout string
cancelled string
cancellationcode string
diverted string
carrierdelay string
weatherdelay string
nasdelay string
securitydelay string
lateaircraftdelay string
Time taken: 0.398 seconds, Fetched: 29 row(s)
{code}
However, when the same dataset is imported into H2O, the column names come out as "_col0", "_col1", "_col2", and so on:
{code:r}
library(h2o)
ip = "mr-0xd9"
port = 63313
h2o.init(ip = ip, port = port)
air.hex = h2o.importFile("hdfs://mr-0xd6.0xdata.loc:8020/apps/hive/warehouse/orcair")
names(air.hex)
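
# Possible workaround until the parser picks up the names (a sketch, not a fix).
# Assumption: the ORC columns are stored in the same order as the
# `describe orcair` output above. `hive.names` is a hypothetical helper
# variable, not part of the original report.
hive.names = c("year", "month", "dayofmonth", "dayofweek", "deptime", "crsdeptime",
               "arrtime", "crsarrtime", "uniquecarrier", "flightnum", "tailnum",
               "actualelapsedtime", "crselapsedtime", "airtime", "arrdelay", "depdelay",
               "origin", "dest", "distance", "taxiin", "taxiout", "cancelled",
               "cancellationcode", "diverted", "carrierdelay", "weatherdelay", "nasdelay",
               "securitydelay", "lateaircraftdelay")
# Guard against a column-count mismatch before renaming.
stopifnot(length(hive.names) == ncol(air.hex))
names(air.hex) = hive.names
names(air.hex)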
{code}