h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Avro file read error #12077

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

When running fileImport from Flow and attempting to read an Avro file, the Parse Configuration cell defaults to CSV Parser, SOH Separator, and 3 cols whose names include Obj and aavro.schema. Parsing still fails after setting the Parser type to Avro, Separator to Auto, and Column Headers to Auto.

The error is:

Error evaluating cell Error calling POST /3/Parse with opts ["destination-frame":"X000000_01.hex","... ERROR MESSAGE: given val type is not supported: java.lang.NoSuchMethodError

Flow UI log attached
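
The CSV defaults reported above (SOH separator, a first column named Obj) are consistent with the parser guessing from the raw Avro header: per the Avro specification, an object container file starts with the four bytes `O`, `b`, `j`, `0x01`, and `0x01` is the ASCII SOH character. The following is a minimal sketch for checking this on a file's first bytes. The helper names and the sample bytes are hypothetical and are not part of H2O.

```python
# Sketch only: helper names and the sample header are assumptions for illustration.
AVRO_MAGIC = b"Obj\x01"  # 'O', 'b', 'j', then byte 0x01 (ASCII SOH)


def looks_like_avro(header: bytes) -> bool:
    """Return True if the bytes start with the Avro container magic."""
    return header[:4] == AVRO_MAGIC


def read_header(path: str, size: int = 64) -> bytes:
    """Read the first `size` bytes of a local file in binary mode."""
    with open(path, "rb") as fh:
        return fh.read(size)


def naive_soh_split(header: bytes) -> list:
    """Split on SOH the way a CSV guesser using that separator would."""
    return header.split(b"\x01")


if __name__ == "__main__":
    # Hypothetical start of a container file: magic, then the metadata map
    # (block count, key length, key "avro.schema").
    sample = AVRO_MAGIC + b"\x02\x16avro.schema"
    print(looks_like_avro(sample))
    print(naive_soh_split(sample))
```

If `looks_like_avro(read_header(path))` is true for the file and the Parse Configuration still defaults to CSV, the file itself is likely fine and the problem is in how the parser type is detected or loaded in that deployment.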

Avro file was created with Hive by creating an avro table and then inserting data from a text table as follows:

Create Avro table

hive> create table if not exists allyears2k_avro (
        Year int, Month int, DayofMonth int, DayOfWeek int,
        DepTime int, CRSDepTime int, ArrTime int, CRSArrTime int,
        UniqueCarrier String, FlightNum int, TailNum String,
        ActualElapsedTime int, CRSElapsedTime int, AirTime String,
        ArrDelay int, DepDelay int, Origin String, Dest String,
        Distance int, TaxiIn String, TaxiOut String,
        Cancelled int, CancellationCode String, Diverted int,
        CarrierDelay String, WeatherDelay String, NASDelay String,
        SecurityDelay String, LateAircraftDelay String,
        IsArrDelayed String, IsDepDelayed String
      )
      row format delimited fields terminated by ',' lines terminated by '\n'
      stored as avro
      location "/user/dave/data/data_format_testing/allyears2k_avro";

Insert data from text table into Avro table

hive> insert overwrite table allyears2k_avro select * from allyears2k_txt;

exalate-issue-sync[bot] commented 1 year ago

Dave Finnegan commented: Additional info:

The same avro file is recognized when an importFile is run from a standalone H2O Instance (non-hadoop invocation).

Also, the source for this avro file is the airlines allyears.1987.2013.csv taken from s3. The final two cols have values of 'NO' or 'YES'. However, the avro version of the file contains 'YES', 'NO', and 'NOS'. Only a couple of 'NO' values remain; most were changed to 'NOS'. The avro file was created via hive by importing the csv file, then creating an avro table and inserting the text table content into the avro table.
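
One way to confirm the 'NO' versus 'NOS' discrepancy is to tally the distinct values in the last two columns of the source CSV and compare them with the same tally from the Avro table. The sketch below covers only the CSV side using the standard library. The column names come from the Hive DDL above, the inline sample rows are made up, and reading the Avro side would need a separate reader that is not shown.

```python
# Sketch only: the sample rows are invented; column names follow the Hive DDL.
import csv
import io
from collections import Counter


def count_values(lines, column):
    """Count how often each distinct value appears in one CSV column."""
    reader = csv.DictReader(lines)
    return Counter(row[column] for row in reader)


if __name__ == "__main__":
    sample = io.StringIO(
        "IsArrDelayed,IsDepDelayed\n"
        "YES,NO\n"
        "NO,NO\n"
        "YES,YES\n"
    )
    print(count_values(sample, "IsDepDelayed"))
```

For the real file, an open file handle can be passed in place of the `io.StringIO` sample. Any value other than 'YES' or 'NO' in the CSV tally would point at the source data rather than at the Hive conversion.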

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-5205
Assignee: New H2O Bugs
Reporter: Dave Finnegan
State: Open
Fix Version: N/A
Attachments: Available (Count: 1)
Development PRs: N/A

Attachments From Jira

Attachment Name: avro_parse_error
Attached By: Dave Finnegan
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5205/avro_parse_error