h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

H2O AVRO parser results in corrupted data #12313

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

An AVRO file is parsed using import_file, but the resulting dataframe appears to hold a corrupted form of the data in the file.

I have reduced this to a reasonably minimal example. The attached AVRO file contains a single string field, device, with the values one, two, and three.

When loaded with import_file, the column appears to have values oneee, twoee, three.
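The corrupted values look like the short strings padded with the tail of "three" (one + ee, two + ee). That pattern would be consistent with a decoder that reuses a byte buffer and ignores each value's true length, but this is only a guess and has not been diagnosed anywhere in this report. A minimal sketch of that hypothetical failure mode, where the function name, buffer size, and record order are all assumptions made for illustration:

```python
def read_with_stale_buffer(values, buffer_size=5):
    """Simulate a decoder that reuses one buffer and ignores each value's length.

    Hypothetical illustration only; this is not H2O's actual parser code.
    """
    buf = bytearray(buffer_size)
    decoded = []
    for value in values:
        raw = value.encode("utf-8")
        buf[:len(raw)] = raw                  # overwrite only the leading bytes
        decoded.append(buf.decode("utf-8"))   # bug: decodes the whole buffer
    return decoded


# Assumed record order; the real order in sample.avro is not stated here.
records = ["three", "one", "two", "three"]
result = sorted(set(read_with_stale_buffer(records)))
print(result)
```

With this assumed order, the sketch yields the same three category names that H2O reports.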

Parsing script:

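```python
#!/usr/bin/env python

import h2o

# Start (or connect to) a local H2O cluster.
h2o.init()

# Import the attached Avro file and list the distinct values of the column.
df = h2o.import_file("sample.avro")
print(df["device"].asfactor().categories())
```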

Output is:

```
Parse progress: |█████████████████████████████████████████████████████████| 100%
['oneee', 'three', 'twoee']
```

I have confirmed that the AVRO file yields the expected data when parsed with Python Avro and also with Java avro-tools:

```sh
[mark@localhost avro]$ java -jar avro-tools-1.8.2.jar tojson sample.avro | sort | uniq -c
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
    164 {"device":"one"}
    168 {"device":"three"}
    168 {"device":"two"}
```
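The mismatch between the two parsers can be made explicit with a small comparison. The sketch below uses only the values quoted in this report; `diff_categories` is a hypothetical helper name. In a live session, `observed` would instead come from `df["device"].asfactor().categories()` as in the parsing script.

```python
def diff_categories(expected, observed):
    """Return (missing, unexpected) category names as sorted lists."""
    expected, observed = set(expected), set(observed)
    return sorted(expected - observed), sorted(observed - expected)


# Values reported by avro-tools for sample.avro.
expected = ["one", "three", "two"]
# Category names printed by H2O after import_file.
observed = ["oneee", "three", "twoee"]

missing, unexpected = diff_categories(expected, observed)
print("missing from H2O:", missing)
print("unexpected in H2O:", unexpected)
```

Only "three" survives the import intact; both shorter values come back altered.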

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: Hi Mark, thanks for the report. We will look into it.

Please note that we are planning to deprecate support for Avro in the future. Our users are moving towards columnar file formats (Parquet, ORC).

exalate-issue-sync[bot] commented 1 year ago

Mark Adams commented: Hey, appreciate you taking a look @michalk. Thanks for the roadmap tip. We're just evaluating the technology, so we can use Parquet or ORC.

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: We have not reached a decision on Avro yet, so we are moving this to the next major release for now.

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-5447
Assignee: New H2O Bugs
Reporter: Mark Adams
State: Open
Fix Version: N/A
Attachments: Available (Count: 1)
Development PRs: N/A

Attachments From Jira

Attachment Name: avro
Attached By: Mark Adams
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5447/avro