Open exalate-issue-sync[bot] opened 1 year ago
Michal Kurka commented: Hi [~accountid:557058:c3be338b-750c-490e-94e7-d5327f98629e], thanks for the report. We will look into it.
Please note that we are planning to deprecate support for Avro in the future. Our users are moving towards columnar file formats (Parquet, ORC).
Mark Adams commented: Hey, appreciate you taking a look @michalk. Thanks for the roadmap tip, we're just evaluating the technology, so can use Parquet or ORC.
Michal Kurka commented: We have not reached a decision with Avro yet - moving this to next major release for now.
JIRA Issue Migration Info
Jira Issue: PUBDEV-5447
Assignee: New H2O Bugs
Reporter: Mark Adams
State: Open
Fix Version: N/A
Attachments: Available (Count: 1)
Development PRs: N/A
Attachments From Jira
Attachment Name: avro
Attached By: Mark Adams
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5447/avro
The Avro file is parsed with import_file, but the resulting dataframe appears to hold a corrupted form of the data in the file.
I have turned this into a reasonably minimal example. The attached Avro file contains a single string field, device, with the values one, two, and three.
When loaded with import_file, the column appears to have values oneee, twoee, three.
Parsing script:
{code:python}
#!/usr/bin/env python
import h2o

h2o.init()
df = h2o.import_file("sample.avro")
# Parenthesized print behaves the same on Python 2 and Python 3.
print(df["device"].asfactor().categories())
{code}
Output is:
{code:python}
Parse progress: |█████████████████████████████████████████████████████████| 100%
['oneee', 'three', 'twoee']
{code}
I have confirmed that the AVRO file results in expected data when parsed with Python Avro and also Java avro-tools:
{code:sh}
[mark@localhost avro]$ java -jar avro-tools-1.8.2.jar tojson sample.avro | sort | uniq -c
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
    164 {"device":"one"}
    168 {"device":"three"}
    168 {"device":"two"}
{code}
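One observation that may help triage: the corrupted values look like the short strings followed by the tail of the longest one ("one" + "ee", "two" + "ee", where "ee" is the end of "three"). That pattern would be consistent with a reused byte buffer being read in full rather than up to the length of the current value. This is only a guess at the cause, not something confirmed in the H2O parser. The sketch below is a pure-Python illustration of that hypothetical failure mode; the function name and buffer handling are invented for the example and do not come from H2O or Avro.

```python
def read_into_reused_buffer(values):
    """Hypothetical illustration of a stale-buffer bug (not H2O code).

    Each value is copied into one shared buffer, but the whole buffer is
    decoded afterwards instead of only the bytes belonging to that value.
    """
    # Assumption: the buffer is sized for the longest value and never cleared.
    buf = bytearray(max(len(v) for v in values))
    decoded = []
    for value in values:
        data = value.encode("ascii")
        buf[:len(data)] = data               # overwrites only the first len(data) bytes
        decoded.append(buf.decode("ascii"))  # bug: ignores the length of the current value
    return decoded


def read_with_length(values):
    """Same loop, but decoding only the valid prefix of the buffer."""
    buf = bytearray(max(len(v) for v in values))
    decoded = []
    for value in values:
        data = value.encode("ascii")
        buf[:len(data)] = data
        decoded.append(buf[:len(data)].decode("ascii"))
    return decoded


print(read_into_reused_buffer(["three", "one", "two"]))
print(read_with_length(["three", "one", "two"]))
```

With "three" read first, the buggy variant produces the same oneee and twoee values as in the report, while the length-aware variant returns the original strings.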