prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0

XmlSerde for Hive with PrestoDB #9002

Open leolorenzoluis opened 6 years ago

leolorenzoluis commented 6 years ago

I have a table in Hive that uses the XmlSerde from https://github.com/dvasilen/Hive-XML-SerDe. I've uploaded the required jars to the hive-hadoop2 plugin directory on the Presto coordinator and worker nodes, and from those nodes I am able to read the HDFS file that the table points to using the following command:

hdfs dfs -cat hdfs://namenode:8020/user/hive/warehouse/xml_auctions/in_hdfs.xml

It prints the XML file. However, when I try to query the table with presto-cli, I get the following error:

Query 20170919_022807_00021_hwgdm failed: Error opening Hive split hdfs://namenode:8020/user/hive/warehouse/xml_auctions/in_hdfs.xml (offset=0, length=420) using com.ibm.spss.hive.serde2.xml.XmlInputFormat: null
com.facebook.presto.spi.PrestoException: Error opening Hive split hdfs://namenode:8020/user/hive/warehouse/xml_auctions/in_hdfs.xml (offset=0, length=420) using com.ibm.spss.hive.serde2.xml.XmlInputFormat: null
    at com.facebook.presto.hive.HiveUtil.createRecordReader(HiveUtil.java:200)
    at com.facebook.presto.hive.GenericHiveRecordCursorProvider.lambda$createRecordCursor$0(GenericHiveRecordCursorProvider.java:72)
    at com.facebook.presto.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
    at com.facebook.presto.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:76)
    at com.facebook.presto.hive.GenericHiveRecordCursorProvider.createRecordCursor(GenericHiveRecordCursorProvider.java:71)
    at com.facebook.presto.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:160)
    at com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:87)
    at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:44)
    at com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:56)
    at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:259)
    at com.facebook.presto.operator.Driver.processInternal(Driver.java:300)
    at com.facebook.presto.operator.Driver.lambda$processFor$6(Driver.java:234)
    at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:538)
    at com.facebook.presto.operator.Driver.processFor(Driver.java:229)
    at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:623)
    at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
    at com.facebook.presto.execution.executor.LegacyPrioritizedSplitRunner.process(LegacyPrioritizedSplitRunner.java:23)
    at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:478)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
    at com.ibm.spss.hive.serde2.xml.XmlInputFormat$XmlRecordReader.<init>(XmlInputFormat.java:72)
    at com.ibm.spss.hive.serde2.xml.XmlInputFormat.getRecordReader(XmlInputFormat.java:57)
    at com.facebook.presto.hive.HiveUtil.lambda$createRecordReader$3(HiveUtil.java:197)
    at com.facebook.presto.hive.RetryDriver.run(RetryDriver.java:138)
    at com.facebook.presto.hive.HiveUtil.createRecordReader(HiveUtil.java:197)
    ... 20 more

I can query the XML files perfectly fine from beeline or the Hive client. Is this an issue with the SerDe library or with Presto?
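For context, the table was created along the lines of the pattern documented in the Hive-XML-SerDe project (a sketch only; the column names, XPath expressions, and record tags here are hypothetical, not my exact DDL):

```sql
CREATE EXTERNAL TABLE xml_auctions (
    auction_id STRING,
    price DOUBLE
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
    -- each column is mapped to an XPath expression into one XML record
    "column.xpath.auction_id" = "/auction/@id",
    "column.xpath.price" = "/auction/price/text()"
)
STORED AS
    INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 'hdfs://namenode:8020/user/hive/warehouse/xml_auctions'
TBLPROPERTIES (
    -- start/end tags that delimit one XML record within the file
    "xmlinput.start" = "<auction",
    "xmlinput.end"   = "</auction>"
);
```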

electrum commented 6 years ago

Presto doesn't really support adding custom serde jars. The best option is to add the code to Presto with appropriate unit tests. This will ensure that it works and stays working in the future.

leolorenzoluis commented 6 years ago

@electrum Thanks for your response. Does Presto have a built-in SerDe for handling XML, or what would you recommend as a solution for handling XML files? Convert to a supported file format such as Avro, Parquet, or ORC? Spin up a PostgreSQL instance and dump all the XML files there, since XML is natively supported?

I think this is a good case where Presto should provide standard interfaces for adding custom serdes without having to fork the code and repackage it.

Does this also mean this is no longer true? https://github.com/prestodb/presto/issues/868

electrum commented 6 years ago

We would need to copy that code into Presto. Or possibly depend on it if they publish a useable Maven artifact.

The topic of custom serdes comes up maybe once a year. Having a pluggable interface is a lot of work. It's better if we have first class support for the few formats that are needed.

leolorenzoluis commented 6 years ago

Why would Presto require the custom SerDe? As far as I understand, Presto queries the data where it lives. What is Hive doing in this case, if a Presto worker is just asking Hive to stream back the results and report to the coordinator? Am I wrong and missing something?

ashwinhs commented 6 years ago

This is what the documentation says -

The Hive connector allows querying data stored in a Hive data warehouse. Hive is a combination of three components:

- Data files in varying formats that are typically stored in the Hadoop Distributed File System (HDFS) or in Amazon S3.
- Metadata about how the data files are mapped to schemas and tables. This metadata is stored in a database such as MySQL and is accessed via the Hive metastore service.
- A query language called HiveQL. This query language is executed on a distributed computing framework such as MapReduce or Tez.

Presto only uses the first two components: the data and the metadata. It does not use HiveQL or any part of Hive’s execution environment.

leolorenzoluis commented 6 years ago

Okay, so if that's the case, can I just create a custom JDBC connector that uses HiveQL to query the XML from the table? Would there be any conflict with the existing Hive connector (I still want to use it for other tables) if I use the new connector for the XML tables in Hive via JDBC?

Edit: Initial findings - aggregating over a Hive JDBC connection into Presto is really slow; parsing takes a while. I think I may end up needing to transform the XML to a different file format.

@ashwinhs How does Presto query the data? Using an HDFS client?

electrum commented 6 years ago

See the overview here: https://prestodb.io/docs/current/connector/hive.html

electrum commented 6 years ago

That's possible, but it will be very slow since all the data will be pulled via a single JDBC connection.

leolorenzoluis commented 6 years ago

That's true. I ended up creating another table in a different format, such as Parquet or ORC, and Presto was of course able to query it quickly. I'm not sure if that's good enough; the problem would be maintaining the datasets across two tables and keeping them in sync.
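The conversion itself can be done with a single Hive statement (a sketch; the target table name is hypothetical, and the re-sync strategy depends on how new XML arrives):

```sql
-- One-time conversion of the XML-backed table to Parquet.
-- To re-sync after new data lands, use INSERT OVERWRITE instead.
CREATE TABLE xml_auctions_parquet
STORED AS PARQUET
AS SELECT * FROM xml_auctions;
```

Presto can then read the Parquet table directly through the Hive connector, with no custom SerDe involved.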

leolorenzoluis commented 6 years ago

@electrum What about other connectors that use a JDBC connection? Wouldn't they have the same problem?

electrum commented 6 years ago

Yes, they have the same limitation.

leolorenzoluis commented 6 years ago

@electrum May I know why it would be a single JDBC connection? Also, does Presto use a single JDBC connection per connector, or one per operation, such as fetching schemas, columns, etc.? Does it reuse JDBC connection pools?

If that limitation holds, then Presto would be slow for, say, a PostgreSQL source and a Redis source that are both accessed over JDBC connections. Are you saying that even a simple join between two such data sources is going to be slow?
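For example, a cross-connector join like this (catalog, schema, and table names are hypothetical) would have to pull the entire PostgreSQL side over JDBC before joining:

```sql
SELECT o.order_id, c.customer_name
FROM hive.default.orders o
JOIN postgresql.public.customers c
  ON o.customer_id = c.id;
```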

sarwarbhuiyan commented 6 years ago

+1. XML seems to be fading into the background in big data ecosystems, but big financials still use a lot of XML for message interchange and storage in various integration scenarios. We'd like to use the XmlSerde as part of Amazon Athena, which uses Presto under the hood. Would very much appreciate support or any workaround.

tooptoop4 commented 4 years ago

prestosql 334 has the same issue; https://github.com/prestosql/presto/issues/3888 might be an alternative.