prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io

Please improve parquet connector to not require hive #12955

Open mariusa opened 5 years ago

mariusa commented 5 years ago

Please improve the Parquet connector so it does not require Hive, and ideally does not require HDFS either.

Working directly with Parquet files would make Presto much easier for data scientists to adopt.

Related: https://github.com/prestodb/presto/issues/11955

mbasmanova commented 5 years ago

@mariusa Marius, is this something you'd like to work on?

mariusa commented 5 years ago

I don't have the skills :(

We're looking for a way to query large CSV or Parquet files, and Presto seems like a great solution. However, it's currently too complex to get started with Parquet, due to all the Hadoop/Hive dependencies and setup steps, compared with being able to connect to Parquet files directly: https://scientific-software.netlify.com/howto/how-to-query-big-csv-files

Thanks, Maria!

tooptoop4 commented 5 years ago

@mariusa you don't need Hadoop/HDFS. All you need is S3 and Hive (Hive can run without HDFS, a NameNode, etc.)
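
For example, a minimal etc/catalog/hive.properties might look something like this (untested sketch; the metastore host, S3 endpoint, and credentials are placeholders):

connector.name=hive-hadoop2
# standalone Hive metastore, no HDFS/NameNode behind it
hive.metastore.uri=thrift://metastore-host:9083
# point the connector at your S3 (or S3-compatible) endpoint
hive.s3.endpoint=https://s3.us-east-1.amazonaws.com
hive.s3.aws-access-key=YOUR_ACCESS_KEY
hive.s3.aws-secret-key=YOUR_SECRET_KEY
hive.s3.path-style-access=true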

DennisRutjes commented 5 years ago

Even the standalone Hive metastore needs Hadoop-specific environment variables/jars. Maybe implement the Presto Thrift connector to communicate with S3? I will give it a go :-)

jeromeof commented 4 years ago

I would also be very interested in this connector. I believe (though obviously I haven't tried it) that querying large Parquet files directly would be significantly more efficient than requiring the data to pass between Hive and Presto over Thrift.

tooptoop4 commented 4 years ago

@jeromeof the data does not come from Hive; only the metadata does.

tooptoop4 commented 4 years ago

I haven't tried it, but I have heard there is an experimental setting (hive.metastore=file) that could support reading Parquet directly: https://github.com/prestodb/presto/tree/master/presto-hive-metastore/src/main/java/com/facebook/presto/hive/metastore/file
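
If it works as advertised, the catalog config would presumably be something like this (unverified sketch; the directory path is a placeholder):

connector.name=hive-hadoop2
hive.metastore=file
# local directory where the file metastore keeps schema/table metadata
hive.metastore.catalog.dir=file:///var/presto/metastore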

louiseightsix commented 4 years ago

You can indeed [ab]use hive.metastore=file to avoid setting up a real Hive metastore. You can then run MinIO to serve your local files over "S3".

I put together a Docker container with the above setup; it seems to work well for ad-hoc analysis of local files, though I haven't used it extensively yet, so YMMV.

docker run -it --mount source=/<data-dir>/,destination=/parquet,type=bind floatingwindow/presto-local-parquet

The above will start Presto and the necessary bits to read local files, then give you a Presto shell.

More info: https://github.com/floating-window/presto-local-parquet

At the moment you still need to define the schemas manually. I'm partway through writing a script to scan a set of Parquet files and automagically set up the corresponding schema and file mappings in Presto; I'll update the above container when it's finished.
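
In the meantime, defining a table by hand looks roughly like this (the schema, table, columns, and bucket path below are made-up examples for illustration):

-- schema backed by the MinIO "S3" bucket
CREATE SCHEMA hive.local WITH (location = 's3a://parquet/');

-- external table pointing at a directory of Parquet files
CREATE TABLE hive.local.events (
    event_time timestamp,
    user_id bigint,
    payload varchar
)
WITH (
    external_location = 's3a://parquet/events/',
    format = 'PARQUET'
);

SELECT count(*) FROM hive.local.events;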

johnnytshi commented 2 years ago

This would be really helpful when using Superset for analysis.

mbasmanova commented 2 years ago

CC: @majetideepak

MartinKosicky commented 2 months ago

Actually, I was wondering about not using HDInsight or Spark, and using Azure Data Lake Gen2 instead of HDFS: insert data into the storage with a small C# app and query it with Presto. The strangest thing is the requirement to update the partitions each time I create a new partition.
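
(As far as I can tell, the intended way to pick up new partitions is the Hive connector's sync_partition_metadata procedure; a sketch, assuming the catalog is named hive and using a made-up schema/table:

-- register partition directories that exist in storage but not in the metastore
CALL hive.system.sync_partition_metadata('default', 'my_table', 'ADD');
-- or reconcile additions and removals in one pass
CALL hive.system.sync_partition_metadata('default', 'my_table', 'FULL');

but that is still an extra step after every load.)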

majetideepak commented 2 months ago

@MartinKosicky Are you able to share the mini app?

MartinKosicky commented 2 months ago

Actually no, because I couldn't find a way to do it in the documentation. I wanted to try an architecture on Azure that manually ingests Parquet files into Azure Data Lake without Hadoop (no Spark or Flink, just a plain C# app). Since Azure decouples compute and storage, I thought that if I used Presto for querying, I wouldn't need Spark, Hive, or other technologies at all. But then there is this metastore complication, where I have to deploy some fake Hive metastore. I would rather create a static configuration file that describes the schema than run a service.

majetideepak commented 2 months ago

Presto has unofficial support for a file-based metastore (https://github.com/prestodb/presto/issues/19112); I use it this way for development. For writing data, I recently wrote a small Velox app that writes a single Parquet file to storage: https://github.com/majetideepak/velox/commit/8ef1bf35de7c7c71c3481d8f6b2817d865ea8a3e It does not support partitions yet, but it could be a starting point.