mariusa opened this issue 5 years ago (status: Open)
@mariuss Marius, is this something you'd like to work on?
I don't have the skills :(
We're looking for a way to query large CSV or Parquet files, and Presto seems like a great solution. However, getting started with Parquet is currently too complex, due to all the Hadoop/Hive dependencies and setup steps, versus being able to connect to Parquet files directly: https://scientific-software.netlify.com/howto/how-to-query-big-csv-files
Thanks, Maria!
@mariusa You don't need Hadoop/HDFS. All you need is S3 and Hive (Hive can run without HDFS, NameNode, etc.)
Even the standalone Hive metastore needs Hadoop-specific environment variables/jars. Maybe implement the Presto Thrift connector to communicate with S3? I will give it a go :-)
I would be very interested in this connector also. I believe (though obviously I haven't tried) that it would be significantly more efficient to query large Parquet files directly, without requiring the data to pass between Hive and Presto over Thrift.
@jeromeof The data does not come from Hive; only the metadata does.
I haven't tried it, but I've heard there is an experimental setting (`hive.metastore=file`) that could support reading Parquet directly: https://github.com/prestodb/presto/tree/master/presto-hive-metastore/src/main/java/com/facebook/presto/hive/metastore/file
You can indeed [ab]use `hive.metastore=file` to avoid setting up a real Hive metastore. You can then run MinIO to serve your local files over "S3".
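As a sketch of what that setup might look like (property names taken from the Presto Hive connector; the paths, endpoint, and credentials below are placeholders, so verify against your Presto version), a catalog file such as `etc/catalog/hive.properties` could be:

```properties
connector.name=hive-hadoop2
# Experimental file-based metastore instead of a Thrift metastore service
hive.metastore=file
hive.metastore.catalog.dir=file:///data/hive-metastore
# Point S3 access at a local MinIO instance
hive.s3.endpoint=http://127.0.0.1:9000
hive.s3.path-style-access=true
hive.s3.aws-access-key=minioadmin
hive.s3.aws-secret-key=minioadmin
```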
I put together a Docker container with the above setup; it seems to work well for ad-hoc analysis of local files, though I've not used it extensively yet, so YMMV.
```
docker run -it --mount source=/<data-dir>/,destination=/parquet,type=bind floatingwindow/presto-local-parquet
```
The above will start Presto and the necessary bits to read local files, then give you a Presto shell.
More info: https://github.com/floating-window/presto-local-parquet
At the moment you still need to define the schemas manually. I'm partway through writing a script to scan a set of Parquet files and automagically set up the corresponding schema and file mappings in Presto; I'll update the container above when it's finished.
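The schema-generation idea can be sketched roughly as follows. This is a hypothetical illustration, not the actual script: the Arrow-to-Presto type mapping and the `external_location` table property are assumptions to check against your Presto version, and reading the schema from a real file needs `pyarrow` (shown only in a comment).

```python
# Sketch: turn a Parquet file's column schema into a Presto CREATE TABLE
# statement for the Hive connector. Types not in the map fall back to VARCHAR.

# Assumed mapping from Arrow type names to Presto SQL types (incomplete).
ARROW_TO_PRESTO = {
    "int32": "INTEGER",
    "int64": "BIGINT",
    "float": "REAL",
    "double": "DOUBLE",
    "bool": "BOOLEAN",
    "string": "VARCHAR",
    "date32[day]": "DATE",
}

def presto_ddl(table, columns, location):
    """Build a CREATE TABLE statement.

    columns: list of (name, arrow_type_name) pairs.
    location: where the Parquet files live, e.g. an s3a:// prefix.
    """
    cols = ",\n  ".join(
        f"{name} {ARROW_TO_PRESTO.get(arrow_type, 'VARCHAR')}"
        for name, arrow_type in columns
    )
    return (
        f"CREATE TABLE {table} (\n  {cols}\n)\n"
        f"WITH (format = 'PARQUET', external_location = '{location}')"
    )

# To extract the column list from an actual file (requires pyarrow):
#   import pyarrow.parquet as pq
#   schema = pq.read_schema("data.parquet")
#   columns = [(f.name, str(f.type)) for f in schema]

print(presto_ddl("hive.default.events",
                 [("id", "int64"), ("name", "string")],
                 "s3a://bucket/events/"))
```

The generated statement can then be run through the Presto CLI against the catalog configured above.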
This would be really helpful when using Superset to do analysis
CC: @majetideepak
Actually, I was wondering about not using HDInsight or Spark, and instead of HDFS just using Azure Data Lake Gen2: insert data into the storage using a small C# app and query the data with Presto. The strangest thing is the requirement to update the partitions each time I create a new partition.
@MartinKosicky Are you able to share the mini app?
Actually no, because I couldn't find a way to do it in the documentation. I wanted to try an architecture on Azure that manually ingests Parquet files into Azure Data Lake without Hadoop (no Spark or Flink, just a plain C# app). Since Azure decouples compute and storage, I thought that if I used Presto for queries, I wouldn't need Spark, Hive, or other technologies at all. But then there is this metastore complication: I have to deploy some fake Hive metastore. I would rather create a static configuration file that describes the schema than run a service.
Presto has unofficial support for a file-based metastore: https://github.com/prestodb/presto/issues/19112 — I use this for development. For writing data, I recently wrote a small Velox app that writes a single Parquet file to storage: https://github.com/majetideepak/velox/commit/8ef1bf35de7c7c71c3481d8f6b2817d865ea8a3e It does not support partitions yet, but it could be a starting point.
Please improve the Parquet connector to not require Hive, and ideally not require HDFS either.
Working directly with Parquet files would make Presto much easier for data scientists to adopt.
Related: https://github.com/prestodb/presto/issues/11955