google / fhir-data-pipes

A collection of tools for extracting FHIR resources and analytics services on top of that data.
https://google.github.io/fhir-data-pipes/
Apache License 2.0

Evaluate Presto/Trino as a query engine on the generated Parquet files #257

Open bashir2 opened 2 years ago

bashir2 commented 2 years ago

Presto is a query engine for running interactive SQL queries in a distributed environment. It is open-source and seems to offer some features similar to BigQuery's (I have not used Presto myself).

This issue is to track experimentation with Presto on top of the Parquet files that we generate for FHIR resources. Here is an incomplete list of things we would like to try:

1) Is it possible to use Presto to query Parquet files locally? If yes, document the setup.
2) How does the distributed environment with many nodes need to be set up?
3) How does it perform compared to Spark, e.g., run the same queries on the same Parquet files through Presto and Spark and compare the results.
4) How transferable are Presto SQL queries to other distributed SQL engines like BigQuery or Spark SQL? Particular attention needs to be paid to queries involving nested and repeated fields of FHIR.
5) How well does the integration of Presto and Spark work?
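To make item 4 concrete, here is a minimal sketch of how flattening a repeated field differs across engines. It is only an illustration, not something taken from this project: the helper `flatten_query`, the table `patient_flat`, and the column `given_names` are all hypothetical, and the templates assume the repeated column holds scalar values. For arrays of structs the alias list differs (Presto/Trino expand each struct field into its own column), which is exactly the kind of detail the evaluation would need to check.

```python
# Hypothetical helper for illustration only; table and column names are assumptions.
# Each template emits one output row per element of a repeated (array) column
# that holds scalar values.
_TEMPLATES = {
    # Presto and Trino share the CROSS JOIN UNNEST form.
    "trino": (
        "SELECT t.{key}, e.elem FROM {table} AS t "
        "CROSS JOIN UNNEST(t.{col}) AS e (elem)"
    ),
    # BigQuery aliases the unnested element directly.
    "bigquery": (
        "SELECT t.{key}, elem FROM {table} AS t "
        "CROSS JOIN UNNEST(t.{col}) AS elem"
    ),
    # Spark SQL uses LATERAL VIEW with the explode() generator.
    "spark": (
        "SELECT t.{key}, e.elem FROM {table} AS t "
        "LATERAL VIEW explode(t.{col}) e AS elem"
    ),
}
_TEMPLATES["presto"] = _TEMPLATES["trino"]


def flatten_query(dialect: str, table: str, key: str, array_col: str) -> str:
    """Return a SQL string that flattens `array_col` for the given dialect."""
    try:
        template = _TEMPLATES[dialect.lower()]
    except KeyError:
        raise ValueError(f"unsupported dialect: {dialect!r}") from None
    return template.format(key=key, table=table, col=array_col)


if __name__ == "__main__":
    # Print the three variants side by side for a hypothetical flattened table.
    for name in ("trino", "bigquery", "spark"):
        print(f"{name}: {flatten_query(name, 'patient_flat', 'id', 'given_names')}")
```

The three strings differ only in the join clause, which suggests that simple flattening queries could be ported mechanically, while deeper nesting would likely need per-engine review.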

Update May 2024: In 2019, Trino was forked from Presto and it seems it has a faster pace of development. We should evaluate Trino too.

anhvdq commented 1 month ago

Hi @bashir2, it seems that this issue is currently out of date.

The Presto query engine was rebranded as Trino on 2020-12-27: https://trino.io/blog/2020/12/27/announcing-trino.html AFAIK, Trino is being developed at a far faster pace than Presto, so it has more features and optimizations.

Should we evaluate Trino instead, or keep going with Presto?

bashir2 commented 1 month ago

Thanks @anhvdq for the note; yes I agree we should evaluate Trino too. I became aware of the blog post you shared a while ago but it seems to me that Presto is also being developed in parallel. So I think we can evaluate both (I personally have no experience with either).

bashir2 commented 1 month ago

BTW, @anhvdq, if you have experience with either Presto or Trino and would like to do this evaluation with the FHIR Parquet files that our pipelines produce, please let me know and I'll assign this to you; we appreciate community contributions.

anhvdq commented 1 month ago

Thanks @bashir2. I have more experience with Trino and just a bit with Presto (we moved to Trino right after their announcement, so I lost track of Presto's current state). I'm interested in this project but still new to the FHIR format, so I'm taking a look at the format and will then share an evaluation of Trino. After that we can discuss this in more detail.