Read Presto Data into Spark Dataframe

thirumalairajr commented 1 month ago

Hi Team,

I have requirement for using reading data from presto query and load it into Spark Dataframe and do further processing using it in Spark.

Presto JDBC driver might not be useful for me because the amount of data read might be sometimes large.
Thinking of using Presto on Spark APIs to read the data into dataframe rather that just returning the results. Is this already supported ?

singcha commented 1 month ago

Can you share more on this? The part I am getting confused on why presto is needed to read the data, rather than just using spark df or spark-sql ?

This is not supported currently.

Depending on how large data is and where driver is running, you can go with different approaches : If you can fit the data on driver node, you can call into IPrestoSparkQueryExecution.execute() method which returns the result of presto query and then load it in df for further processing.

If data can not fit on driver :

You can use insert queries to write to intermediate location and load it back in an RDD, or
You will need to modify implementations of IPrestoSparkQueryExecution.execute() method to not fetch result of query back on driver. Instead, load the RDD in dataframe using SparkSession.createDataFrame() when executing finalFragment. See code in PrestoSparkAdaptiveQueryExecution.executeFinalFragment() which will need to be modified

thirumalairajr commented 1 month ago

@singcha Thanks for your quick response.

Can you share more on this? The part I am getting confused on why presto is needed to read the data, rather than just using spark df or spark-sql ?

We have many Presto views stored in the Hive Metastore, and there are requirements to build Spark pipelines that read data from these Presto views.

Some of the Presto views are large and might not fit in the driver memory.

If data can not fit on driver :

You can use insert queries to write to intermediate location and load it back in an RDD, or

You will need to modify implementations of IPrestoSparkQueryExecution.execute() method to not fetch result of query back on driver. Instead, load the RDD in dataframe using SparkSession.createDataFrame() when executing finalFragment. See code in PrestoSparkAdaptiveQueryExecution.executeFinalFragment() which will need to be modified

I think option 2 makes more sense because it avoids the overhead of using an intermediate storage location, and the reads can be done within the same Spark session.

We could actually write a new method, IPrestoSparkQueryExecution.read() , which would return a Spark DataFrame by following the approach you suggested.

I can contribute to this feature.

prestodb / presto

Read Presto Data into Spark Dataframe #23673