prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
15.92k stars · 5.33k forks

Spill to disk for order by operator #11059

Closed rahul9269 closed 2 years ago

rahul9269 commented 6 years ago

Was going through the docs and saw that spill to disk has been implemented for joins and aggregations. Has there been any proposal or future plan for spill to disk for the order by operator?

kokosing commented 6 years ago

Today there are no such plans. Can you explain your use case?

0.206 (see https://github.com/prestodb/presto/issues/10928#issuecomment-404108988) is going to have support for distributed sort; I hope it will solve your problem.
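Enabling it should be a one-line toggle per query; a sketch, assuming the distributed_sort session property name from the Presto tuning docs (property names may vary between versions):

```sql
-- assumed session property name; check the docs for your version
SET SESSION distributed_sort = true;
```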

rahul9269 commented 6 years ago

We have multiple queries that fail with memory limit exceeded errors when using an order by clause on large data. I'm still new to Presto, but as I understand it, all the data is first collected on the coordinator and then sorted, which is what breaches the memory limits.

Just curious: with the distributed sort solution above, will the memory issue be solved, or will it just improve query performance? Will distributed sort support streaming the data, which I guess would free up memory as the data is streamed out?

dain commented 6 years ago

@rahul9269 distributed sort uses distributed memory so you can do larger sorts, but there is still a limit. Is there a reason you need to sort large result sets? In our installations at FB, we have found that virtually all cases where people were doing large sorts were unnecessary.

rahul9269 commented 6 years ago

I don't have much insight into the queries as they are customer-centric, but we are okay with the queries succeeding at the cost of execution time, so I was thinking about spill to disk for this operator.

@dain It would be great if you could describe some cases/behaviours where doing large sorts was unnecessary.

dain commented 6 years ago

@rahul9269 users add sorts because they don't understand that they are expensive. So they select a huge table and then sort it, even though their downstream code doesn't need the ordering. For the rare case where users "need" sorted data, we have them sort it after export. Some users believe that sorting helps queries (because it does in Hive-like systems), but in Presto it is just extra wasted work.

ddrinka commented 5 years ago

@dain, We're doing a lot of sorting to get the right keys in place for efficient predicate pushdown. We chose Presto for its quick query capabilities, and we spend a lot of time thinking about ways to reduce the number of trips to S3, for instance. Having the data sorted by common query keys means tons less traffic from S3.

We've adopted the approach of using CLUSTERED BY SORTED BY for our heavy lifting. The sort used there spills to "disk" (pushes back to S3) and works for any size dataset we've thrown at it. It's great! Except it means I have to deal with CLUSTERED data, which introduces a lot of other requirements (I can't just drop ORC files into an S3 bucket anymore). It's also pretty slow compared to Hive doing the sort, for instance. But we're aiming to be 100% Presto-powered.
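For concreteness, this kind of bucketed-and-sorted table can be sketched with the Presto Hive connector's table properties; all identifiers and the bucket count below are made up for illustration:

```sql
-- illustrative DDL sketch: bucketed_by / sorted_by / bucket_count are
-- Hive connector table properties; table and column names are hypothetical
CREATE TABLE hive.default.events (
  user_id bigint,
  ts      timestamp,
  payload varchar
)
WITH (
  format = 'ORC',
  bucketed_by = ARRAY['user_id'],
  bucket_count = 32,
  sorted_by = ARRAY['user_id']
);
```

The downside described above is exactly this: once the table is bucketed, every file must conform to the bucketing scheme, so you can no longer just drop plain ORC files into the bucket.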

Presto is fast enough to scramble through a boat-load of random data, but, again, we're using it for interactive frontends where latency matters, and for this, the data really works best sorted. And sometimes that means sorting billions of rows.

I haven't tried the distributed sort functionality yet. Maybe it'll work for our scenario. But I think there are valid use cases for a spill, and it would really improve our workflow to have a simple operator that can order data as it's being written to ORC, rather than defining the table as "sorted".

dain commented 5 years ago

@ddrinka I believe you are referring to Hive sorted buckets, which is a pretty different feature from this issue. Specifically, this issue is about global total ordering, whereas Hive sorted buckets are limited to a single file created by a single machine. As for the slowness of writing these compared to Hive, I would guess it is related to the current code using S3 for temporary storage instead of the local disk (this feature was developed at FB, where there isn't a usable local disk), and I believe there is an issue to add an option for using the local disk.

ddrinka commented 5 years ago

My thought was that, if I had global total ordering (that could deal with large quantities of data), I would use it all day long to sort data into storage. Using sorted bucketed tables comes with lots of downsides. If I could INSERT INTO x SELECT * FROM y ORDER BY z and end up with sorted data in x, I've accomplished my goal of sorted ORC without the downsides of clustered tables. Or maybe I'd like to CREATE TABLE x AS SELECT * FROM y ORDER BY z, where I can't even specify sorted bucketed parameters. If other folks want to get sorted query results to CSV, even if they probably shouldn't, they can do that too.
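Spelled out as a sketch (the identifiers x, y, z are placeholders, as in the sentence above):

```sql
-- goal: physically sorted ORC files in x, with no bucketing required;
-- whether the ORDER BY survives the write is the open question here
CREATE TABLE x AS
SELECT *
FROM y
ORDER BY z;
```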

I'm mostly piping up because you asked for examples of the utility of this feature, and I think my use case is a place that global total ordering is helpful.

atris commented 5 years ago

Spill to disk for ORDER BY operator was recently released in 0.208 release of Starburst Presto.

l1x commented 4 years ago

@ddrinka Could you elaborate on this please?

We've adopted the approach of using CLUSTERED BY SORTED BY for our heavy lifting.

Is there a way to make this work? I get a max memory exceeded (20G) error for the following query:

INSERT INTO target_table
SELECT *
FROM source_table
WHERE some_field = 'x'
-- 2       36_559       188_137   187_478
ORDER BY f0, f1, f2, f3

The numbers in the comment are the distinct counts for each of the four sort fields, respectively. The size difference between the sorted and unsorted data set is 50%. Sorting is the single most important space-reduction feature of any columnar data store that has RLE. Unfortunately I could not find a way to get this working with Presto; I keep running into the memory limit. I am going to look into this more. I hope that maybe bucketing gives some ways of tweaking it, or perhaps you can explain how CLUSTERED BY SORTED BY works. I could not find it in the Presto docs, though I know it works in Hive.

ddrinka commented 4 years ago

@l1x We ran into enough issues with Presto sorting that we began running Spark exclusively for ordering ORC data for faster queries (and smaller datasets) on Presto.

That said, I believe modern Presto versions (we’ve adopted PrestoSQL) handle global sort through ORDER BY much better than when this issue was opened. We have an upcoming project to reconsider using Presto for this portion of ETL.

The developers have also bantered about the possibility of marking a table as sorted, outside of CLUSTERED BY. This would come with two benefits. First, relying on CTAS with ORDER BY to write sorted data is an implementation detail and isn’t guaranteed to keep working in the future; the SQL standard doesn’t specify what happens when you INSERT INTO or CTAS with an ORDER BY clause.

Second, knowing a table is ordered allows the use of binary search, which could be more efficient than a scan of predicate-filtered rows.

I don’t believe a tracking issue exists for this work, in either repo. You could open one?

l1x commented 4 years ago

@ddrinka thanks a million! Yeah, newer versions of PrestoSQL have spilling, and you can spill to disk. I do not have a cluster with this feature enabled, though. One question about your Spark setup: can you control how many files your Spark job creates? (You can ignore this question, because it is totally out of context here.)

I don’t believe a tracking issue exists for this work, in either repo. You could open one?

Which part are you referring to?
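(For reference, since spilling keeps coming up in this thread: in prestodb, spill to disk is gated behind config properties. A sketch, assuming the experimental.* property names from the Presto spill-to-disk docs; names and defaults may change between versions:)

```properties
# config.properties sketch -- assumed property names, verify against
# the spill-to-disk docs for the version you run
experimental.spill-enabled=true
experimental.spiller-spill-path=/mnt/presto-spill
experimental.max-spill-per-node=100GB
experimental.query-max-spill-per-node=40GB
```

Spilling can then be toggled per query via the corresponding spill_enabled session property.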

ddrinka commented 4 years ago

Can you control how many files your Spark job is creating?

Yes, here's a snippet of what I use:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col
import spark.implicits._ // enables the $"col" syntax used below

// Read every ORC file for this table/time window from S3
val inputData = spark.read.orc(s"s3://bucket/prefix/${configuration.tableName}/${timeScaleSuffix}*")
val manipulatedData = configuration.additionalManipulations(inputData.drop($"Date"))
// Global (total-order) sort on the configured keys
val sortedData = manipulatedData.sort(configuration.sortFields.map(column => col(column)): _*)
sortedData
  .coalesce(1) // collapse to a single output file
  .write
  .mode(SaveMode.Overwrite)
  .format("orc")
  .option("compression", "zlib")
  .save(destS3Path)

This writes a single sorted file and scales to any data size I've thrown at it. The final merge portion of the distributed sort takes place on a single node, but it should be streaming, so performance is impacted without there being a single-node memory requirement.

We use Databricks, which handles all the scheduling and offers a really fantastic UI for Spark. It's also pretty cheap compared to EMR, etc.

Which part are you referring to?

Requesting a feature for marking a table sorted outside of a CLUSTERED BY SORTED BY context. My understanding is this is the right way forward for this feature on Presto, but I haven't heard many other folks interested in it.

l1x commented 4 years ago

Excellent insight! Thank you @ddrinka! I am going to experiment with spilling to disk in the upcoming weeks and see whether we can get Presto to do it or we have to switch to Spark. Much appreciated!

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had any activity in the last 2 years. If you feel that this issue is important, just comment and the stale tag will be removed; otherwise it will be closed in 7 days. This is an attempt to ensure that our open issues remain valuable and relevant so that we can keep track of what needs to be done and prioritize the right things.