trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.18k stars 2.94k forks source link

Remove spill to disk feature #22845

Open mosabua opened 1 month ago

mosabua commented 1 month ago

Docs available at https://trino.io/docs/current/admin/spill.html and https://trino.io/docs/current/admin/properties-spilling.html

The feature has been unmaintained and considered legacy for a long time now. With https://github.com/trinodb/trino/pull/22843 we made an official doc update to suggest users to move to FTE and not use spilling.

The docs update ships with Trino 454.

From the discussion we concluded that the whole feature should actually be deprecated and later removed. This roadmap item tries to help with the implementation of that effort.

High-level tasks:

findepi commented 1 month ago

I am not convinced the FTE mode is already superior over spill to disk functionality for sufficient percentage of workloads.

findepi commented 1 month ago

cc people involved in https://github.com/trinodb/trino/pull/22843 -- @electrum @losipiuk @wendigo and probably interested -- @pettyjamesm and possibly interested -- @trinodb/maintainers

sopel39 commented 1 month ago

From the discussion we concluded that the whole feature should actually be deprecated and later removed. This roadmap item tries to help with the implementation of that effort.

I think the criteria for removal should be (partial) perf and cost parity with FTE. I think we are long way from there TBH

sopel39 commented 1 month ago

I don't think OSS FTE is alternative to spill-to-disk atm considering perf, cost and deployment complexity. If just support for big queries is needed without resilience, spill-to-disk might be better choice currently

losipiuk commented 1 month ago

I don't think OSS FTE is alternative to spill-to-disk atm considering perf, cost and deployment complexity. If just support for big queries is needed without resilience, spill-to-disk might be better choice currently

deployment is actually trivial - you just set the S3 spooling location. But the incurred cost is significiant with Exchange implementation we have right now. Also latency is harmed; throughtput not that much.

sopel39 commented 1 month ago

deployment is actually trivial - you just set the S3 spooling location...

Yup. It seems to be classical problem of choosing two of three: speed, low compexity, low cost. And not choosing "low complexity" involves developing new stuff in OSS

raunaqmorarka commented 1 month ago

I don't think OSS FTE is alternative to spill-to-disk atm considering perf, cost and deployment complexity. If just support for big queries is needed without resilience, spill-to-disk might be better choice currently

Spill-to-disk also adds cost of having local disks, which are going to sit mostly unused. If a workload is frequently spilling to disk, then the performance tends to be really bad. What if we provide an implementation of FTE exchange which can work with local disks and optionally spill to object storage if needed (similar to Apache Hive and Spark shuffle service) ? I think that would make deployment and cost the same, with some hit to perf. The perf hit could be mitigated by optimizations specific to FTE like adaptive query optimization. It also has more predictable perf than spill which would slow execution down by a lot more than FTE when there is actual spilling.

We already made an architectural decision years ago of leaving spill-to-disk behind by introducing join optimzations only for non-spilling join operator and making that the default. We've also been recommending users to switch to FTE from spill-to-disk wherever possible. So at this point we should have some concrete action items for the minimal things we should do in FTE to remove spill-to-disk.

cshao239 commented 1 month ago

Removing a feature should have a very high bar. Currently spill-to-disk feature does not meet the bar because of reasons mentioned above. I do not think it is the right thing to do to remove spill-to-disk feature completely.

wendigo commented 1 month ago

@cshao239 keeping unmaintained and not recommended code doesn't bring any value beside headaches

findepi commented 1 month ago

it would be good to clarify in what sense the spilling code is considered unmaintained. for example, it has tests that are run as part of CI, so it can be seen as maintained.

raunaqmorarka commented 1 month ago

We don't have to remove spill to disk immediately. But we should define a concrete target for when we would consider FTE to be a reasonable replacement for spill-to-disk. The bar cannot be that FTE should be the same or better in every single way than spill-to-disk. Ideally, we should have concrete GH issues defined for what is needed to make this transition happen smoothly. Alternatively, if none of the ideas for code changes to improve FTE can ever justify FTE replacing spill-to-disk, then we should talk about restoring spill-to-disk to being first class citizen in the code and in our recommendations to users.

in what sense the spilling code is considered unmaintained

We aren't actively improving it. We've forked the join operator, put the spilling version behind a flag and optimized only the unspilled version. We haven't benchmarked spilling in a long time. I don't think this is an important point though, an unmaintained thing can become maintained if we choose to start working on it again. The important question is whether FTE can ever supercede spill-to-disk and what it would take to do so.

mosabua commented 1 month ago

Just to be clear .. there is no immediate need for this removal but directionally we are aimning that direction. At this stage we still very much believe that new users should avoid spilling to disk so the doc update stands correct.

Thank you all for the discussion here.. keep it going and we will come up with a good plan that works for everyone. We do need to know your thoughts and usage to make that decision.

mosabua commented 1 month ago

Now also read all the threads .. I think adding support for a local exchange could be really great @raunaqmorarka . Our native file system stuff already supports local disk from all I know so the lift might not be that heavy. cc @electrum

sopel39 commented 1 month ago

I've added [ ] Make FTE (almost) cost and performance equivalent to spill-to-disk. which might be satisfied by on-disk FTE

raunaqmorarka commented 1 month ago

I don't think that Make FTE (almost) cost and performance equivalent to spill-to-disk is specific enough. We should talk about details, how do you measure the cost and on what benchmark ? If it's not an empirical target, but one based on high level design trade-offs then we need to specify which features need to be implemented to meet expectations.

mosabua commented 1 month ago

I updated the description along that line @raunaqmorarka .. feel free to provide more details in the description as well .. or add any specific tasks. This is just a high-level outline of tasks after all.

losipiuk commented 1 month ago

Our native file system stuff already supports local disk from all I know so the lift might not be that heavy

The fact that we can write data locally is mostly irrelevant here (this is trivial part of the problem). Actually current implementation of filesystem exchange can write data to local disks already, and we use that in tests.

The actual work to be done is to:

And with local disks it will not be really fault tolerant, as we loose exchange data when nodes goes down (probably fine if we talk about replacement to spill which is not fault tolerant anyway).

Still I think implementing an exchange wich can utilize local disks makes sense.

sopel39 commented 1 month ago

be able to read data which is written locally, remotely by other nodes

I think the idea is to not do that, but rather colocate execution with data. Still you need either remove write or read.

losipiuk commented 1 month ago

be able to read data which is written locally, remotely by other nodes

I think the idea is to not do that, but rather colocate execution with data. Still you need either remove write or read.

How would you if you are doing hash based join. Then you technically would end up with single node execution

sopel39 commented 1 month ago

How would you if you are doing hash based join. Then you technically would end up with single node execution

Depends, but I agree probably remote read makes more sense in practice