Open mosabua opened 1 month ago
I am not convinced the FTE mode is already superior over spill to disk functionality for sufficient percentage of workloads.
cc people involved in https://github.com/trinodb/trino/pull/22843 -- @electrum @losipiuk @wendigo and probably interested -- @pettyjamesm and possibly interested -- @trinodb/maintainers
From the discussion we concluded that the whole feature should actually be deprecated and later removed. This roadmap item tries to help with the implementation of that effort.
I think the criteria for removal should be (partial) perf and cost parity with FTE. I think we are long way from there TBH
I don't think OSS FTE is alternative to spill-to-disk atm considering perf, cost and deployment complexity. If just support for big queries is needed without resilience, spill-to-disk might be better choice currently
I don't think OSS FTE is alternative to spill-to-disk atm considering perf, cost and deployment complexity. If just support for big queries is needed without resilience, spill-to-disk might be better choice currently
deployment is actually trivial - you just set the S3 spooling location. But the incurred cost is significiant with Exchange implementation we have right now. Also latency is harmed; throughtput not that much.
deployment is actually trivial - you just set the S3 spooling location...
Yup. It seems to be classical problem of choosing two of three: speed, low compexity, low cost. And not choosing "low complexity" involves developing new stuff in OSS
I don't think OSS FTE is alternative to spill-to-disk atm considering perf, cost and deployment complexity. If just support for big queries is needed without resilience, spill-to-disk might be better choice currently
Spill-to-disk also adds cost of having local disks, which are going to sit mostly unused. If a workload is frequently spilling to disk, then the performance tends to be really bad. What if we provide an implementation of FTE exchange which can work with local disks and optionally spill to object storage if needed (similar to Apache Hive and Spark shuffle service) ? I think that would make deployment and cost the same, with some hit to perf. The perf hit could be mitigated by optimizations specific to FTE like adaptive query optimization. It also has more predictable perf than spill which would slow execution down by a lot more than FTE when there is actual spilling.
We already made an architectural decision years ago of leaving spill-to-disk behind by introducing join optimzations only for non-spilling join operator and making that the default. We've also been recommending users to switch to FTE from spill-to-disk wherever possible. So at this point we should have some concrete action items for the minimal things we should do in FTE to remove spill-to-disk.
Removing a feature should have a very high bar. Currently spill-to-disk feature does not meet the bar because of reasons mentioned above. I do not think it is the right thing to do to remove spill-to-disk feature completely.
@cshao239 keeping unmaintained and not recommended code doesn't bring any value beside headaches
it would be good to clarify in what sense the spilling code is considered unmaintained. for example, it has tests that are run as part of CI, so it can be seen as maintained.
We don't have to remove spill to disk immediately. But we should define a concrete target for when we would consider FTE to be a reasonable replacement for spill-to-disk. The bar cannot be that FTE should be the same or better in every single way than spill-to-disk. Ideally, we should have concrete GH issues defined for what is needed to make this transition happen smoothly. Alternatively, if none of the ideas for code changes to improve FTE can ever justify FTE replacing spill-to-disk, then we should talk about restoring spill-to-disk to being first class citizen in the code and in our recommendations to users.
in what sense the spilling code is considered unmaintained
We aren't actively improving it. We've forked the join operator, put the spilling version behind a flag and optimized only the unspilled version. We haven't benchmarked spilling in a long time. I don't think this is an important point though, an unmaintained thing can become maintained if we choose to start working on it again. The important question is whether FTE can ever supercede spill-to-disk and what it would take to do so.
Just to be clear .. there is no immediate need for this removal but directionally we are aimning that direction. At this stage we still very much believe that new users should avoid spilling to disk so the doc update stands correct.
Thank you all for the discussion here.. keep it going and we will come up with a good plan that works for everyone. We do need to know your thoughts and usage to make that decision.
Now also read all the threads .. I think adding support for a local exchange could be really great @raunaqmorarka . Our native file system stuff already supports local disk from all I know so the lift might not be that heavy. cc @electrum
I've added [ ] Make FTE (almost) cost and performance equivalent to spill-to-disk.
which might be satisfied by on-disk FTE
I don't think that Make FTE (almost) cost and performance equivalent to spill-to-disk
is specific enough.
We should talk about details, how do you measure the cost and on what benchmark ?
If it's not an empirical target, but one based on high level design trade-offs then we need to specify which features need to be implemented to meet expectations.
I updated the description along that line @raunaqmorarka .. feel free to provide more details in the description as well .. or add any specific tasks. This is just a high-level outline of tasks after all.
Our native file system stuff already supports local disk from all I know so the lift might not be that heavy
The fact that we can write data locally is mostly irrelevant here (this is trivial part of the problem). Actually current implementation of filesystem exchange can write data to local disks already, and we use that in tests.
The actual work to be done is to:
And with local disks it will not be really fault tolerant, as we loose exchange data when nodes goes down (probably fine if we talk about replacement to spill which is not fault tolerant anyway).
Still I think implementing an exchange wich can utilize local disks makes sense.
be able to read data which is written locally, remotely by other nodes
I think the idea is to not do that, but rather colocate execution with data. Still you need either remove write or read.
be able to read data which is written locally, remotely by other nodes
I think the idea is to not do that, but rather colocate execution with data. Still you need either remove write or read.
How would you if you are doing hash based join. Then you technically would end up with single node execution
How would you if you are doing hash based join. Then you technically would end up with single node execution
Depends, but I agree probably remote read makes more sense in practice
Docs available at https://trino.io/docs/current/admin/spill.html and https://trino.io/docs/current/admin/properties-spilling.html
The feature has been unmaintained and considered legacy for a long time now. With https://github.com/trinodb/trino/pull/22843 we made an official doc update to suggest users to move to FTE and not use spilling.
The docs update ships with Trino 454.
From the discussion we concluded that the whole feature should actually be deprecated and later removed. This roadmap item tries to help with the implementation of that effort.
High-level tasks: