Open cyofeiyue opened 3 years ago
i agree this could be useful. there are some additional security implications when using shared storage for spilling, but this can be offset with encryption, if needed
cc @sopel39
I think this isn't very useful if you are already running on the cloud (since you can just add machines) - spilling to remote storage will add latency and instability (network communication is inherently unstable). See also https://trinodb.slack.com/archives/CGB0QHWSW/p1626963074086200.
spilling to remote storage will add latency
@hashhar i agree
and instability (network communication is inherently unstable).
i agree
isn't very useful if you are already running on the cloud (since you can just add machines)
i do not agree. You cannot add machines while a query is running (ie you can, but eg join or aggregation won't become more distributed) So, you may pick you your cluster size according to your best guess, but still could want to enable spill for edge cases.
And remote storage has the benefit that you don't need to pay for it upfront, as compared with eg EBS volumes.
In the scenario where my company uses trino, the hard disk of the cloud machine is small for some reasons, and there is only one disk, writing trino logs, GC logs will increase disk‘s load. If splling to disk switch is turned on, load will be further increased,Sacrificing some query latency to ensure that the query ends successfully may be an acceptable solution
i found similar mechanism in impala4.0 https://issues.apache.org/jira/browse/IMPALA-9867
trino will spill data to the local disk (spill to disk) when the memory is insufficient. is it a good idea to support overwriting to remote S3 storage. i think the feature is particularly useful when deploying trino on the cloud, because the local storage space of an instance on the cloud may be small or has only one disk device