trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.26k stars 2.95k forks source link

Support for storage table management of materialized views #21797

Open mosabua opened 5 months ago

mosabua commented 5 months ago

Materialized views create storage tables transparently in Trino. However these tables are completely hidden in Trino and not accessible directly. As a result they can not be managed within Trino itself.

Use cases for this management include:

Potential approaches:

findepi commented 4 months ago

However these tables are completely hidden in Trino and not accessible directly.

Yes. The fact that the storage exists is a necessity, but what the storage is -- this is abstracted.

Use cases for this management include:

  • Improve performance by adding partitioning of the table
  • Run optimizes on the table or parts of the table
  • Remove snapshots of the table

That would break the abstraction, tying us to the current implementation.

The necessary storage optimizations should be (and probably are) carried automatically by MV refreshes. Partitioning is Iceberg MV concept and should be alterable via ALTER MATERIALIZED VIEW SET PROPERTIES.

mosabua commented 4 months ago

Maybe closely moving it to storage table semantics breaks the abstraction, but that is NOT necessary. The abstraction can stay with the fact that it is a materialized view and not a view. That materializations automatically has needs for maintenance and the relevant tasks should be possible in Trino.

Currently users are forced to use the tooling from the underlying Iceberg system to perform this maintenance. That is a bad user experience and also a more brittle approach. The refresh does NOT currently perform necessary maintenance and users are asking for more control.

osscm commented 4 months ago

The refresh does NOT currently perform necessary maintenance and users are asking for more control. yes, looks like current refresh does not support.

May be, keeping optimization separate than refresh can be handy, as users might want to only metadata optimization, instead of refreshing the MV. As users currently does optimization separately, and even build tools/eco system to do table maintenance (typically uses Spark), so can schedule MV optimization using the same logistics.

findepi commented 4 months ago

The refresh does NOT currently perform necessary maintenance and users are asking for more control.

Currently, the refresh writes data anew (like CTAS). In Hive/Iceberg/Delta CTAS is supposed to write data in a form that's decent for querying. If this is not the case, all tables written by Trino are underperforming and we need to fix the CTAS/INSERT flow in the Iceberg connector.

mosabua commented 4 months ago

Sure .. if we can improve the writing and inserting that would be good. At the same time we should allow maintenance on materialized views and not force users into some hacks with other tools on the underlying storage to work around problems.

rstyp commented 4 months ago

Here are some points for consideration:

vgankidi commented 4 months ago

As we are adding support for incremental refresh, we may have to optimize the data files in the MV too. As @rstyp mentioned metadata compaction may help with planning times as well as the metadata grows. Currently no table maintenance operations can be done on the underlying iceberg tables of MV.

findepi commented 4 months ago

Isn't metadata compaction implicit in Iceberg?

mosabua commented 4 months ago

Agreed to proceed with ALTER MATERIALIZED VIEW EXECUTE as approach in TCC. Suggestions is to follow the path of grammar PR, an SPI PR, and then a connector implementation PR. cc @dain @martint

https://github.com/trinodb/trino/wiki/Contributor-meetings#trino-contributor-call-23-may-2024

@osscm and team are going to start looking at working on this