trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

Iceberg Hadoop catalog support #9256

Open jackye1995 opened 2 years ago

jackye1995 commented 2 years ago

This issue tracks the support of Hadoop catalog in Iceberg connector.

Currently I do not see many production use cases for the Hadoop catalog; it has quite a few limitations, such as issues with renames and atomic commits. So I wonder if there is a good reason to add this in Trino.

cccs-tom you mentioned you are thinking about adding this support, could you provide more details? Do you currently have a production use case for it?

@electrum @phd3 @findepi @ckdarby @dungdm93 @losipiuk

cccs-tom commented 2 years ago

Sorry, I missed this issue until now @jackye1995. Yes, we do currently have a production use case for this. I realize it's a bit of an edge case, but here it is:

We have two distinct AKS clusters, each with its own deployment of our tool stack (including Trino). One cluster is open to a wide audience and the other is more restrictive. The open cluster has no access to the restricted cluster, but the restricted cluster does have read-only access to the open cluster. Consequently, the open cluster's ingestion mechanism has no way of updating the restricted cluster's Hive / other catalog when it creates a new partition.

We do have use cases where the restricted cluster needs access to the data on the open cluster, and we would much rather not duplicate the ingestion and storage of that data (it is quite high-volume). So having the catalog co-located with the data in the shared Storage Account container makes a lot of sense for us.
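
For illustration, a minimal sketch of what "catalog co-located with the data" looks like with Iceberg's HadoopCatalog Java API (the container path, namespace, and table name below are placeholders, not our actual layout):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;

public class SharedWarehouseSketch {
    public static void main(String[] args) {
        // The catalog is just a directory layout under the warehouse path, so any
        // cluster with (read-only) access to the storage container can load tables
        // without talking to a metastore service.
        Configuration conf = new Configuration();
        HadoopCatalog catalog = new HadoopCatalog(conf,
                "abfss://shared@example.dfs.core.windows.net/warehouse");

        // Loading a table only requires reading metadata files under the table location.
        Table table = catalog.loadTable(TableIdentifier.of("analytics", "events"));
        System.out.println(table.currentSnapshot());
    }
}
```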

cccs-tom commented 2 years ago

@jackye1995 I should probably have specified, too... Our use case includes several tables that require Schema Evolution.
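
For illustration, schema evolution on a filesystem-backed table goes through the same Iceberg UpdateSchema API as with any other catalog; a minimal sketch (the column names are hypothetical):

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.types.Types;

class SchemaEvolutionSketch {
    // Assumes 'table' was loaded from the catalog as in the previous sketch.
    static void evolve(Table table) {
        // Iceberg tracks columns by ID, so adds and renames are metadata-only
        // operations and do not require rewriting existing data files.
        table.updateSchema()
                .addColumn("ingest_source", Types.StringType.get())
                .renameColumn("ts", "event_ts")
                .commit();
    }
}
```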

jackye1995 commented 2 years ago

@cccs-tom yeah sorry I thought I mentioned you but I forgot to put the @.

So are you using HadoopCatalog now for both of the clusters, or HiveCatalog? Is that a Trino cluster or Spark for write and Trino for read, or something else?

cccs-tom commented 2 years ago

@jackye1995 No worries.

Yes, we use HadoopCatalog for both clusters - no Hive. And you are correct that our ingestion pipelines do their writing with Spark. However, we also create a separate catalog that is writable for analysts so they can use it as a "sandbox" to write results when they need to. So writing is important to our use cases (I assume that's why you were asking?).
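
For context, the Spark side of a setup like this is mostly just catalog configuration; a rough sketch, assuming a hypothetical catalog name "sandbox" and placeholder container/table names:

```java
import org.apache.spark.sql.SparkSession;

public class SandboxCatalogSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("iceberg-hadoop-catalog-sandbox")
                // Register a filesystem-backed ("hadoop") Iceberg catalog named "sandbox".
                .config("spark.sql.catalog.sandbox", "org.apache.iceberg.spark.SparkCatalog")
                .config("spark.sql.catalog.sandbox.type", "hadoop")
                .config("spark.sql.catalog.sandbox.warehouse",
                        "abfss://shared@example.dfs.core.windows.net/sandbox-warehouse")
                .getOrCreate();

        // Analysts can then write results into the filesystem-backed catalog with plain SQL.
        spark.sql("CREATE NAMESPACE IF NOT EXISTS sandbox.results");
        spark.sql("CREATE TABLE IF NOT EXISTS sandbox.results.daily_counts (day DATE, cnt BIGINT) USING iceberg");
    }
}
```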

Our "prod" is currently somewhere between Alpha and Beta, so we are actually using a custom build of Trino 359 with cherry-picked commits from an early version of your #6977 work. It's been working relatively well, with a few known limitations, but it's not ideal. And obviously, that code has changed a lot since then, so we will have to put some effort into HadoopCatalog the next time we want to upgrade (we held off on 361 because of the issues with Parquet but will probably tackle upgrading to 362 next sprint).

cccs-tom commented 2 years ago

@alexjo2144 I know you took over the Glue work from @jackye1995. Is there any chance that Hadoop catalog is also on your roadmap? My team still has a prod use case for this and we are still available to contribute to this effort.

findepi commented 2 years ago

Hadoop catalog is not on our roadmap.

cccs-tom commented 2 years ago

> Hadoop catalog is not on our roadmap.

Thanks for the quick reply @findepi. Just to clarify, does "not on our roadmap" mean "it's not a priority for our team but we are open to PRs"? Or is a contribution out of the question as well?

alexjo2144 commented 2 years ago

The team I'm working with here at Starburst is not planning on working on it, but contributions are always welcome. If you decide to take a shot at implementing it yourself, let me know if you have any questions about how the Glue version worked, or if you need reviews.

cccs-tom commented 2 years ago

Sounds good. I am by no means an expert at this, but I will give it a shot. I may only get a chance to tackle it in a month or two, though. And I will almost surely have questions and take you up on your offer! 😄

BsoBird commented 2 months ago

Hi. Currently we make heavy use of Hadoop catalog Iceberg tables (FileSystem catalog tables).

One of the main reasons for using the Hadoop catalog is that we want to get rid of the dependency on a particular version of HMS, Nessie, etc.

A single Iceberg table can then be accessed by multiple versions of Hive, as well as multiple versions of Spark/Trino and other middleware. Under this premise, we do not need to worry about HMS, Nessie, or other middleware upgrades, nor about the compatibility of such upgrades with the existing cluster.

We can keep deploying new versions of middleware without the risk of metadata corruption or data loss.

Although the FileSystem catalog table has limitations, we can live with the fact that tables cannot be renamed. In a way, we use the FileSystem catalog table as an enhanced external table.

@jackye1995 @cccs-tom @alexjo2144 @findepi @ragnard

BsoBird commented 2 months ago

At the moment, the only remaining problems are concurrency and atomicity issues with FileSystemCatalogTable commits. I submitted a PR to fix this, and we use it in our production environment, where it works fine. On a filesystem we cannot prevent clients from committing in parallel, but we can avoid data corruption and at least reject commits that do not meet expectations.
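
To illustrate the idea only (this is a simplified sketch, not the actual PR), a filesystem commit can be guarded by writing the new metadata to a temporary file and then renaming it into a versioned path, rejecting the commit if the destination already exists:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RenameCommitSketch {
    /**
     * Publish a new metadata version by renaming a fully written temp file into place.
     * If another writer already created the destination, the commit is rejected instead
     * of silently overwriting the competing writer's version.
     */
    static boolean publish(Configuration conf, Path tempMetadata, Path versionedMetadata) throws IOException {
        FileSystem fs = versionedMetadata.getFileSystem(conf);
        if (fs.exists(versionedMetadata)) {
            return false; // lost the race: another client committed this version first
        }
        // On HDFS, rename to a non-existing destination is atomic; object stores without
        // atomic rename are exactly where the filesystem catalog gets fragile.
        return fs.rename(tempMetadata, versionedMetadata);
    }
}
```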

BsoBird commented 2 months ago

Therefore, FileSystemCatalogTable is usable as a basic, lightweight catalog implementation (missing some advanced features), and it can also be used in production environments.

BsoBird commented 2 months ago

In addition, to accelerate access to metadata in FileSystemCatalog, we can adopt a solution similar to HMS: load the metadata from the file system into an HMS-backed cache (for example via an ANALYZE TABLE command). Many communities are gradually adopting this approach, such as HBase (HBase will stop storing any metadata in ZooKeeper; ZooKeeper will only be used as a cache). With this solution we can both reduce scaling costs (metadata access still goes through pre-existing components like HMS, so there is no need to develop a new catalog management module) and mitigate potentially slow filesystem access.
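
As an aside, Iceberg also ships an in-process CachingCatalog wrapper. It is not the HMS-backed cache described above, but it illustrates the same idea of serving repeated metadata lookups from a cache instead of the filesystem; a minimal sketch (warehouse path and table name are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.CachingCatalog;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;

public class CachedCatalogSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        Catalog fsCatalog = new HadoopCatalog(conf,
                "abfss://shared@example.dfs.core.windows.net/warehouse");

        // Wrap the filesystem catalog so repeated loads of the same table are served
        // from an in-memory cache instead of re-reading metadata files from storage.
        Catalog cached = CachingCatalog.wrap(fsCatalog);

        cached.loadTable(TableIdentifier.of("analytics", "events")); // reads from storage
        cached.loadTable(TableIdentifier.of("analytics", "events")); // served from the cache
    }
}
```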