trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.15k stars 2.93k forks source link

Iceberg tables created with Glue catalog cannot be queried by Athena #15934

Closed ekorchison closed 9 months ago

ekorchison commented 1 year ago

Trino version: 406 (over JDBC) Athena engine version: 3

I couldn't spot any options that might work around this, so opening an issue here.

When creating an Iceberg table with Trino, the table is created in the Glue catalog with no StorageDescriptor. Athena appears to use the StorageDescriptor to enumerate columns, thus the resulting table cannot be queried by Athena by column name.


Minimal steps to reproduce:

1 - Create an Iceberg table from Trino that uses the Glue catalog

--trino
create table iceberg.aws_logs.issue (
    client_ip varchar
)
with (location = 's3://my-cool-bucket/5203acb7-115f-47f9-94ba-d0a39d7ab2b5/issue/')

2 - Access a column with Athena

--athena
select client_ip
from aws_logs.issue

PERMISSION_DENIED: Access Denied: Cannot select from columns [client_ip] in table or view aws_logs.issue

3 - Inspect table aws glue get-table --database-name=aws_logs --name=issue

{
    "Table": {
        "Name": "issue",
        "DatabaseName": "aws_logs",
        "Owner": "trino",
        "CreateTime": "2023-02-01T22:38:43+00:00",
        "UpdateTime": "2023-02-01T22:38:43+00:00",
        "Retention": 0,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {
            "metadata_location": "s3://my-cool-bucket/5203acb7-115f-47f9-94ba-d0a39d7ab2b5/issue/metadata/00000-70665bb8-84da-47b1-81b6-815f89053930.metadata.json",
            "table_type": "ICEBERG"
        },
        "CreatedBy": "<snip>",
        "IsRegisteredWithLakeFormation": false,
        "CatalogId": "<snip>",
        "VersionId": "0"
    }
}

Additionally, the table appears to have no columns in the Glue catalog and Athena interface. Performing a count(1) without accessing a column does work correctly.


Performing operations the other way around works as expected:

1 - Create an Iceberg table from Athena

--athena
create table aws_logs.non_issue(
    client_ip string
)
location 's3://my-cool-bucket/5203acb7-115f-47f9-94ba-d0a39d7ab2b5/non_issue/'
tblproperties ( 'table_type' = 'ICEBERG' )

2 - Access a column with Trino

--trino
select client_ip
from  iceberg.aws_logs.non_issue

Success: No data

3 - Inspect table aws glue get-table --database-name=aws_logs --name=non_issue

{
    "Table": {
        "Name": "non_issue",
        "DatabaseName": "aws_logs",
        "CreateTime": "2023-02-01T22:49:53+00:00",
        "UpdateTime": "2023-02-01T22:49:53+00:00",
        "Retention": 0,
        "StorageDescriptor": {
            "Columns": [
                {
                    "Name": "client_ip",
                    "Type": "string",
                    "Parameters": {
                        "iceberg.field.current": "true",
                        "iceberg.field.id": "1",
                        "iceberg.field.optional": "true"
                    }
                }
            ],
            "Location": "s3://my-cool-bucket/5203acb7-115f-47f9-94ba-d0a39d7ab2b5/non_issue",
            "Compressed": false,
            "NumberOfBuckets": 0,
            "SortColumns": [],
            "StoredAsSubDirectories": false
        },
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {
            "metadata_location": "s3://my-cool-bucket/5203acb7-115f-47f9-94ba-d0a39d7ab2b5/non_issue/metadata/00000-ceef2183-070e-4d4b-a51a-73fe6da340eb.metadata.json",
            "table_type": "ICEBERG"
        },
        "CreatedBy": "<snip>",
        "IsRegisteredWithLakeFormation": false,
        "CatalogId": "<snip>",
        "VersionId": "0"
    }
}
ekorchison commented 1 year ago

Java isn't my wheelhouse so I can't help much, but I'd have thought passing something extra through a .withStorageDescriptor() in GlueIcebergUtil would be the way to get it to work. As far as I can see, all the information in the StorageDescriptor can be derived from the Iceberg metadata which is available in GlueIcebergTableOperations.

alexjo2144 commented 1 year ago

Thanks for reporting, but I think this should be brought up with AWS support and fixed in Athena.

StorageDescriptor is an optional, nullable field in Glue. Even if it is set, the StorageDescriptor should likely be ignored for Iceberg tables, the source of truth for the table schema should be the Iceberg manifest, not the Glue metadata.

findepi commented 1 year ago

@pettyjamesm this sounds like an Athena's problem to me

jkleinkauff commented 1 year ago

Strange thing is, if a similar command were issued in Spark it will work and properly assign table schema and metadata on the Glue UI.

Screenshot 2023-02-05 at 22 28 39

Spark:

spark.sql("""
CREATE TABLE IF NOT EXISTS catalog.lakehouse.gold
(
    model string,
    price float
) 
USING iceberg
""")

Trino:

CREATE TABLE IF NOT EXISTS iceberg.lakehouse.gold2
(
    model varchar(20)
) ;
ekorchison commented 1 year ago

Indeed the source of truth is the manifest and these additional fields are intended for information/compatibility only. Work has been done in https://github.com/apache/iceberg/pull/3048, https://github.com/apache/iceberg/pull/3352 and https://github.com/apache/iceberg/pull/3887 to support this in the Iceberg-Glue catalog implementation, hence the current support in Spark.

To elaborate on my use case: I want to use Trino to perform ELT with the end result being the creation of Iceberg tables. I would like for these tables to be queryable from Athena, for exploration and visibility purposes, as access to the Trino cluster itself is limited.

findepi commented 1 year ago

Is there a reason why Athena should not follow the Iceberg specification? @pettyjamesm @JunhyungSong

alexjo2144 commented 1 year ago

PERMISSION_DENIED: Access Denied: Cannot select from columns [client_ip] in table or view aws_logs.issue

This also doesn't really say that it has anything to do with the StorageDescriptor to me. There may just be some permissions in Athena that you need to add?

jackye1995 commented 1 year ago

Hi, the error looks like you have some Lake Formation settings on the table and you need to update that.

If that does not work, can you send me the query ID and region? so we can take a further look. Thanks!

ekorchison commented 1 year ago

The timing is excellent, I was just about to mention you in a reply. 🙂

To my knowledge there shouldn't be any permissions - LakeFormation or otherwise - that are interfering here. I have tested it with liberal permissions. I did now notice that queries against the Trino-created table appear to work using Athena engine version 2, but do not work with version 3.

These are some query IDs in eu-central-1 showing the issue: Engine v2 (succeeds): 9e5a8ee0-c856-4f86-b773-903b2e58fce5 Engine v3 (fails): e4f01d44-d7c2-428d-bbb7-a9f250dcdc5a

jackye1995 commented 1 year ago

Checked your queries and tables involved. I think your table is created without a storage descriptor, and as a result we cannot find the column from security check perspective. We can fix that in the v3 engine to support tables without a storage descriptor.

We had a fix in our internal Trino version to populate storage descriptor, but never surfaced to OSS, we should probably do that and publish a PR here.

For other engines using Iceberg core library to create the table, this has been fixed since 0.13.0, so if you perform an update to the table it should fix this issue, for example: https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-table-data.html#querying-iceberg-working-with-lf-fgac

findepi commented 1 year ago

@jackye1995 not publishing columns to the storage descriptor is a conscious choice. We obey the Iceberg specification. The iceberg specification mandates that the table metadata (the files on storage) is the source of truth for the schema of the table. What would it take to make Athena and any related security checks respect the Iceberg specification as well?

dwolfeu commented 1 year ago

Is there an update? We have the same issue.

findepi commented 11 months ago

I may have accidentally fixed this issue with https://github.com/trinodb/trino/pull/18315.

ekorchison commented 11 months ago

I am on holiday at the moment, when I get back home I will try to reproduce this issue and update here accordingly. 🙂

lsabreu96 commented 9 months ago

Hello everyone I think I'm having the same issue. I'm running trino on EMR 6.11., I've set the hive.metastore.glue.default-warehouse property and hence I'm not setting in query runtime (have tried doing so also, tho). The table is created, but neither location nor schema can be seen in Glue and queried outside of Trino.

I though it could something security related and set iceberg.security = ALLOW_ALL, but it didn't solve.

My use case is that I have some Hudi tables in a Glue catalog and wanna use trino as Transform tool to run queries against'em and write as Iceberg. At first I though it could some Glue conflict, but even creating a simple select 1 as id fails

ekorchison commented 9 months ago

This seemed to have been fixed in Trino 423 by #18315 as @findepi mentioned.

@lsabreu96 I see EMR 6.11 is using Trino 410, and even the latest 6.14 is using 422 where the issue would still be present.


Given this table:

create table aws_logs_iceberg.default.issue (
    client_ip varchar
)
with (location = 's3://my-cool-bucket/5203acb7-115f-47f9-94ba-d0a39d7ab2b5/issue/');

In Trino 422:

--athena
select client_ip
from default.issue
COLUMN_NOT_FOUND: Column 'client_ip' cannot be resolved or requester is not authorized to access requested resources
aws glue get-table --database-name=default --name=issue
{
    "Table": {
        "Name": "issue",
        "DatabaseName": "default",
        "Owner": "trino",
        "CreateTime": "2023-11-16T14:39:33+00:00",
        "UpdateTime": "2023-11-16T14:39:33+00:00",
        "Retention": 0,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {
            "metadata_location": "s3://my-cool-bucket/5203acb7-115f-47f9-94ba-d0a39d7ab2b5/issue/metadata/00000-d38646a0-096a-44e3-a289-871aabfb9045.metadata.json",
            "table_type": "ICEBERG"
        },
        "CreatedBy": "<snip>",
        "IsRegisteredWithLakeFormation": false,
        "CatalogId": "<snip>",
        "VersionId": "0"
    }
}

In Trino 423:

--athena
select client_ip
from default.issue
No results
aws glue get-table --database-name=default --name=issue
{
    "Table": {
        "Name": "issue",
        "DatabaseName": "default",
        "Owner": "trino",
        "CreateTime": "2023-11-16T14:45:27+00:00",
        "UpdateTime": "2023-11-16T14:45:27+00:00",
        "Retention": 0,
        "StorageDescriptor": {
            "Columns": [
                {
                    "Name": "client_ip",
                    "Type": "string",
                    "Parameters": {
                        "trino_type_id": "varchar"
                    }
                }
            ],
            "Compressed": false,
            "NumberOfBuckets": 0,
            "SortColumns": [],
            "StoredAsSubDirectories": false
        },
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {
            "metadata_location": "s3://my-cool-bucket/5203acb7-115f-47f9-94ba-d0a39d7ab2b5/issue/metadata/00000-7f6449e4-433e-4e86-bb49-d472e3e3518a.metadata.json",
            "trino_table_metadata_info_valid_for": "s3://my-cool-bucket/5203acb7-115f-47f9-94ba-d0a39d7ab2b5/issue/metadata/00000-7f6449e4-433e-4e86-bb49-d472e3e3518a.metadata.json",
            "table_type": "ICEBERG"
        },
        "CreatedBy": "<snip>",
        "IsRegisteredWithLakeFormation": false,
        "CatalogId": "<snip>",
        "VersionId": "0"
    }
}
lsabreu96 commented 9 months ago

@ekorchison , got it ! I was about to spin trino on EKS to try another version

lsabreu96 commented 9 months ago

It worked. Thanks

As of EMR 6.15 trino 424 is available and has the fix

ekorchison commented 9 months ago

Sweet, I'll close this issue off then!