Closed ekorchison closed 9 months ago
Java isn't my wheelhouse so I can't help much, but I'd have thought passing something extra through a .withStorageDescriptor()
in GlueIcebergUtil would be the way to get it to work. As far as I can see, all the information in the StorageDescriptor can be derived from the Iceberg metadata which is available in GlueIcebergTableOperations.
Thanks for reporting, but I think this should be brought up with AWS support and fixed in Athena.
StorageDescriptor
is an optional, nullable field in Glue. Even if it is set, the StorageDescriptor
should likely be ignored for Iceberg tables, the source of truth for the table schema should be the Iceberg manifest, not the Glue metadata.
@pettyjamesm this sounds like an Athena's problem to me
Strange thing is, if a similar command were issued in Spark it will work and properly assign table schema and metadata on the Glue UI.
Spark:
spark.sql("""
CREATE TABLE IF NOT EXISTS catalog.lakehouse.gold
(
model string,
price float
)
USING iceberg
""")
Trino:
CREATE TABLE IF NOT EXISTS iceberg.lakehouse.gold2
(
model varchar(20)
) ;
Indeed the source of truth is the manifest and these additional fields are intended for information/compatibility only. Work has been done in https://github.com/apache/iceberg/pull/3048, https://github.com/apache/iceberg/pull/3352 and https://github.com/apache/iceberg/pull/3887 to support this in the Iceberg-Glue catalog implementation, hence the current support in Spark.
To elaborate on my use case: I want to use Trino to perform ELT with the end result being the creation of Iceberg tables. I would like for these tables to be queryable from Athena, for exploration and visibility purposes, as access to the Trino cluster itself is limited.
Is there a reason why Athena should not follow the Iceberg specification? @pettyjamesm @JunhyungSong
PERMISSION_DENIED: Access Denied: Cannot select from columns [client_ip] in table or view aws_logs.issue
This also doesn't really say that it has anything to do with the StorageDescriptor
to me. There may just be some permissions in Athena that you need to add?
Hi, the error looks like you have some Lake Formation settings on the table and you need to update that.
If that does not work, can you send me the query ID and region? so we can take a further look. Thanks!
The timing is excellent, I was just about to mention you in a reply. 🙂
To my knowledge there shouldn't be any permissions - LakeFormation or otherwise - that are interfering here. I have tested it with liberal permissions. I did now notice that queries against the Trino-created table appear to work using Athena engine version 2, but do not work with version 3.
These are some query IDs in eu-central-1
showing the issue:
Engine v2 (succeeds): 9e5a8ee0-c856-4f86-b773-903b2e58fce5
Engine v3 (fails): e4f01d44-d7c2-428d-bbb7-a9f250dcdc5a
Checked your queries and tables involved. I think your table is created without a storage descriptor, and as a result we cannot find the column from security check perspective. We can fix that in the v3 engine to support tables without a storage descriptor.
We had a fix in our internal Trino version to populate storage descriptor, but never surfaced to OSS, we should probably do that and publish a PR here.
For other engines using Iceberg core library to create the table, this has been fixed since 0.13.0, so if you perform an update to the table it should fix this issue, for example: https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-table-data.html#querying-iceberg-working-with-lf-fgac
@jackye1995 not publishing columns to the storage descriptor is a conscious choice. We obey the Iceberg specification. The iceberg specification mandates that the table metadata (the files on storage) is the source of truth for the schema of the table. What would it take to make Athena and any related security checks respect the Iceberg specification as well?
Is there an update? We have the same issue.
I may have accidentally fixed this issue with https://github.com/trinodb/trino/pull/18315.
I am on holiday at the moment, when I get back home I will try to reproduce this issue and update here accordingly. 🙂
Hello everyone I think I'm having the same issue. I'm running trino on EMR 6.11., I've set the hive.metastore.glue.default-warehouse property and hence I'm not setting in query runtime (have tried doing so also, tho). The table is created, but neither location nor schema can be seen in Glue and queried outside of Trino.
I though it could something security related and set iceberg.security = ALLOW_ALL
, but it didn't solve.
My use case is that I have some Hudi tables in a Glue catalog and wanna use trino as Transform tool to run queries against'em and write as Iceberg. At first I though it could some Glue conflict, but even creating a simple select 1 as id fails
This seemed to have been fixed in Trino 423 by #18315 as @findepi mentioned.
@lsabreu96 I see EMR 6.11 is using Trino 410, and even the latest 6.14 is using 422 where the issue would still be present.
Given this table:
create table aws_logs_iceberg.default.issue (
client_ip varchar
)
with (location = 's3://my-cool-bucket/5203acb7-115f-47f9-94ba-d0a39d7ab2b5/issue/');
In Trino 422:
--athena
select client_ip
from default.issue
COLUMN_NOT_FOUND: Column 'client_ip' cannot be resolved or requester is not authorized to access requested resources
aws glue get-table --database-name=default --name=issue
{
"Table": {
"Name": "issue",
"DatabaseName": "default",
"Owner": "trino",
"CreateTime": "2023-11-16T14:39:33+00:00",
"UpdateTime": "2023-11-16T14:39:33+00:00",
"Retention": 0,
"TableType": "EXTERNAL_TABLE",
"Parameters": {
"metadata_location": "s3://my-cool-bucket/5203acb7-115f-47f9-94ba-d0a39d7ab2b5/issue/metadata/00000-d38646a0-096a-44e3-a289-871aabfb9045.metadata.json",
"table_type": "ICEBERG"
},
"CreatedBy": "<snip>",
"IsRegisteredWithLakeFormation": false,
"CatalogId": "<snip>",
"VersionId": "0"
}
}
In Trino 423:
--athena
select client_ip
from default.issue
No results
aws glue get-table --database-name=default --name=issue
{
"Table": {
"Name": "issue",
"DatabaseName": "default",
"Owner": "trino",
"CreateTime": "2023-11-16T14:45:27+00:00",
"UpdateTime": "2023-11-16T14:45:27+00:00",
"Retention": 0,
"StorageDescriptor": {
"Columns": [
{
"Name": "client_ip",
"Type": "string",
"Parameters": {
"trino_type_id": "varchar"
}
}
],
"Compressed": false,
"NumberOfBuckets": 0,
"SortColumns": [],
"StoredAsSubDirectories": false
},
"TableType": "EXTERNAL_TABLE",
"Parameters": {
"metadata_location": "s3://my-cool-bucket/5203acb7-115f-47f9-94ba-d0a39d7ab2b5/issue/metadata/00000-7f6449e4-433e-4e86-bb49-d472e3e3518a.metadata.json",
"trino_table_metadata_info_valid_for": "s3://my-cool-bucket/5203acb7-115f-47f9-94ba-d0a39d7ab2b5/issue/metadata/00000-7f6449e4-433e-4e86-bb49-d472e3e3518a.metadata.json",
"table_type": "ICEBERG"
},
"CreatedBy": "<snip>",
"IsRegisteredWithLakeFormation": false,
"CatalogId": "<snip>",
"VersionId": "0"
}
}
@ekorchison , got it ! I was about to spin trino on EKS to try another version
It worked. Thanks
As of EMR 6.15 trino 424 is available and has the fix
Sweet, I'll close this issue off then!
Trino version: 406 (over JDBC) Athena engine version: 3
I couldn't spot any options that might work around this, so opening an issue here.
When creating an Iceberg table with Trino, the table is created in the Glue catalog with no StorageDescriptor. Athena appears to use the StorageDescriptor to enumerate columns, thus the resulting table cannot be queried by Athena by column name.
Minimal steps to reproduce:
1 - Create an Iceberg table from Trino that uses the Glue catalog
2 - Access a column with Athena
PERMISSION_DENIED: Access Denied: Cannot select from columns [client_ip] in table or view aws_logs.issue
3 - Inspect table
aws glue get-table --database-name=aws_logs --name=issue
Additionally, the table appears to have no columns in the Glue catalog and Athena interface. Performing a
count(1)
without accessing a column does work correctly.Performing operations the other way around works as expected:
1 - Create an Iceberg table from Athena
2 - Access a column with Trino
Success:
No data
3 - Inspect table
aws glue get-table --database-name=aws_logs --name=non_issue