We have PBs of clickstream data, and we need to retain only the last x months of it in the HOT tier to facilitate querying; data older than that rolling window should be archived. Since the archived data remains part of the same table, Delta on Databricks is aware of the archival setting applied at the table level and skips files that have surpassed y days since creation. Querying the same table on Trino, however, fails because those files have been moved to the archive tier.
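For context, the table-level archival setting referenced above is configured on Databricks roughly as follows (a sketch; the table name is taken from the Trino query below and the interval value is illustrative — the actual tiering itself is done separately, e.g. by an Azure Storage lifecycle management policy):

```sql
-- Databricks SQL: tell Delta to treat files older than 180 days as archived,
-- so the engine skips them instead of failing on reads
ALTER TABLE delta_prod.dev_30.archive_test1
  SET TBLPROPERTIES ('delta.timeUntilArchived' = '180 days');
```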
Trino queries fail to identify archived data on Azure Blob Storage for tables with the `delta.timeUntilArchived` property set on Databricks.
Query output before archival period set for table
trino> select *, "$path" from delta_prod.dev_30.archive_test1;
Query 20240620_151057_00193_xxrpp failed: Error opening Hive split abfs://container@account.dfs.core.windows.net/prod-data/dev_30/archive_test1/20240620_092543_00115_xxrpp_ec04d50b-e77a-4fbf-aaa7-f4c4c45a04ae (offset=0, length=202): Read 202 tail bytes of file abfs://container@account.dfs.core.windows.net/prod-data/dev_30/archive_test1/20240620_092543_00115_xxrpp_ec04d50b-e77a-4fbf-aaa7-f4c4c45a04ae failed: Operation failed: "This operation is not permitted on an archived blob.", 409, GET, https://mdpservingdev30.dfs.core.windows.net/bronze/prod-data/dev_30/archive_test1/20240620_092543_00115_xxrpp_ec04d50b-e77a-4fbf-aaa7-f4c4c45a04ae?timeout=90, BlobArchived, "This operation is not permitted on an archived blob. RequestId:ed2b1f76-301f-0070-2624-c3d77b000000 Time:2024-06-20T15:10:58.0165890Z"
io.trino.spi.TrinoException: Error opening Hive split abfs://container@account.dfs.core.windows.net/prod-data/dev_30/archive_test1/20240620_092543_00115_xxrpp_ec04d50b-e77a-4fbf-aaa7-f4c4c45a04ae (offset=0, length=202): Read 202 tail bytes of file abfs://container@account.dfs.core.windows.net/prod-data/dev_30/archive_test1/20240620_092543_00115_xxrpp_ec04d50b-e77a-4fbf-aaa7-f4c4c45a04ae failed: Operation failed: "This operation is not permitted on an archived blob.", 409, GET, https://mdpservingdev30.dfs.core.windows.net/bronze/prod-data/dev_30/archive_test1/20240620_092543_00115_xxrpp_ec04d50b-e77a-4fbf-aaa7-f4c4c45a04ae?timeout=90, BlobArchived, "This operation is not permitted on an archived blob. RequestId:ed2b1f76-301f-0070-2624-c3d77b000000 Time:2024-06-20T15:10:58.0165890Z"
at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:306)
at io.trino.plugin.deltalake.DeltaLakePageSourceProvider.createPageSource(DeltaLakePageSourceProvider.java:245)
at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:48)
at io.trino.split.PageSourceManager.createPageSource(PageSourceManager.java:61)
at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:264)
at io.trino.operator.Driver.processInternal(Driver.java:403)
at io.trino.operator.Driver.lambda$process$8(Driver.java:306)
at io.trino.operator.Driver.tryWithLock(Driver.java:709)
at io.trino.operator.Driver.process(Driver.java:298)
at io.trino.operator.Driver.processForDuration(Driver.java:269)
at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:890)
at io.trino.execution.executor.timesharing.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:187)
at io.trino.execution.executor.timesharing.TimeSharingTaskExecutor$TaskRunner.run(TimeSharingTaskExecutor.java:565)
at io.trino.$gen.Trino_448____20240607_120941_2.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1570)
Caused by: java.io.IOException: Read 202 tail bytes of file abfs://container@account.dfs.core.windows.net/prod-data/dev_30/archive_test1/20240620_092543_00115_xxrpp_ec04d50b-e77a-4fbf-aaa7-f4c4c45a04ae failed: Operation failed: "This operation is not permitted on an archived blob.", 409, GET, https://mdpservingdev30.dfs.core.windows.net/bronze/prod-data/dev_30/archive_test1/20240620_092543_00115_xxrpp_ec04d50b-e77a-4fbf-aaa7-f4c4c45a04ae?timeout=90, BlobArchived, "This operation is not permitted on an archived blob. RequestId:ed2b1f76-301f-0070-2624-c3d77b000000 Time:2024-06-20T15:10:58.0165890Z"
at io.trino.filesystem.hdfs.HdfsInput.readTail(HdfsInput.java:71)
at io.trino.filesystem.TrinoInput.readTail(TrinoInput.java:43)
at io.trino.filesystem.tracing.TracingInput.lambda$readTail$3(TracingInput.java:81)
at io.trino.filesystem.tracing.Tracing.withTracing(Tracing.java:47)
at io.trino.filesystem.tracing.TracingInput.readTail(TracingInput.java:81)
at io.trino.plugin.hive.parquet.MemoryParquetDataSource.<init>(MemoryParquetDataSource.java:56)
at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createDataSource(ParquetPageSourceFactory.java:321)
at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:224)
... 16 more
Caused by: java.io.IOException: Operation failed: "This operation is not permitted on an archived blob.", 409, GET, https://mdpservingdev30.dfs.core.windows.net/bronze/prod-data/dev_30/archive_test1/20240620_092543_00115_xxrpp_ec04d50b-e77a-4fbf-aaa7-f4c4c45a04ae?timeout=90, BlobArchived, "This operation is not permitted on an archived blob. RequestId:ed2b1f76-301f-0070-2624-c3d77b000000 Time:2024-06-20T15:10:58.0165890Z"
at org.apache.hadoop.fs.azurebfs.services.AbfsInputStream.readRemote(AbfsInputStream.java:559)
at org.apache.hadoop.fs.azurebfs.services.AbfsInputStream.readInternal(AbfsInputStream.java:511)
at org.apache.hadoop.fs.azurebfs.services.AbfsInputStream.readOneBlock(AbfsInputStream.java:318)
at org.apache.hadoop.fs.azurebfs.services.AbfsInputStream.read(AbfsInputStream.java:264)
at java.base/java.io.DataInputStream.read(DataInputStream.java:158)
at java.base/java.io.DataInputStream.read(DataInputStream.java:158)
at io.trino.hdfs.FSDataInputStreamTail.readTail(FSDataInputStreamTail.java:59)
at io.trino.filesystem.hdfs.HdfsInput.readTail(HdfsInput.java:63)
... 23 more
Caused by: org.apache.hadoop.fs.azurebfs.contracts.exceptions.AbfsRestOperationException: Operation failed: "This operation is not permitted on an archived blob.", 409, GET, https://mdpservingdev30.dfs.core.windows.net/bronze/prod-data/dev_30/archive_test1/20240620_092543_00115_xxrpp_ec04d50b-e77a-4fbf-aaa7-f4c4c45a04ae?timeout=90, BlobArchived, "This operation is not permitted on an archived blob. RequestId:ed2b1f76-301f-0070-2624-c3d77b000000 Time:2024-06-20T15:10:58.0165890Z"
at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.completeExecute(AbfsRestOperation.java:234)
at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.lambda$execute$0(AbfsRestOperation.java:194)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.measureDurationOfInvocation(IOStatisticsBinding.java:494)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDurationOfInvocation(IOStatisticsBinding.java:465)
at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:192)
at org.apache.hadoop.fs.azurebfs.services.AbfsClient.read(AbfsClient.java:859)
at org.apache.hadoop.fs.azurebfs.services.AbfsInputStream.readRemote(AbfsInputStream.java:545)
... 30 more
Hello Team,
Databricks supports querying tables with archived data: https://learn.microsoft.com/en-us/azure/databricks/optimizations/archive-delta
Databricks-spark query to view archived files
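The query itself was not captured above; a sketch of what it would look like, assuming the `SHOW ARCHIVED FILES` command described in the Databricks archival-support documentation linked earlier (the catalog/schema path on Databricks may differ from the Trino one used above):

```sql
-- Databricks SQL: list the data files that Delta currently considers
-- archived for this table, per its delta.timeUntilArchived setting
SHOW ARCHIVED FILES FOR delta_prod.dev_30.archive_test1;
```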