Open mgorbatenko opened 1 month ago
We can definitely make this configurable if there's no good default value. Can you post full stack traces from the failed queries?
@nineinchnick sure! Here is a stack trace for HIVE_CANNOT_OPEN_SPLIT, which was much more common:
io.trino.spi.TrinoException: Error opening Hive split abfs://data@amperitytenantsctxk0.dfs.core.windows.net/tables/3jbpJKqf1URkptz/27C/part-00353-15528228-6432-4a1f-9652-6ca67ab524ec-c000.snappy.parquet (offset=0, length=15253765): Error reading file file: abfs://data@amperitytenantsctxk0.dfs.core.windows.net/tables/3jbpJKqf1URkptz/27C/part-00353-15528228-6432-4a1f-9652-6ca67ab524ec-c000.snappy.parquet
at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:306)
at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:180)
at io.trino.plugin.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:200)
at io.trino.plugin.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:136)
at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:48)
at io.trino.split.PageSourceManager$PageSourceProviderInstance.createPageSource(PageSourceManager.java:79)
at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:265)
at io.trino.operator.Driver.processInternal(Driver.java:403)
at io.trino.operator.Driver.lambda$process$8(Driver.java:306)
at io.trino.operator.Driver.tryWithLock(Driver.java:709)
at io.trino.operator.Driver.process(Driver.java:298)
at io.trino.operator.Driver.processForDuration(Driver.java:269)
at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:890)
at io.trino.execution.executor.timesharing.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:187)
at io.trino.execution.executor.timesharing.TimeSharingTaskExecutor$TaskRunner.run(TimeSharingTaskExecutor.java:565)
at io.trino.$gen.Trino_451____20240801_143631_2.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1570)
Caused by: java.io.IOException: Error reading file file: abfs://data@amperitytenantsctxk0.dfs.core.windows.net/tables/3jbpJKqf1URkptz/27C/part-00353-15528228-6432-4a1f-9652-6ca67ab524ec-c000.snappy.parquet
at io.trino.filesystem.azure.AzureUtils.handleAzureException(AzureUtils.java:37)
at io.trino.filesystem.azure.AzureInput.readTail(AzureInput.java:100)
at io.trino.filesystem.TrinoInput.readTail(TrinoInput.java:43)
at io.trino.filesystem.tracing.TracingInput.lambda$readTail$3(TracingInput.java:81)
at io.trino.filesystem.tracing.Tracing.withTracing(Tracing.java:47)
at io.trino.filesystem.tracing.TracingInput.readTail(TracingInput.java:81)
at io.trino.plugin.hive.parquet.TrinoParquetDataSource.readTailInternal(TrinoParquetDataSource.java:54)
at io.trino.parquet.AbstractParquetDataSource.readTail(AbstractParquetDataSource.java:100)
at io.trino.parquet.reader.MetadataReader.readFooter(MetadataReader.java:100)
at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:226)
... 18 more
Caused by: java.lang.IllegalStateException: Unbalanced enter/exit
at okio.AsyncTimeout.enter(AsyncTimeout.kt:58)
at okio.AsyncTimeout$source$1.read(AsyncTimeout.kt:384)
at okio.RealBufferedSource.indexOf(RealBufferedSource.kt:434)
at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.kt:327)
at okhttp3.internal.http1.HeadersReader.readLine(HeadersReader.kt:29)
at okhttp3.internal.http1.Http1ExchangeCodec.readResponseHeaders(Http1ExchangeCodec.kt:180)
at okhttp3.internal.connection.Exchange.readResponseHeaders(Exchange.kt:110)
at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.kt:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:34)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:95)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:83)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:76)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201)
at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:517)
... 3 more
Suppressed: java.lang.Exception: #block terminated with an error
at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:104)
at reactor.core.publisher.Mono.block(Mono.java:1779)
at com.azure.storage.blob.specialized.BlobClientBase.openInputStream(BlobClientBase.java:393)
at com.azure.storage.blob.specialized.BlobClientBase.openInputStream(BlobClientBase.java:323)
at io.trino.filesystem.azure.AzureInput.readTail(AzureInput.java:95)
at io.trino.filesystem.TrinoInput.readTail(TrinoInput.java:43)
at io.trino.filesystem.tracing.TracingInput.lambda$readTail$3(TracingInput.java:81)
at io.trino.filesystem.tracing.Tracing.withTracing(Tracing.java:47)
at io.trino.filesystem.tracing.TracingInput.readTail(TracingInput.java:81)
at io.trino.plugin.hive.parquet.TrinoParquetDataSource.readTailInternal(TrinoParquetDataSource.java:54)
at io.trino.parquet.AbstractParquetDataSource.readTail(AbstractParquetDataSource.java:100)
at io.trino.parquet.reader.MetadataReader.readFooter(MetadataReader.java:100)
at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:226)
at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:180)
at io.trino.plugin.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:200)
at io.trino.plugin.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:136)
at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:48)
at io.trino.split.PageSourceManager$PageSourceProviderInstance.createPageSource(PageSourceManager.java:79)
at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:265)
at io.trino.operator.Driver.processInternal(Driver.java:403)
at io.trino.operator.Driver.lambda$process$8(Driver.java:306)
at io.trino.operator.Driver.tryWithLock(Driver.java:709)
at io.trino.operator.Driver.process(Driver.java:298)
at io.trino.operator.Driver.processForDuration(Driver.java:269)
at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:890)
at io.trino.execution.executor.timesharing.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:187)
at io.trino.execution.executor.timesharing.TimeSharingTaskExecutor$TaskRunner.run(TimeSharingTaskExecutor.java:565)
at io.trino.$gen.Trino_451____20240801_143631_2.run(Unknown Source)
... 3 more
And here is a stack trace from HIVE_CURSOR_ERROR:
io.trino.spi.TrinoException: Failed to read Parquet file: abfs://data@amperitytenantsctxk0.dfs.core.windows.net/tables/3JfbVaQ7DU4R4ev/zCj/part-02838-e3c55bb7-72a3-4353-83cc-eb81c9d89e0d-c000.snappy.parquet
at io.trino.plugin.hive.parquet.ParquetPageSource.handleException(ParquetPageSource.java:215)
at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.lambda$createPageSource$1(ParquetPageSourceFactory.java:283)
at io.trino.parquet.reader.ParquetBlockFactory$ParquetBlockLoader.load(ParquetBlockFactory.java:75)
at io.trino.spi.block.LazyBlock$LazyData.load(LazyBlock.java:312)
at io.trino.spi.block.LazyBlock$LazyData.getFullyLoadedBlock(LazyBlock.java:291)
at io.trino.spi.block.LazyBlock.getLoadedBlock(LazyBlock.java:186)
at io.trino.spi.Page.getLoadedPage(Page.java:238)
at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:271)
at io.trino.operator.Driver.processInternal(Driver.java:403)
at io.trino.operator.Driver.lambda$process$8(Driver.java:306)
at io.trino.operator.Driver.tryWithLock(Driver.java:709)
at io.trino.operator.Driver.process(Driver.java:298)
at io.trino.operator.Driver.processForDuration(Driver.java:269)
at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:890)
at io.trino.execution.executor.timesharing.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:187)
at io.trino.execution.executor.timesharing.TimeSharingTaskExecutor$TaskRunner.run(TimeSharingTaskExecutor.java:565)
at io.trino.$gen.Trino_451____20240801_212619_2.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1570)
Caused by: java.io.UncheckedIOException: java.io.IOException: Error reading file file: abfs://data@amperitytenantsctxk0.dfs.core.windows.net/tables/3JfbVaQ7DU4R4ev/zCj/part-02838-e3c55bb7-72a3-4353-83cc-eb81c9d89e0d-c000.snappy.parquet
at io.trino.parquet.ChunkReader.readUnchecked(ChunkReader.java:34)
at io.trino.parquet.reader.ChunkedInputStream.readNextChunk(ChunkedInputStream.java:149)
at io.trino.parquet.reader.ChunkedInputStream.read(ChunkedInputStream.java:93)
at shaded.parquet.org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:170)
at shaded.parquet.org.apache.thrift.transport.TTransport.readAll(TTransport.java:100)
at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.readByte(TCompactProtocol.java:632)
at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:532)
at org.apache.parquet.format.InterningProtocol.readFieldBegin(InterningProtocol.java:188)
at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1003)
at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:995)
at org.apache.parquet.format.PageHeader.read(PageHeader.java:870)
at org.apache.parquet.format.Util.read(Util.java:390)
at org.apache.parquet.format.Util.readPageHeader(Util.java:133)
at org.apache.parquet.format.Util.readPageHeader(Util.java:128)
at io.trino.parquet.reader.ParquetColumnChunkIterator.readPageHeader(ParquetColumnChunkIterator.java:116)
at io.trino.parquet.reader.ParquetColumnChunkIterator.next(ParquetColumnChunkIterator.java:83)
at io.trino.parquet.reader.ParquetColumnChunkIterator.next(ParquetColumnChunkIterator.java:42)
at com.google.common.collect.Iterators$PeekingImpl.peek(Iterators.java:1244)
at io.trino.parquet.reader.PageReader.readDictionaryPage(PageReader.java:160)
at io.trino.parquet.reader.AbstractColumnReader.setPageReader(AbstractColumnReader.java:79)
at io.trino.parquet.reader.ParquetReader.readPrimitive(ParquetReader.java:459)
at io.trino.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:554)
at io.trino.parquet.reader.ParquetReader.readBlock(ParquetReader.java:537)
at io.trino.parquet.reader.ParquetReader.lambda$nextPage$3(ParquetReader.java:251)
at io.trino.parquet.reader.ParquetBlockFactory$ParquetBlockLoader.load(ParquetBlockFactory.java:72)
... 17 more
Caused by: java.io.IOException: Error reading file file: abfs://data@amperitytenantsctxk0.dfs.core.windows.net/tables/3JfbVaQ7DU4R4ev/zCj/part-02838-e3c55bb7-72a3-4353-83cc-eb81c9d89e0d-c000.snappy.parquet
at io.trino.filesystem.azure.AzureUtils.handleAzureException(AzureUtils.java:37)
at io.trino.filesystem.azure.AzureInput.readFully(AzureInput.java:77)
at io.trino.filesystem.tracing.TracingInput.lambda$readFully$0(TracingInput.java:53)
at io.trino.filesystem.tracing.Tracing.lambda$withTracing$1(Tracing.java:38)
at io.trino.filesystem.tracing.Tracing.withTracing(Tracing.java:47)
at io.trino.filesystem.tracing.Tracing.withTracing(Tracing.java:37)
at io.trino.filesystem.tracing.TracingInput.readFully(TracingInput.java:53)
at io.trino.plugin.hive.parquet.TrinoParquetDataSource.readInternal(TrinoParquetDataSource.java:64)
at io.trino.parquet.AbstractParquetDataSource.readFully(AbstractParquetDataSource.java:122)
at io.trino.parquet.AbstractParquetDataSource$ReferenceCountedReader.read(AbstractParquetDataSource.java:332)
at io.trino.parquet.ChunkReader.readUnchecked(ChunkReader.java:31)
... 41 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: arraycopy: length -8098 is negative
at java.base/java.lang.System.arraycopy(Native Method)
at kotlin.collections.ArraysKt___ArraysJvmKt.copyInto(_ArraysJvm.kt:892)
at okio.Buffer.read(Buffer.kt:1105)
at okio.Buffer.readFully(Buffer.kt:1096)
at okio.Buffer.readByteArray(Buffer.kt:1091)
at okio.Buffer.readString(Buffer.kt:315)
at okio.Buffer.readUtf8(Buffer.kt:299)
at okhttp3.HttpUrl$Companion.percentDecode$okhttp(HttpUrl.kt:1707)
at okhttp3.HttpUrl$Companion.percentDecode$okhttp$default(HttpUrl.kt:1695)
at okhttp3.HttpUrl$Builder.build(HttpUrl.kt:1180)
at okhttp3.HttpUrl$Companion.get(HttpUrl.kt:1634)
at okhttp3.Request$Builder.url(Request.kt:192)
at com.azure.core.http.okhttp.OkHttpAsyncHttpClient.toOkHttpRequest(OkHttpAsyncHttpClient.java:148)
at com.azure.core.http.okhttp.OkHttpAsyncHttpClient.lambda$send$0(OkHttpAsyncHttpClient.java:109)
at reactor.core.publisher.MonoCallable$MonoCallableSubscription.request(MonoCallable.java:137)
at reactor.core.publisher.LambdaMonoSubscriber.onSubscribe(LambdaMonoSubscriber.java:121)
at reactor.core.publisher.MonoCallable.subscribe(MonoCallable.java:48)
at reactor.core.publisher.Mono.subscribe(Mono.java:4568)
at reactor.core.publisher.Mono.subscribeWith(Mono.java:4634)
at reactor.core.publisher.Mono.subscribe(Mono.java:4534)
at reactor.core.publisher.Mono.subscribe(Mono.java:4470)
at reactor.core.publisher.Mono.subscribe(Mono.java:4442)
at com.azure.core.http.okhttp.OkHttpAsyncHttpClient.lambda$send$2(OkHttpAsyncHttpClient.java:109)
at reactor.core.publisher.MonoCreate$DefaultMonoSink.onRequest(MonoCreate.java:225)
at com.azure.core.http.okhttp.OkHttpAsyncHttpClient.lambda$send$3(OkHttpAsyncHttpClient.java:96)
at reactor.core.publisher.MonoCreate.subscribe(MonoCreate.java:61)
at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:76)
at reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:53)
at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.subscribeNext(MonoIgnoreThen.java:241)
at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onComplete(MonoIgnoreThen.java:204)
at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:155)
at reactor.core.publisher.MonoNext$NextSubscriber.onNext(MonoNext.java:82)
at reactor.core.publisher.SerializedSubscriber.onNext(SerializedSubscriber.java:99)
at reactor.core.publisher.FluxRepeatWhen$RepeatWhenMainSubscriber.onNext(FluxRepeatWhen.java:143)
at reactor.core.publisher.MonoUsing$MonoUsingSubscriber.onNext(MonoUsing.java:231)
at reactor.core.publisher.MonoPeekTerminal$MonoTerminalPeekSubscriber.onNext(MonoPeekTerminal.java:180)
at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:158)
at reactor.core.publisher.MonoMaterialize$MaterializeSubscriber.drain(MonoMaterialize.java:133)
at reactor.core.publisher.MonoMaterialize$MaterializeSubscriber.onComplete(MonoMaterialize.java:127)
at reactor.core.publisher.Operators.complete(Operators.java:137)
at reactor.core.publisher.MonoEmpty.subscribe(MonoEmpty.java:46)
at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:76)
at reactor.core.publisher.MonoUsing.subscribe(MonoUsing.java:102)
at reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:53)
at reactor.core.publisher.MonoFromFluxOperator.subscribe(MonoFromFluxOperator.java:83)
at reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:53)
at reactor.core.publisher.Mono.subscribe(Mono.java:4568)
at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.subscribeNext(MonoIgnoreThen.java:265)
at reactor.core.publisher.MonoIgnoreThen.subscribe(MonoIgnoreThen.java:51)
at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:76)
at reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:53)
at reactor.core.publisher.Mono.subscribe(Mono.java:4568)
at reactor.core.publisher.FluxFlatMap.trySubscribeScalarMap(FluxFlatMap.java:205)
at reactor.core.publisher.MonoFlatMap.subscribeOrReturn(MonoFlatMap.java:53)
at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:63)
at reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:53)
at reactor.core.publisher.Mono.subscribe(Mono.java:4568)
at reactor.core.publisher.FluxFlatMap.trySubscribeScalarMap(FluxFlatMap.java:205)
at reactor.core.publisher.MonoFlatMap.subscribeOrReturn(MonoFlatMap.java:53)
at reactor.core.publisher.Mono.subscribe(Mono.java:4552)
at reactor.core.publisher.Mono.block(Mono.java:1778)
at com.azure.storage.blob.specialized.BlobClientBase.openInputStream(BlobClientBase.java:393)
at com.azure.storage.blob.specialized.BlobClientBase.openInputStream(BlobClientBase.java:323)
at io.trino.filesystem.azure.AzureInput.readFully(AzureInput.java:65)
... 50 more
Suppressed: java.lang.Exception: #block terminated with an error
at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:104)
at reactor.core.publisher.Mono.block(Mono.java:1779)
... 53 more
Ideally we should have automatic backoff and retry (did the legacy FS already do this?) rather than having to tune config. But it's strange that the error stack traces don't show anything related to throttling.
I don't see anything relevant to throttling or retries in those stack traces.
I recently gained a deep understanding of how the AWS SDK does retries. Now I'm trying to find some good references for the Azure SDK, but so far I'm not seeing anything useful.
The Other details section in the overview page mentions this:
We're currently updating the Azure SDK for Java libraries to share common cloud patterns such as authentication protocols, logging, tracing, transport protocols, buffered responses, and retries.
Retries are only briefly mentioned on the HTTP clients and pipelines page, but again, without much detail.
The GitHub Wiki is not useful either, nor are the READMEs.
Adding more config options should be the last resort, so I'm trying to do the due diligence first. Maybe someone else would have some leads.
I'll go and read about retry policies and throttling here: https://hadoop.apache.org/docs/stable/hadoop-azure/abfs.html#IO_Options
We don't configure any retry policy, but according to the SDK sources, the defaults are:
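For reference, retry behavior in the Azure Storage SDK for Java is driven by `RequestRetryOptions`, which can be passed to the client builder. A minimal sketch of configuring it explicitly — the numeric values and the endpoint are illustrative placeholders, not necessarily the SDK's shipped defaults (check the version you're on):

```java
import com.azure.storage.blob.BlobServiceClientBuilder;
import com.azure.storage.common.policy.RequestRetryOptions;
import com.azure.storage.common.policy.RetryPolicyType;

public class RetryConfigSketch {
    public static void main(String[] args) {
        // All numeric values here are illustrative, not authoritative defaults.
        RequestRetryOptions retryOptions = new RequestRetryOptions(
                RetryPolicyType.EXPONENTIAL, // exponential backoff between tries
                4,        // maxTries
                60,       // tryTimeoutInSeconds (per-attempt timeout)
                4_000L,   // retryDelayInMs (initial delay)
                120_000L, // maxRetryDelayInMs (cap on the delay)
                null);    // secondaryHost (none)

        BlobServiceClientBuilder builder = new BlobServiceClientBuilder()
                .endpoint("https://account.blob.core.windows.net") // placeholder
                .retryOptions(retryOptions);
    }
}
```

If Trino never constructs a `RequestRetryOptions`, whatever the SDK's internal defaults are is what applies.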
The legacy FS seems to have implemented some custom client-side throttling. Before we attempt to replicate this in native FS, I'd need some more details on how the issue manifests.
Until then, maybe we should add a config option for the max number of connections. @electrum WDYT?
Adding a config for max connections sounds good. I think the Hadoop implementation is a completely custom client, originally contributed by Microsoft, so it's not exactly "client-side" in that it doesn't use the Azure SDK at all. But overall it is very different from the Azure SDK, so we might need different settings to match its behavior.
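For context, with the Azure SDK's OkHttp transport a concurrent-connection cap would be wired through OkHttp's `Dispatcher`. A hedged sketch — the limit value and any config property name it might map to are hypothetical, not existing Trino config:

```java
import com.azure.core.http.HttpClient;
import com.azure.core.http.okhttp.OkHttpAsyncHttpClientBuilder;
import okhttp3.Dispatcher;
import okhttp3.OkHttpClient;

public class MaxConnectionsSketch {
    public static void main(String[] args) {
        int maxRequestsPerHost = 16; // hypothetical user-configurable value

        Dispatcher dispatcher = new Dispatcher();
        dispatcher.setMaxRequestsPerHost(maxRequestsPerHost); // per-host cap
        dispatcher.setMaxRequests(64);                        // overall cap

        OkHttpClient okHttp = new OkHttpClient.Builder()
                .dispatcher(dispatcher)
                .build();

        // Hand the tuned OkHttp client to the Azure SDK transport.
        HttpClient azureHttpClient = new OkHttpAsyncHttpClientBuilder(okHttp).build();
    }
}
```

OkHttp's own default for `maxRequestsPerHost` is low (5), which matches the per-host jump described in this issue when the limit was raised.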
Ideally we should have automatic backoff and retry (did the legacy FS already do this ?) rather than having to tune config.
If by "legacy connector" you mean the "hadoop abfs connector", then the answer is: of course we did, including:
- backoff with jitter
- statistics collection
- rate limiting certain operations to avoid creating the problem
- pre-emptive load reduction based on recent operations
You should look at our code for handling 100-continue responses in particular.
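The first bullet above (backoff with jitter) is commonly implemented as "full jitter": sleep a uniformly random duration between zero and an exponentially growing ceiling. A generic stdlib sketch of that idea — not the actual Hadoop ABFS code:

```java
import java.util.concurrent.ThreadLocalRandom;

public class FullJitterBackoff {
    /**
     * Full-jitter delay: uniform in [0, min(cap, base * 2^attempt)].
     * Jitter spreads retries out so that throttled clients don't all
     * hammer the service again at the same instant.
     */
    public static long delayMillis(int attempt, long baseMillis, long capMillis) {
        // Clamp the shift so the multiplier can't overflow a long.
        long ceiling = Math.min(capMillis, baseMillis * (1L << Math.min(attempt, 20)));
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }
}
```

A retry loop would call `delayMillis(attempt, base, cap)` after each throttled response and sleep for the result before the next try.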
This would all be really helpful (likely necessary) if it's included in the native FS, especially if the plan is to EOL the legacy FS support by end of year (I think I read that somewhere on Slack).
Appreciate the support here.
I think that the Hadoop implementation is a completely custom client, originally contributed by Microsoft
cc @Riddle4045 @georgewfisher do you maybe have extra insights with regards to the Azure Hadoop implementation?
Earlier this week, we updated a production Trino cluster from version 451 to 453. When the cluster experienced some heavy load, we noticed a significant number of query failures (thousands). Queries would fail intermittently with something along the lines of HIVE_CURSOR_ERROR: Failed to read Parquet file ... or HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split .... Query failures were random, and would sometimes fail again or succeed on subsequent retries.
This cluster uses Azure native filesystem support, and we noticed we were getting many throttling and network errors on our underlying storage accounts. Rolling the version back to 451 solved this issue. We have reason to believe this PR is the culprit. With our nodes, this increased the number of concurrent connections per host from 5 to 61. With a large number of nodes, this is a huge jump.
Increasing the number of concurrent connections to Azure is likely to increase performance but it's at risk of causing throttling issues when there are many concurrent tasks and queries running.
Is this something we could make configurable by users? Is it worth considering how errors from Azure are retried?
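On the retry question: Azure Storage signals throttling with HTTP 429 (Too Many Requests) and 503 (Server Busy), so a retry layer would at minimum classify those as retryable with backoff, separately from hard failures. A minimal stdlib sketch — the exact classification set is my assumption of a reasonable starting point, not what the SDK or Trino does:

```java
public class RetryClassifier {
    /**
     * Status codes worth retrying with backoff. 429 and 503 are the
     * Azure Storage throttling signals; 408 and 500 cover transient
     * timeouts and server-side errors.
     */
    public static boolean isRetryable(int httpStatus) {
        return httpStatus == 429   // Too Many Requests (throttled)
            || httpStatus == 503   // Server Busy / unavailable
            || httpStatus == 500   // transient server error
            || httpStatus == 408;  // Request Timeout
    }
}
```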