simonthelin-synth opened this issue 2 weeks ago
Can you post the full stacktrace?
I've also seen this error; here's the stacktrace:
{
"type": "java.lang.ArrayIndexOutOfBoundsException",
"message": "arraycopy: last destination index 2147483648 out of bounds for byte[2147483637]",
"suppressed": [],
"stack": [
"io.trino.spi.block.VariableWidthBlockBuilder.writeEntry(VariableWidthBlockBuilder.java:104)",
"io.trino.spi.type.AbstractVariableWidthType$DefaultReadOperators.readFlatToBlock(AbstractVariableWidthType.java:175)",
"io.trino.operator.aggregation.arrayagg.FlatArrayBuilder.write(FlatArrayBuilder.java:249)",
"io.trino.operator.aggregation.arrayagg.FlatArrayBuilder.write(FlatArrayBuilder.java:228)",
"io.trino.operator.aggregation.arrayagg.GroupArrayAggregationState.writeAll(GroupArrayAggregationState.java:95)",
"io.trino.spi.block.ArrayBlockBuilder.buildEntry(ArrayBlockBuilder.java:118)",
"io.trino.operator.aggregation.arrayagg.ArrayAggregationFunction.output(ArrayAggregationFunction.java:66)",
"io.trino.$gen.array_aggGroupedAccumulator_20240923_064041_876.evaluateFinal(Unknown Source)",
"io.trino.operator.aggregation.GroupedAggregator.evaluate(GroupedAggregator.java:108)",
"io.trino.operator.aggregation.builder.InMemoryHashAggregationBuilder.lambda$buildResult$2(InMemoryHashAggregationBuilder.java:282)",
"io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:423)",
"io.trino.operator.HashAggregationOperator.getOutput(HashAggregationOperator.java:511)",
"io.trino.operator.Driver.processInternal(Driver.java:400)",
"io.trino.operator.Driver.lambda$process$8(Driver.java:303)",
"io.trino.operator.Driver.tryWithLock(Driver.java:706)",
"io.trino.operator.Driver.process(Driver.java:295)",
"io.trino.operator.Driver.processForDuration(Driver.java:266)",
"io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:891)",
"io.trino.execution.executor.timesharing.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:189)",
"io.trino.execution.executor.timesharing.TimeSharingTaskExecutor$TaskRunner.run(TimeSharingTaskExecutor.java:651)",
"io.trino.$gen.Trino_455_galaxy_1_u25_g8caca7c6b38____20240923_054632_2.run(Unknown Source)",
"java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)",
"java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)",
"java.base/java.lang.Thread.run(Thread.java:1570)"
],
"errorCode": {
"code": 65536,
"name": "GENERIC_INTERNAL_ERROR",
"type": "INTERNAL_ERROR",
"fatal": false
}
}
In the same query, I've also seen this error, which I believe is related:
{
"type": "java.lang.IllegalArgumentException",
"message": "Invalid currentSize or minimumSize",
"suppressed": [],
"stack": [
"io.trino.spi.block.BlockUtil.calculateNewArraySize(BlockUtil.java:90)",
"io.trino.spi.block.VariableWidthBlockBuilder.ensureFreeSpace(VariableWidthBlockBuilder.java:406)",
"io.trino.spi.block.VariableWidthBlockBuilder.writeEntry(VariableWidthBlockBuilder.java:103)",
"io.trino.spi.type.AbstractVariableWidthType$DefaultReadOperators.readFlatToBlock(AbstractVariableWidthType.java:175)",
"io.trino.operator.aggregation.arrayagg.FlatArrayBuilder.write(FlatArrayBuilder.java:249)",
"io.trino.operator.aggregation.arrayagg.FlatArrayBuilder.write(FlatArrayBuilder.java:228)",
"io.trino.operator.aggregation.arrayagg.GroupArrayAggregationState.writeAll(GroupArrayAggregationState.java:95)",
"io.trino.spi.block.ArrayBlockBuilder.buildEntry(ArrayBlockBuilder.java:118)",
"io.trino.operator.aggregation.arrayagg.ArrayAggregationFunction.output(ArrayAggregationFunction.java:66)",
"io.trino.$gen.array_aggGroupedAccumulator_20240923_064041_876.evaluateFinal(Unknown Source)",
"io.trino.operator.aggregation.GroupedAggregator.evaluate(GroupedAggregator.java:108)",
"io.trino.operator.aggregation.builder.InMemoryHashAggregationBuilder.lambda$buildResult$2(InMemoryHashAggregationBuilder.java:282)",
"io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:423)",
"io.trino.operator.HashAggregationOperator.getOutput(HashAggregationOperator.java:511)",
"io.trino.operator.Driver.processInternal(Driver.java:400)",
"io.trino.operator.Driver.lambda$process$8(Driver.java:303)",
"io.trino.operator.Driver.tryWithLock(Driver.java:706)",
"io.trino.operator.Driver.process(Driver.java:295)",
"io.trino.operator.Driver.processForDuration(Driver.java:266)",
"io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:891)",
"io.trino.execution.executor.timesharing.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:189)",
"io.trino.execution.executor.timesharing.TimeSharingTaskExecutor$TaskRunner.run(TimeSharingTaskExecutor.java:651)",
"io.trino.$gen.Trino_455_galaxy_1_u25_g8caca7c6b38____20240923_054632_2.run(Unknown Source)",
"java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)",
"java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)",
"java.base/java.lang.Thread.run(Thread.java:1570)"
],
"errorCode": {
"code": 65536,
"name": "GENERIC_INTERNAL_ERROR",
"type": "INTERNAL_ERROR",
"fatal": false
}
}
I think the root cause here is that the query is attempting to load an entire table into a single array:
SELECT
    'value' AS attribute_type,
    ARRAY_AGG(DISTINCT value) AS distinct_attributes
FROM int__distinct_attributes
The array is simply too large for the system, and the system produces a poor error message when this happens. Currently, it is not possible to create an array whose values total more than 2147483648 bytes.
I can improve the error message, but I don't think this query will work.
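For anyone curious why the cap sits there: Java array lengths are int-valued, so a builder backed by a single byte[] can never address more than Integer.MAX_VALUE bytes. A minimal plain-Java sketch (not Trino code) reproduces the same failure mode as the stacktrace above, just with tiny numbers:

public class ArrayCopyLimitSketch
{
    public static void main(String[] args)
    {
        // Array lengths in Java are ints, so ~2147483647 bytes is the hard cap
        // on the single byte[] backing a variable-width block builder.
        byte[] destination = new byte[16];
        byte[] source = new byte[8];

        // Copying past the end of the destination fails the same way
        // VariableWidthBlockBuilder.writeEntry did above, e.g.
        // "arraycopy: last destination index 20 out of bounds for byte[16]"
        System.arraycopy(source, 0, destination, 12, source.length);
    }
}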
It appears that your goal with the query is to create a checksum of the distinct values. BTW, the current query will not produce a stable checksum, because the values in the array are not ordered. Instead, I suggest you use the checksum aggregation function, which would look like this:
WITH _int__list_attributes AS (
    SELECT
        'value' AS attribute_type,
        checksum(DISTINCT value) AS distinct_attributes_checksum
    FROM int__distinct_attributes
),
final AS (
    SELECT
        attribute_type,
        distinct_attributes_checksum,
        FROM_BIG_ENDIAN_64(
            distinct_attributes_checksum
        ) AS id
    FROM _int__list_attributes
)
SELECT * FROM final
@dain if we switch from byte arrays to Slices (FFM based) we can allocate arrays bigger than 2 GB right?
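For context, here is a minimal sketch of what that would allow, assuming the FFM API finalized in Java 22 (illustration only, not a proposed Trino change): MemorySegment is addressed with long offsets, so a single native allocation is not capped at 2 GB the way a heap byte[] is. Running it needs roughly 3 GB of free native memory.

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class LargeSegmentSketch
{
    public static void main(String[] args)
    {
        long size = 3L * 1024 * 1024 * 1024; // 3 GB: impossible for a single byte[]
        try (Arena arena = Arena.ofConfined()) {
            // Off-heap allocation, long-indexed, released when the arena closes
            MemorySegment segment = arena.allocate(size);
            segment.set(ValueLayout.JAVA_BYTE, size - 1, (byte) 1); // write past the 2 GB mark
            System.out.println(segment.byteSize()); // 3221225472
        }
    }
}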
Hello.
Thank you for the reply and for explaining the underlying issue, which makes sense.
I can see if I can get a stacktrace, but it looks identical to the other one posted.
In this use case, there is a table of 50+ columns.
Each column has distinct values that need to be collected into their own lists, to be handed off to researchers who want the data prepared in this way.
The idea is also that there is a unique id for each distinct set of values.
I did realise that this is probably a problem, and I can work around it by producing multiple smaller lists of distinct values from the same column for them to work with.
So it is not for an analytical use case downstream.
If it can be solved it would be pretty neat, but I understand the complexity of this and can work around it as well.
@wendigo that will introduce a different set of issues, which we should discuss in a non-text-based channel. I think the better solution here is to change array (and map) to contain a list of blocks instead of a single block. I read through the code a couple of weeks ago and I think it is a fairly straightforward change. It also has the side benefit that we can start to limit the number of positions in a block to, say, 10k or 100k, which makes a bunch of execution stuff easier. The downside to this is that rows become "infinitely large", so serializing a row will cause breakages in other parts of the code.
As a side note, I think the same trick can be applied to VariableWidthBlock, so we can have larger string values without breaking too much.
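To make the list-of-chunks idea concrete, here is a rough conceptual sketch in plain Java; the class and constants are made up for illustration and are not the Trino SPI (the real change would also need to carry positions, offsets, and null flags). Values are appended into fixed-capacity chunks instead of one contiguous byte[], so the total written size is tracked as a long and is no longer bounded by Integer.MAX_VALUE.

import java.util.ArrayList;
import java.util.List;

public class ChunkedVariableWidthSketch
{
    private static final int CHUNK_BYTES = 128 * 1024 * 1024; // assumed chunk size

    private final List<byte[]> chunks = new ArrayList<>();
    private int currentOffset = CHUNK_BYTES; // forces a chunk allocation on the first write
    private long writtenBytes;

    void writeEntry(byte[] value)
    {
        if (value.length > CHUNK_BYTES) {
            throw new IllegalArgumentException("single value larger than one chunk");
        }
        if (currentOffset + value.length > CHUNK_BYTES) {
            chunks.add(new byte[CHUNK_BYTES]); // start a new chunk instead of growing one array
            currentOffset = 0;
        }
        System.arraycopy(value, 0, chunks.get(chunks.size() - 1), currentOffset, value.length);
        currentOffset += value.length;
        writtenBytes += value.length;
    }

    long writtenBytes()
    {
        return writtenBytes; // long-valued: can exceed 2 GB because no single chunk holds it all
    }

    public static void main(String[] args)
    {
        ChunkedVariableWidthSketch builder = new ChunkedVariableWidthSketch();
        byte[] value = new byte[1024];
        for (int i = 0; i < 10_000; i++) {
            builder.writeEntry(value);
        }
        System.out.println(builder.writtenBytes()); // 10240000
    }
}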
ArrayIndexOutOfBoundsException Issue
I am currently facing this error when running a large ARRAY_AGG with the Delta connector; any ideas how to resolve this?
Non-distinct values: 25398768. Distinct values: 18637528 to aggregate.
Actual SQL