trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

ArrayIndexOutOfBoundsException Issue #23799

Open simonthelin-synth opened 2 weeks ago

simonthelin-synth commented 2 weeks ago

ArrayIndexOutOfBoundsException Issue

I am currently facing this error:

java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 2147483812 out of bounds for byte[2147483637]

Actual SQL

WITH

_int__list_attributes AS (
    SELECT
        'value' AS attribute_type,
        ARRAY_AGG(DISTINCT value) AS distinct_attributes
    FROM int__distinct_attributes
),

final AS (
    SELECT
        distinct_attributes,
        attribute_type,
        FROM_BIG_ENDIAN_64(
            XXHASH64(
                CAST(
                    ARRAY_JOIN(
                        distinct_attributes, ','
                    ) AS VARBINARY
                )
            )
        ) AS id
    FROM _int__list_attributes
)

SELECT * FROM final
martint commented 2 weeks ago

Can you post the full stacktrace?

dekimir commented 2 weeks ago

I've also seen this error; here's the stacktrace:

{
        "type": "java.lang.ArrayIndexOutOfBoundsException",
        "message": "arraycopy: last destination index 2147483648 out of bounds for byte[2147483637]",
        "suppressed": [],
        "stack": [
          "io.trino.spi.block.VariableWidthBlockBuilder.writeEntry(VariableWidthBlockBuilder.java:104)",
          "io.trino.spi.type.AbstractVariableWidthType$DefaultReadOperators.readFlatToBlock(AbstractVariableWidthType.java:175)",
          "io.trino.operator.aggregation.arrayagg.FlatArrayBuilder.write(FlatArrayBuilder.java:249)",
          "io.trino.operator.aggregation.arrayagg.FlatArrayBuilder.write(FlatArrayBuilder.java:228)",
          "io.trino.operator.aggregation.arrayagg.GroupArrayAggregationState.writeAll(GroupArrayAggregationState.java:95)",
          "io.trino.spi.block.ArrayBlockBuilder.buildEntry(ArrayBlockBuilder.java:118)",
          "io.trino.operator.aggregation.arrayagg.ArrayAggregationFunction.output(ArrayAggregationFunction.java:66)",
          "io.trino.$gen.array_aggGroupedAccumulator_20240923_064041_876.evaluateFinal(Unknown Source)",
          "io.trino.operator.aggregation.GroupedAggregator.evaluate(GroupedAggregator.java:108)",
          "io.trino.operator.aggregation.builder.InMemoryHashAggregationBuilder.lambda$buildResult$2(InMemoryHashAggregationBuilder.java:282)",
          "io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:423)",
          "io.trino.operator.HashAggregationOperator.getOutput(HashAggregationOperator.java:511)",
          "io.trino.operator.Driver.processInternal(Driver.java:400)",
          "io.trino.operator.Driver.lambda$process$8(Driver.java:303)",
          "io.trino.operator.Driver.tryWithLock(Driver.java:706)",
          "io.trino.operator.Driver.process(Driver.java:295)",
          "io.trino.operator.Driver.processForDuration(Driver.java:266)",
          "io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:891)",
          "io.trino.execution.executor.timesharing.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:189)",
          "io.trino.execution.executor.timesharing.TimeSharingTaskExecutor$TaskRunner.run(TimeSharingTaskExecutor.java:651)",
          "io.trino.$gen.Trino_455_galaxy_1_u25_g8caca7c6b38____20240923_054632_2.run(Unknown Source)",
          "java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)",
          "java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)",
          "java.base/java.lang.Thread.run(Thread.java:1570)"
        ],
        "errorCode": {
          "code": 65536,
          "name": "GENERIC_INTERNAL_ERROR",
          "type": "INTERNAL_ERROR",
          "fatal": false
        }
}

In the same query, I've also seen this error, which I believe is related:

{
        "type": "java.lang.IllegalArgumentException",
        "message": "Invalid currentSize or minimumSize",
        "suppressed": [],
        "stack": [
          "io.trino.spi.block.BlockUtil.calculateNewArraySize(BlockUtil.java:90)",
          "io.trino.spi.block.VariableWidthBlockBuilder.ensureFreeSpace(VariableWidthBlockBuilder.java:406)",
          "io.trino.spi.block.VariableWidthBlockBuilder.writeEntry(VariableWidthBlockBuilder.java:103)",
          "io.trino.spi.type.AbstractVariableWidthType$DefaultReadOperators.readFlatToBlock(AbstractVariableWidthType.java:175)",
          "io.trino.operator.aggregation.arrayagg.FlatArrayBuilder.write(FlatArrayBuilder.java:249)",
          "io.trino.operator.aggregation.arrayagg.FlatArrayBuilder.write(FlatArrayBuilder.java:228)",
          "io.trino.operator.aggregation.arrayagg.GroupArrayAggregationState.writeAll(GroupArrayAggregationState.java:95)",
          "io.trino.spi.block.ArrayBlockBuilder.buildEntry(ArrayBlockBuilder.java:118)",
          "io.trino.operator.aggregation.arrayagg.ArrayAggregationFunction.output(ArrayAggregationFunction.java:66)",
          "io.trino.$gen.array_aggGroupedAccumulator_20240923_064041_876.evaluateFinal(Unknown Source)",
          "io.trino.operator.aggregation.GroupedAggregator.evaluate(GroupedAggregator.java:108)",
          "io.trino.operator.aggregation.builder.InMemoryHashAggregationBuilder.lambda$buildResult$2(InMemoryHashAggregationBuilder.java:282)",
          "io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:423)",
          "io.trino.operator.HashAggregationOperator.getOutput(HashAggregationOperator.java:511)",
          "io.trino.operator.Driver.processInternal(Driver.java:400)",
          "io.trino.operator.Driver.lambda$process$8(Driver.java:303)",
          "io.trino.operator.Driver.tryWithLock(Driver.java:706)",
          "io.trino.operator.Driver.process(Driver.java:295)",
          "io.trino.operator.Driver.processForDuration(Driver.java:266)",
          "io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:891)",
          "io.trino.execution.executor.timesharing.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:189)",
          "io.trino.execution.executor.timesharing.TimeSharingTaskExecutor$TaskRunner.run(TimeSharingTaskExecutor.java:651)",
          "io.trino.$gen.Trino_455_galaxy_1_u25_g8caca7c6b38____20240923_054632_2.run(Unknown Source)",
          "java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)",
          "java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)",
          "java.base/java.lang.Thread.run(Thread.java:1570)"
        ],
        "errorCode": {
          "code": 65536,
          "name": "GENERIC_INTERNAL_ERROR",
          "type": "INTERNAL_ERROR",
          "fatal": false
        }
}
dain commented 2 weeks ago

I think the root cause here is that the query is attempting to load an entire table into a single array:

    SELECT
        'value' AS attribute_type,
        ARRAY_AGG(DISTINCT value) AS distinct_attributes
    FROM int__distinct_attributes

The array is simply too large for the system, and the system produces poor error messages when this happens. Currently, it is not possible to create an array whose values total more than 2147483648 bytes.
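
For context, here is a scaled-down sketch (not Trino code) of the limit being hit: Java arrays are indexed by int, so the byte[] backing a VariableWidthBlockBuilder cannot grow past roughly 2^31 - 1 bytes, and the System.arraycopy that would write past that bound fails exactly like the stack trace above. The sizes below are tiny stand-ins for the reported ones.

public class ArrayCopyLimitDemo
{
    public static void main(String[] args)
    {
        // Stand-ins for the real sizes: the builder's buffer is nearly full
        // and the next value no longer fits (compare byte[2147483637] plus a
        // ~175-byte write in the first reported error).
        byte[] destination = new byte[2_147];
        byte[] nextValue = new byte[175];

        int writeOffset = destination.length - 100; // only 100 bytes left
        // last destination index = writeOffset + 175 > destination.length
        // -> ArrayIndexOutOfBoundsException: arraycopy: last destination index ... out of bounds
        System.arraycopy(nextValue, 0, destination, writeOffset, nextValue.length);
    }
}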

I can improve the error message, but I don't think this query will work.

It appears that your goal with the query is to create a checksum of the distinct values. BTW, the current query will not produce a stable checksum because the values in the array are not ordered. I suggest you use the checksum aggregation function instead, which would look like this:

_int__list_attributes AS (
    SELECT
        'value' AS attribute_type,
        checksum(DISTINCT value) AS distinct_attributes_checksum
    FROM int__distinct_attributes
),

final AS (
    SELECT
        attribute_type,
        FROM_BIG_ENDIAN_64(
            distinct_attributes_checksum
        ) AS id
    FROM _int__list_attributes
)

SELECT * FROM final
wendigo commented 2 weeks ago

@dain if we switch from byte arrays to Slices (FFM based) we can allocate arrays bigger than 2 GB right?
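
For reference, a rough sketch of what that would buy, using the java.lang.foreign (FFM) API available in recent JDKs; this only illustrates that memory segments are addressed with long offsets and so are not bound by the byte[] ceiling, not how Slice would actually adopt them. It needs about 3 GB of native memory to run.

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class LargeSegmentSketch
{
    public static void main(String[] args)
    {
        long threeGiB = 3L * 1024 * 1024 * 1024;
        try (Arena arena = Arena.ofConfined()) {
            // A single native allocation larger than any byte[] could be
            MemorySegment segment = arena.allocate(threeGiB);
            // Offsets are longs, so the last byte is directly addressable
            segment.set(ValueLayout.JAVA_BYTE, threeGiB - 1, (byte) 42);
            System.out.println(segment.byteSize()); // 3221225472
        }
    }
}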

simonthelin-synth commented 2 weeks ago

Hello.

Thank you for the reply and for explaining the underlying issue, which makes sense.

I can see if I can get a stacktrace, but it looks nearly identical to the one already posted.

In this use case, there is a table of 50+ columns.

Each column's distinct values need to be collected into their own list, to be handed to researchers who want the data prepared in this way.

The idea is also to have a unique id for each distinct set of values.

I did realise that this is probably a problem, and I can work around it by splitting the distinct values of a column across multiple lists for them to work with.

So it is not for an analytical use case downstream.

It would be pretty neat if this could be solved, but I understand the complexity and can work around it as well.

dain commented 3 days ago

@dain if we switch from byte arrays to Slices (FFM based) we can allocate arrays bigger than 2 GB right?

@wendigo that will introduce a different set of issues, which we should discuss in a non-text-based channel. I think the better solution here is to change array (and map) to contain a list of blocks instead of a single block. I read through the code a couple of weeks ago, and I think it is a fairly straightforward change. It also has the side benefit that we can start to limit the number of positions in a block to, say, 10k or 100k, which makes a bunch of execution stuff easier. The downside is that rows become "infinitely large", so serializing a row will cause breakages in other parts of the code.

As a side note, I think the same trick can be applied to VariableWidthBlock so we can have larger string values without breaking too much.
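
To make the shape of the "list of blocks" idea concrete, here is a purely hypothetical sketch (none of these types exist in Trino): values get appended into position-capped chunks instead of one ever-growing buffer, so no single backing array has to approach the 2 GB limit. The real change would live in the Block SPI; this only shows the rollover logic.

import java.util.ArrayList;
import java.util.List;

final class ChunkedValuesBuilder
{
    // cap in the spirit of "10k or 100k positions per block"
    private static final int MAX_POSITIONS_PER_CHUNK = 10_000;

    private final List<List<byte[]>> chunks = new ArrayList<>();
    private List<byte[]> current = newChunk();

    void append(byte[] value)
    {
        if (current.size() == MAX_POSITIONS_PER_CHUNK) {
            // roll over to a fresh chunk rather than growing a single buffer
            current = newChunk();
        }
        current.add(value);
    }

    int chunkCount()
    {
        return chunks.size();
    }

    private List<byte[]> newChunk()
    {
        List<byte[]> chunk = new ArrayList<>();
        chunks.add(chunk);
        return chunk;
    }
}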