hydradatabase / hydra

Hydra: Column-oriented Postgres. Add scalable analytics to your project in minutes.
https://www.hydra.so
Apache License 2.0
2.8k stars 73 forks source link

Fix a server crash #235

Closed japinli closed 7 months ago

japinli commented 7 months ago

When chunk_group_row_limit is bigger than 110000, there is a crash caused by ReadStripeNextVector().

For example:

CREATE TABLE t1 (id int, info text) USING columnar;
CREATE TABLE t2 (id int, info text) USING columnar;
SELECT columnar.alter_columnar_table_set('t1', chunk_group_row_limit => '10000');
SELECT columnar.alter_columnar_table_set('t2', chunk_group_row_limit => '11000');
INSERT INTO t1 SELECT id, md5(id::text) FROM generate_series(1, 10000000) id;
INSERT INTO t2 SELECT id, md5(id::text) FROM generate_series(1, 10000000) id;
[local]:345317 postgres=# SELECT count(*) FROM t1;
  count
----------
 10000000
(1 row)

[local]:345317 postgres=# SELECT count(*) FROM t2;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
: !?>

PostgreSQL version 16.1, hydra columnar dd0ef07209dd.

        if (!ReadChunkGroupNextVector(stripeReadState->chunkGroupReadState,
                                      columnValues, columnNulls, 
                                      stripeReadState->tupleDescriptor,
                                      columnValueOffset, 
                                      newVectorSize,
                                      rowNumber,
                                      chunkFirstRowNumber))
        {
            /* if this chunk group is exhausted, fetch the next one and loop */
            EndChunkGroupRead(stripeReadState->chunkGroupReadState);
            stripeReadState->chunkGroupReadState = NULL;
            stripeReadState->chunkGroupIndex++;

            if (*newVectorSize == 0)
                continue;
        }
        ^----------> here the stripeReadState->chunkGroupReadState might be freed, but used next.

        stripeReadState->currentRow += stripeReadState->chunkGroupReadState->rowCount;
JerrySievert commented 7 months ago

awesome, thanks for the PR!

japinli commented 7 months ago

On Wed, 31 Jan 2024 at 00:45, Jerry Sievert @.***> wrote:

@JerrySievert commented on this pull request.

can you please add a test (in columnar/src/test/regress/sql) for this? I am happy to help you if you need.

Yeah, I will try to add a test. However, when I try to run make check using dd0ef0720, it fails. It seems the first break commit is 9cf098a26.

Here is my environment: $ lsb_release -a No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 22.04.3 LTS Release: 22.04 Codename: jammy

$ pg_config --version PostgreSQL 16.1

$ gcc --version gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 Copyright (C) 2021 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

japinli commented 7 months ago

Here is the regressions test out.

dd0ef0720-regression.diffs.txt 9cf098a26-regression.diffs.txt

japinli commented 7 months ago

@JerrySievert I try to add a test for this crash, however, I cannot run make check locally. Please take a look! Thanks in advance!

japinli commented 7 months ago

Here is the regressions test out.

dd0ef0720-regression.diffs.txt 9cf098a26-regression.diffs.txt

Aha, there is an OOM error on my dev. :(

mkaruza commented 7 months ago

Hi @japinli, thank you for report and PR - it looks correct to me.

I don't see crash with regression on PG15/PG16.

japinli commented 7 months ago

Hi @japinli, thank you for report and PR - it looks correct to me.

I don't see crash with regression on PG15/PG16.

Hi, @mkaruza, thanks for the review.

Yeah, it doesn't crash. The regression fails because there is an out-of-memory, which leads Postgres process being killed.

It seems 8GB RAM isn't enough for running a regression test. I'm not sure about this.