heterodb / pg-strom

PG-Strom - Master development repository
http://heterodb.github.io/pg-strom/

[XX000] ERROR: unknown alignment #747

Closed ValentinChirikov closed 2 months ago

ValentinChirikov commented 3 months ago

Hello, I've hit the PostgreSQL error "[XX000] ERROR: unknown alignment" while running an aggregate (sum) query against a table with numeric fields. If I set pg_strom.enable_numeric_aggfuncs=off or set jit = off, the query executes with no errors. The PG-Strom version is 5.0.4, built from source. Could you point out where I've made a mistake? Attached: pgstrom_gpu_device_info.json

Update: it seems the problem is not with numerics. I can reproduce it on a table without numeric columns, with 40M rows.

ValentinChirikov commented 3 months ago

I just checked the master branch and got the same error output: [2024-04-03 13:37:52] [XX000] ERROR: unknown alignment

Sorry, I forgot to mention: PostgreSQL is v15.5, after pg_upgrade from v12.

ValentinChirikov commented 3 months ago

I did some research: the error comes from postgresql-15.5\src\backend\jit\llvm\llvmjit_deform.c, line 506.

        /* determine required alignment */
        if (att->attalign == TYPALIGN_INT)
            alignto = ALIGNOF_INT;
        else if (att->attalign == TYPALIGN_CHAR)
            alignto = 1;
        else if (att->attalign == TYPALIGN_DOUBLE)
            alignto = ALIGNOF_DOUBLE;
        else if (att->attalign == TYPALIGN_SHORT)
            alignto = ALIGNOF_SHORT;
        else
        {
            /* original: elog(ERROR, "unknown alignment"); */
            /* line 506: quick, dirty patch to check what comes in; the result is empty */
            elog(ERROR, "unknown alignment %c", att->attalign);
            alignto = 0;
        }

ValentinChirikov commented 3 months ago

@kaigai could you please check if you could reproduce the problem with that test:

#from dtype_numeric.sql

INSERT INTO rt_numeric (
  SELECT x, pgstrom.random_int(1,   -20000,   20000),
            pgstrom.random_int(1,  -200000,  200000),
            pgstrom.random_int(1, -2000000, 2000000),
            pgstrom.random_float(1,   -3200.0,   3200.0),
            pgstrom.random_float(1,  -32000.0,  32000.0),
            pgstrom.random_float(1, -320000.0, 320000.0),
            pgstrom.random_float(1,    -20000,    20000)::numeric,
            pgstrom.random_int(1, -2000000000,
                                   2000000000)::numeric / 1000::numeric,
            pgstrom.random_int(1, -2000000000,
                                   2000000000)::numeric / 1000::numeric
    FROM generate_series(1,20000000) x);

explain select sum(x), sum(y), sum(z) from rt_numeric;

In my environment it fails only on large tables, from about 20M rows and up.

ValentinChirikov commented 2 months ago

While debugging, I realized that the problem is that TupleDesc.natts is less than the natts passed to slot_compile_deform(LLVMJitContext *context, TupleDesc desc, const TupleTableSlotOps *ops, int natts), so the code addresses nonexistent attributes. @kaigai, could this be the reason: https://www.postgresql.org/message-id/flat/CAJRYxu%2B3wqXCuyGtgYwGbsZt1CYA7mcXJJPUwXih-1n5LKA6Qw%40mail.gmail.com ?

ValentinChirikov commented 2 months ago

The problem can be worked around with this PostgreSQL patch:

--- postgresql-15-15.6.orig/src/backend/jit/llvm/llvmjit_expr.c
+++ postgresql-15-15.6/src/backend/jit/llvm/llvmjit_expr.c
@@ -351,7 +351,7 @@ llvm_compile_expr(ExprState *state)
                                         * function specific to tupledesc and the exact number of
                                         * to-be-extracted attributes.
                                         */
-                                       if (tts_ops && desc && (context->base.flags & PGJIT_DEFORM))
+                                       if (tts_ops && desc && desc->natts >= op->d.fetch.last_var && (context->base.flags & PGJIT_DEFORM))
                                        {
                                                l_jit_deform =
                                                        slot_compile_deform(context, desc,

kaigai commented 2 months ago

Hello,

Is this trouble related to PG-Strom? (In other words, can you reproduce the problem with pg_strom.enabled = off?)

In general, TupleDesc.natts can differ from t_infomask2 & HEAP_NATTS_MASK of a tuple, but that is a corner case. If the JIT code does not handle that scenario, it should be reported to pgsql-hackers. (I'm not a committer of PostgreSQL itself.)

ValentinChirikov commented 2 months ago

Hello @kaigai, it is not reproducible with pg_strom.enabled = off.

kaigai commented 2 months ago

Hmm. Can you show me the EXPLAIN VERBOSE output of your query with PG-Strom enabled (preferably on the latest commit of the git master)? If the PostgreSQL JIT routine consumes the result of PG-Strom, this could potentially happen.

ValentinChirikov commented 2 months ago

Hello @kaigai still reproducible

table description:

postgres=# \d rt_native ;
                  Table "public.rt_native"
 Column |       Type       | Collation | Nullable | Default
--------+------------------+-----------+----------+---------
 id     | integer          |           |          |
 a      | smallint         |           |          |
 b      | integer          |           |          |
 c      | bigint           |           |          |
 d      | float2           |           |          |
 e      | real             |           |          |
 f      | double precision |           |          |
 p      | int1             |           |          |

latest build

2024-04-08 12:25:33.132 +03 [857498] LOG:  PG-Strom version 5.0.4 built for PostgreSQL 15 (githash: 28e37285ecf25132b85537b18c78c67620e3d4a7)
2024-04-08 12:25:33.906 +03 [857498] LOG:  PG-Strom binary built for CUDA 12.4 (CUDA runtime 12.4, nvidia kmod: 550.54.14)
2024-04-08 12:25:33.906 +03 [857498] LOG:  PG-Strom: GPU0 NVIDIA T400 (6 SMs; 1425MHz, L2 512kB), RAM 1861MB (64bits, 4.77GHz), PCI-E Bar1 0MB, CC 7.5

psql (15.6 (Debian 15.6-0+deb12u1))
Type "help" for help.

postgres=# show pg_strom.enabled;
 pg_strom.enabled
------------------
 on
(1 row)

postgres=# explain analyze select sum(f) from rt_native;
ERROR:  unknown alignment
postgres=#

with pg-strom disabled

postgres=# set pg_strom.enabled = off;
SET
postgres=# explain analyze select sum(f) from rt_native;
                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=542340.55..542340.56 rows=1 width=8) (actual time=25575.865..25589.057 rows=1 loops=1)
   ->  Gather  (cost=542340.33..542340.54 rows=2 width=8) (actual time=25574.140..25588.914 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=541340.33..541340.34 rows=1 width=8) (actual time=25544.043..25544.048 rows=1 loops=3)
               ->  Parallel Seq Scan on rt_native  (cost=0.00..499673.47 rows=16666747 width=8) (actual time=11.560..23709.504 rows=13333333 loops=3)
 Planning Time: 0.121 ms
 JIT:
   Functions: 11
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 1.584 ms, Inlining 192.239 ms, Optimization 102.890 ms, Emission 82.575 ms, Total 379.288 ms
 Execution Time: 25590.621 ms
(12 rows)

with pg-strom enabled & jit disabled

postgres=# set pg_strom.enabled = on;
SET
postgres=# set jit=off;
SET
postgres=# explain analyze select sum(f) from rt_native;
                                                                       QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=142456.79..142456.80 rows=1 width=8) (actual time=3009.572..3009.689 rows=1 loops=1)
   ->  Gather  (cost=142456.68..142456.79 rows=1 width=32) (actual time=3009.543..3009.667 rows=1 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Parallel Custom Scan (GpuPreAgg) on rt_native  (cost=141456.68..141456.69 rows=1 width=32) (actual time=2966.459..2966.468 rows=0 loops=3)
               GPU Projection: pgstrom.psum(f)
 Planning Time: 0.287 ms
 Execution Time: 3010.304 ms
(8 rows)

kaigai commented 2 months ago

I could reproduce, now investigating...

kaigai commented 2 months ago

The series of commits above fixed the problem and greatly simplified the CPU fallback code.

postgres=# explain analyze select sum(f) from rt_numeric;
                                                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=105283.03..105283.04 rows=1 width=8) (actual time=1171.960..1172.023 rows=1 loops=1)
   ->  Gather  (cost=105282.92..105283.03 rows=1 width=32) (actual time=1167.162..1167.227 rows=1 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Parallel Custom Scan (GpuPreAgg) on rt_numeric  (cost=104282.92..104282.93 rows=1 width=32) (actual time=1127.820..1127.823 rows=0 loops=3)
               GPU Projection: pgstrom.psum(f)
 Planning Time: 0.688 ms
 JIT:
   Functions: 14
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 0.567 ms, Inlining 0.000 ms, Optimization 0.533 ms, Emission 7.939 ms, Total 9.039 ms
 Execution Time: 1184.577 ms
(12 rows)

ValentinChirikov commented 2 months ago

Hello @kaigai! Thanks, explain analyze select sum(f) from rt_numeric; works fine now.

Could you please also check: explain analyze select sum(a), sum(b), sum(c), sum(e) from rt_native;

I can't recall for certain, but it seems that test was successful before (without the latest commits, but with the postgres patch from https://github.com/heterodb/pg-strom/issues/747#issuecomment-2041025226).

pgstrom_tests=# explain analyze select sum(a), sum(b), sum(c), sum(e) from rt_native ;
ERROR:  gpu_service.c:2572  failed on cuEventSynchronize: CUDA_ERROR_ILLEGAL_ADDRESS
HINT:  device at GPU-0, function at gpuservHandleGpuTaskExec
CONTEXT:  parallel worker

kaigai commented 2 months ago

e624cb054c23a165e96dc31ff1b6d391df03003c should fix the problem.

ValentinChirikov commented 2 months ago

Thank you @kaigai, it works fine now. I think we can close the issue.