pgspider / influxdb_fdw

InfluxDB Foreign Data Wrapper for PostgreSQL.
Other
55 stars 13 forks source link

Segmentation fault caused by SQL subquery #40

Open cobolbaby opened 7 months ago

cobolbaby commented 7 months ago
select version();

-- PostgreSQL 12.17 (Debian 12.17-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit

select * from pg_available_extensions where name = 'influxdb_fdw';

-- "influxdb_fdw"   "1.3"   "1.3"   "InfluxDB Foreign Data Wrapper"

-- DROP SERVER influxdb_iot

CREATE SERVER influxdb_iot
    FOREIGN DATA WRAPPER influxdb_fdw
    OPTIONS (dbname 'iot-env', host 'http://infra-influx.ipt.inventec.net', port '80', version '1', retention_policy '');

-- DROP USER MAPPING FOR bdcuser SERVER influxdb_iot

CREATE USER MAPPING FOR bdcuser SERVER influxdb_iot
    OPTIONS ("user" 'iot', password '******');

-- DROP FOREIGN TABLE dm.iot_sqt_agv_count;

CREATE FOREIGN TABLE dm.iot_sqt_agv_count(
    "time" timestamp with time zone NULL,
    "agvState" character varying NULL COLLATE pg_catalog."default",
    value numeric NULL
)
    SERVER influxdb_iot
    OPTIONS (table 'F3AGV.count', tags 'agvState,value', schemaless 'true');

SELECT max(time) from dm.iot_sqt_agv_count where time > now() - interval '1 d';

-- "2023-12-04 19:15:53.318+08"

select * from dm.iot_sqt_agv_count 
where time = '2023-12-04 19:15:53.318+08'

"2023-12-04 19:15:53.318+08"    "BUSY"  57
"2023-12-04 19:15:53.318+08"    "CHARGING"  7
"2023-12-04 19:15:53.318+08"    "ERROR" 1
"2023-12-04 19:15:53.318+08"    "IDLE"  16
"2023-12-04 19:15:53.318+08"    "LOCKED"    7
"2023-12-04 19:15:53.318+08"    "offlineAGV"    4
"2023-12-04 19:15:53.318+08"    "onlineAGV" 88
"2023-12-04 19:15:53.318+08"    "totalAGV"  92

explain 
select * from dm.iot_sqt_agv_count 
where time = --'2023-12-04 19:15:53.318+08'
(SELECT max(time) from dm.iot_sqt_agv_count where time > now()-interval '1d');

"Foreign Scan on iot_sqt_agv_count  (cost=865.14..859.14 rows=4 width=72)"
"  InitPlan 1 (returns $0)"
"    ->  Aggregate  (cost=855.13..855.14 rows=1 width=8)"
"          ->  Foreign Scan on iot_sqt_agv_count iot_sqt_agv_count_1  (cost=10.00..853.00 rows=853 width=8)"

select * from dm.iot_sqt_agv_count 
where time = --'2023-12-04 19:15:53.318+08'
(SELECT max(time) from dm.iot_sqt_agv_count where time > now()-interval '1d');

ERROR: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

I don't know why I got the error 'Segmentation fault':

2023-12-04 19:18:25.379 CST [11] LOG:  server process (PID 330885) was terminated by signal 11: Segmentation fault
2023-12-04 19:18:25.379 CST [11] DETAIL:  Failed process was running: 
    select * from dm.iot_sqt_agv_count 
    where time = --'2023-12-04 19:15:53.318+08'
    (SELECT max(time) from dm.iot_sqt_agv_count where time > now()-interval '1d');
2023-12-04 19:18:25.379 CST [11] LOG:  terminating any other active server processes
cobolbaby commented 7 months ago

@mkgrgis Maybe you could identify the bug together, Thanks a lot.

mkgrgis commented 7 months ago

Hello, @cobolbaby ! I have never works with InfluxDB and I am not PGSpider team member, but my sqlite_fdw experience allow to locate your problem in https://github.com/pgspider/influxdb_fdw/blob/master/deparse.c This look like there is no needed deparse WHERE rules for your case. Pushdowning of your query to InfluxDB should use syntax transformation but I am not sure if there is pushdowning or if there is correct result of subquery. The main problem also for pure PostgreSQL is the same table in query and subquery. This cause very specific query plans and I am not sure this plans is well-implemented in this FDW.

mkgrgis commented 7 months ago

@cobolbaby , could you try the same query with window functions or with HAVING? Also please explainverbose for

select * from dm.iot_sqt_agv_count 
where time = --'2023-12-04 19:15:53.318+08'
(SELECT max(time) from dm.iot_sqt_agv_count where time > now()-interval '1d');

This can be more helpful sometimes than simple explain.

cobolbaby commented 7 months ago

please explain verbose for

explain verbose
select * from dm.iot_sqt_agv_count 
where time = --'2023-12-04 19:15:53.318+08'
(SELECT max(time) from dm.iot_sqt_agv_count where time > now()-interval '1d');

"Foreign Scan on dm.iot_sqt_agv_count  (cost=865.14..859.14 rows=4 width=72)"
"  Output: iot_sqt_agv_count.""time"", iot_sqt_agv_count.""agvState"", iot_sqt_agv_count.value"
"  InfluxDB query: SELECT * FROM ""F3AGV.count"" WHERE ((time = $1))"
"  InitPlan 1 (returns $0)"
"    ->  Aggregate  (cost=855.13..855.14 rows=1 width=8)"
"          Output: max(iot_sqt_agv_count_1.""time"")"
"          ->  Foreign Scan on dm.iot_sqt_agv_count iot_sqt_agv_count_1  (cost=10.00..853.00 rows=853 width=8)"
"                Output: iot_sqt_agv_count_1.""time"", iot_sqt_agv_count_1.""agvState"", iot_sqt_agv_count_1.value"
"                InfluxDB query: SELECT * FROM ""F3AGV.count"" WHERE ((time > (now() - 1d0h0m0s0u)))"

Perhaps influxdb_fdw does not yet support InitPlan type operators.

cobolbaby commented 7 months ago

could you try the same query with window functions or with HAVING


explain analyse verbose 
SELECT *
FROM (
    SELECT *,
           MAX(time) OVER () AS max_time
    FROM dm.iot_sqt_agv_count
    WHERE time > NOW() - INTERVAL '1 day'
) AS subquery
WHERE time = subquery.max_time;

"Subquery Scan on subquery  (cost=10.00..291.10 rows=1 width=80) (actual time=159.955..159.961 rows=5 loops=1)"
"  Output: subquery.""time"", subquery.""agvState"", subquery.value, subquery.max_time"
"  Filter: (subquery.""time"" = subquery.max_time)"
"  Rows Removed by Filter: 11510"
"  ->  WindowAgg  (cost=10.00..287.55 rows=284 width=80) (actual time=154.067..158.294 rows=11515 loops=1)"
"        Output: iot_sqt_agv_count.""time"", iot_sqt_agv_count.""agvState"", iot_sqt_agv_count.value, max(iot_sqt_agv_count.""time"") OVER (?)"
"        ->  Foreign Scan on dm.iot_sqt_agv_count  (cost=10.00..284.00 rows=284 width=72) (actual time=103.052..147.959 rows=11515 loops=1)"
"              Output: iot_sqt_agv_count.""time"", iot_sqt_agv_count.""agvState"", iot_sqt_agv_count.value"
"              InfluxDB query: SELECT * FROM ""F3AGV.count"" WHERE ((time > (now() - 1d0h0m0s0u)))"
"Planning Time: 0.195 ms"
"Execution Time: 160.233 ms"

explain analyse verbose 
WITH max_time_cte AS (
    SELECT MAX(time) AS max_time
    FROM dm.iot_sqt_agv_count
    WHERE time > NOW() - INTERVAL '1 day'
)
SELECT iot.*
FROM dm.iot_sqt_agv_count iot
JOIN max_time_cte max_time ON iot.time = max_time.max_time;

"Hash Join  (cost=865.17..1710.45 rows=4 width=72) (actual time=5676.963..5691.366 rows=8 loops=1)"
"  Output: iot.""time"", iot.""agvState"", iot.value"
"  Inner Unique: true"
"  Hash Cond: (iot.""time"" = (max(iot_sqt_agv_count.""time"")))"
"  ->  Foreign Scan on dm.iot_sqt_agv_count iot  (cost=10.00..853.00 rows=853 width=72) (actual time=3916.961..5481.379 rows=405138 loops=1)"
"        Output: iot.""time"", iot.""agvState"", iot.value"
"        InfluxDB query: SELECT * FROM ""F3AGV.count"""
"  ->  Hash  (cost=855.15..855.15 rows=1 width=8) (actual time=148.130..148.131 rows=1 loops=1)"
"        Output: (max(iot_sqt_agv_count.""time""))"
"        Buckets: 1024  Batches: 1  Memory Usage: 9kB"
"        ->  Aggregate  (cost=855.13..855.14 rows=1 width=8) (actual time=148.114..148.115 rows=1 loops=1)"
"              Output: max(iot_sqt_agv_count.""time"")"
"              ->  Foreign Scan on dm.iot_sqt_agv_count  (cost=10.00..853.00 rows=853 width=8) (actual time=98.425..146.246 rows=11518 loops=1)"
"                    Output: iot_sqt_agv_count.""time"", iot_sqt_agv_count.""agvState"", iot_sqt_agv_count.value"
"                    InfluxDB query: SELECT * FROM ""F3AGV.count"" WHERE ((time > (now() - 1d0h0m0s0u)))"
"Planning Time: 0.456 ms"
"Execution Time: 5698.183 ms"
cobolbaby commented 7 months ago

I have never works with InfluxDB and I am not PGSpider team member

Many thanks for your reply. Could you inform me about the current maintainer(s) of the project?

mkgrgis commented 7 months ago

@cobolbaby , do yo think this is normal for InfluxDB or the query SELECT * FROM ""F3AGV.count"" WHERE ((time > (now() - 1d0h0m0s0u)))" is not fully correct? For me there is double quoting and unusual interval. Have I mistaken here?

Many thanks for your reply. Could you inform me about the current maintainer(s) of the project?

Yes. The main maintainer of PGSpider projects is @t-kataym. He makes final reviews of my PRs and knows who can help with some little details of code or usage.

cobolbaby commented 7 months ago

For me there is double quoting and unusual interval. Have I mistaken here?

No problem here. The presence of double quotation marks is because I copied the output results from the web ui of pgAdmin.

mkgrgis commented 7 months ago

@cobolbaby, I haven't understood are there any crashes in case of window functions or HAVING? https://github.com/pgspider/influxdb_fdw/issues/40#issuecomment-1842086384

cobolbaby commented 7 months ago

I haven't understood are there any crashes in case of window functions or HAVING? https://github.com/pgspider/influxdb_fdw/issues/40#issuecomment-1842086384

No, the PG wonld not crash in that case.

Perhaps influxdb_fdw does not yet support InitPlan type operators.

@mkgrgis Regarding this issue, can you locate the problematic code block?

mkgrgis commented 6 months ago

Regarding this issue, can you locate the problematic code block?

@cobolbaby , InitPlan look like some execution function in PostgreSQL see https://doxygen.postgresql.org/execMain_8c.html#ae7f37b09dfd050780c4751568a93184f Unfortunately I have no normal experience with plan nodes. Also I can tell you there is no special InitPlan code infrastructure in this FDW, in sqlite_fdw, in firebird_fdw or in oracle_fdw. Look like InitPlan is not one of FDW code infrastructure subjects. Hence I can not understand what exact means "support InitPlan type operators". I think this is implemented in PostgreSQL main code for all FDWs and not only for FDWs.

cobolbaby commented 6 months ago

If the problem isn't with InitPlan, I'm really puzzled about where the issue might be. It's not easy for me to debug step-by-step.

mkgrgis commented 6 months ago

It's not easy for me to debug step-by-step.

It's mostly impossible for beginners, see official description from PGDG. This is also not for me. Even if there is simple testcase like in https://github.com/ibarwick/fbsql/issues/1 gdb debugging in this simple case is not very representative, see discussion around of. In my experience only possible way based on set client_min_messages='debugN'; with some optimal N and speculated code rethinking with a tested changes. This is sometimes enough for such complex and not easy commits like this.

I'm really puzzled about where the issue might be.

The problem is in some of internal plan nodes. Maybe some buffers is not resetted between internal SELECT * FROM "F3AGV.count" WHERE ((time > (now() - 1d0h0m0s0u))) and resulted SELECT * FROM "F3AGV.count" WHERE ((time = $1)). For me it's first version based on usual plan executing with 1 query only.

mkgrgis commented 6 months ago

@cobolbaby , what about psql listing with set client_min_messages='debug4'; and the crash? Most of main FDW function have DEBUG3 initial message. Now you have published only server log.

cobolbaby commented 6 months ago
DEBUG:  influxdb_fdw : influxdb_fdw_handler
DEBUG:  influxdb_fdw : influxdbGetForeignRelSize
DEBUG:  influxdb_fdw : influxdbGetForeignPaths
DEBUG:  influxdb_fdw : influxdbGetForeignRelSize
DEBUG:  influxdb_fdw : influxdbGetForeignPaths
DEBUG:  influxdb_fdw : influxdbGetForeignRelSize
DEBUG:  influxdb_fdw : influxdbGetForeignPaths
DEBUG:  influxdb_fdw : influxdbGetForeignUpperPaths
DEBUG:  influxdb_fdw : influxdbGetForeignUpperPaths
DEBUG:  influxdb_fdw : influxdbGetForeignPlan
DEBUG:  influxdb_fdw : influxdbGetForeignRelSize
DEBUG:  influxdb_fdw : influxdbGetForeignPaths
DEBUG:  influxdb_fdw : influxdbGetForeignUpperPaths
DEBUG:  influxdb_fdw : influxdbGetForeignPlan
DEBUG:  influxdb_fdw : influxdbBeginForeignScan
DEBUG:  influxdb_fdw : influxdbBeginForeignScan
DEBUG:  influxdb_fdw : influxdbIterateForeignScan
DEBUG:  influxdb_fdw : influxdbIterateForeignScan
DEBUG:  influxdb_fdw: new InfluxDB connection 0x5d52c78fd050 for server "influxdb_iot" (user mapping oid 57344, userid 10)
DEBUG:  influxdb_fdw : query: SELECT * FROM "F3AGV.count" WHERE ((time > (now() - 1d0h0m0s0u)))
DEBUG:  influxdb_fdw : influxdbIterateForeignScan
...
...
DEBUG:  influxdb_fdw : influxdbIterateForeignScan
DEBUG:  influxdb_fdw : influxdbIterateForeignScan
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
mkgrgis commented 6 months ago

So. the problem is near influxdbIterateForeignScan https://github.com/pgspider/influxdb_fdw/blob/c7cc9f2138b87b6761476fed11381104022bb0de/influxdb_fdw.c#L1593-L1594 during second query SELECT * FROM "F3AGV.count" WHERE ((time = $1)). Also can be affected influxdbBeginForeignScan or influxdbEndForeignScan if there is problem of unclear buffer.

cobolbaby commented 1 month ago

Magical phenomenon: If the timestamp comparison is converted into a string comparison, or the equality comparison is converted into an in substatement, it can be executed.

explain analyse verbose
select time,"agvState",value
        from dm.iot_sqt_agv_count
        where to_char(time, 'YYYY-MM-DD HH24:MI:SS') =
(SELECT to_char(max(time), 'YYYY-MM-DD HH24:MI:SS') from dm.iot_sqt_agv_count where time > now() - interval '1d')

/*
"Foreign Scan on dm.iot_sqt_agv_count  (cost=865.15..859.15 rows=4 width=72) (actual time=5984.461..5996.245 rows=8 loops=1)"
"  Output: iot_sqt_agv_count.""time"", iot_sqt_agv_count.""agvState"", iot_sqt_agv_count.value"
"  Filter: (to_char(iot_sqt_agv_count.""time"", 'YYYY-MM-DD HH24:MI:SS'::text) = $0)"
"  Rows Removed by Filter: 402777"
"  InfluxDB query: SELECT * FROM ""F3AGV.count"""
"  InitPlan 1 (returns $0)"
"    ->  Aggregate  (cost=855.13..855.15 rows=1 width=32) (actual time=143.270..143.272 rows=1 loops=1)"
"          Output: to_char(max(iot_sqt_agv_count_1.""time""), 'YYYY-MM-DD HH24:MI:SS'::text)"
"          ->  Foreign Scan on dm.iot_sqt_agv_count iot_sqt_agv_count_1  (cost=10.00..853.00 rows=853 width=8) (actual time=99.992..142.025 rows=11516 loops=1)"
"                Output: iot_sqt_agv_count_1.""time"", iot_sqt_agv_count_1.""agvState"", iot_sqt_agv_count_1.value"
"                InfluxDB query: SELECT * FROM ""F3AGV.count"" WHERE ((time > (now() - 1d0h0m0s0u)))"
"Planning Time: 0.478 ms"
"Execution Time: 6014.964 ms"
*/

explain analyse verbose
select time,"agvState",value
        from dm.iot_sqt_agv_count
        where time in 
(SELECT max(time) from dm.iot_sqt_agv_count where time > now() - interval '1d')

/*
"Hash Join  (cost=865.16..1710.44 rows=4 width=72) (actual time=5206.469..5219.522 rows=8 loops=1)"
"  Output: iot_sqt_agv_count.""time"", iot_sqt_agv_count.""agvState"", iot_sqt_agv_count.value"
"  Inner Unique: true"
"  Hash Cond: (iot_sqt_agv_count.""time"" = (max(iot_sqt_agv_count_1.""time"")))"
"  ->  Foreign Scan on dm.iot_sqt_agv_count  (cost=10.00..853.00 rows=853 width=72) (actual time=3490.447..5005.403 rows=402817 loops=1)"
"        Output: iot_sqt_agv_count.""time"", iot_sqt_agv_count.""agvState"", iot_sqt_agv_count.value"
"        InfluxDB query: SELECT * FROM ""F3AGV.count"""
"  ->  Hash  (cost=855.14..855.14 rows=1 width=8) (actual time=146.651..146.654 rows=1 loops=1)"
"        Output: (max(iot_sqt_agv_count_1.""time""))"
"        Buckets: 1024  Batches: 1  Memory Usage: 9kB"
"        ->  Aggregate  (cost=855.13..855.14 rows=1 width=8) (actual time=146.641..146.643 rows=1 loops=1)"
"              Output: max(iot_sqt_agv_count_1.""time"")"
"              ->  Foreign Scan on dm.iot_sqt_agv_count iot_sqt_agv_count_1  (cost=10.00..853.00 rows=853 width=8) (actual time=103.308..145.374 rows=11516 loops=1)"
"                    Output: iot_sqt_agv_count_1.""time"", iot_sqt_agv_count_1.""agvState"", iot_sqt_agv_count_1.value"
"                    InfluxDB query: SELECT * FROM ""F3AGV.count"" WHERE ((time > (now() - 1d0h0m0s0u)))"
"Planning Time: 0.576 ms"
"Execution Time: 5242.908 ms"
*/

Who can help fix this critical bug? @all

mkgrgis commented 1 month ago

@cobolbaby , look like this repository is not very active. Maybe someone from pgspider team will read our diagnostics.

weiting1lee commented 1 month ago

@cobolbaby and @mkgrgis ,

I apologize for the inconvenience caused by this issue. I'm investigating the segmentation fault, which seems related to InitPlan type operators. I will thoroughly analyze influxdbIterateForeignScan and other related functions.

Thanks for your detailed information and suggestions. I'll keep you updated on our progress.

Thank you for your patience and support!

mkgrgis commented 1 month ago

@weiting1lee , could someone from pgspider team also investigate not compilable for new PostreSQL version dynamodb_fdw/ ?

weiting1lee commented 1 month ago

@mkgrgis I will notify the team internally to follow it.

cobolbaby commented 1 month ago

I'm investigating the segmentation fault, which seems related to InitPlan type operators. I will thoroughly analyze influxdbIterateForeignScan and other related functions.

@weiting1lee Any progress ?

weiting1lee commented 1 month ago

@cobolbaby We are currently investigating the segmentation fault issue you reported. Based on our findings, the problem arises due to the parameter not being initialized during the scanning query execution. We are discussing solutions. We will update here as we make progress.