timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/

[Bug]: Valgrind detects uninitialized value read during regresscheck-t #5931

Closed alexanderlaw closed 7 months ago

alexanderlaw commented 1 year ago

What type of bug is this?

Other

What subsystems and features are affected?

Multi-node

What happened?

When running make regresscheck-t under Valgrind, I get several anomalies reported, the most interesting of which is:
==00:00:37:52.160 2241587== Conditional jump or move depends on uninitialised value(s)
==00:00:37:52.160 2241587==    at 0x112496B6: remote_connection_cache_invalidate_callback (connection_cache.c:273)

As the stack trace shows, remote_connection_cache_invalidate_callback() reads a connection cache entry (specifically entry->role_hashvalue) that is still being created in connection_cache_create_entry() and whose role_hashvalue field has not been set yet.
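
The ordering problem can be illustrated with a small standalone sketch. This is not the TimescaleDB code: the Entry struct, the cache array, open_session(), create_entry() and the POISON marker are all hypothetical stand-ins, and the poison value only imitates uninitialized memory so that the example stays well-defined C. It assumes that opening the session can deliver an invalidation message while the new entry is already visible in the cache, as the stack trace suggests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical marker that stands in for uninitialized memory. */
#define POISON 0xAAAAAAAAu
#define CACHE_SIZE 4

/* Hypothetical, heavily simplified cache entry. */
typedef struct Entry
{
	bool in_use;
	bool invalid;
	uint32_t role_hashvalue;
} Entry;

static Entry cache[CACHE_SIZE];
static uint32_t last_observed; /* value the callback saw in the entry */

/* Imitates the invalidation callback: scans all visible entries. */
static void
invalidate_callback(uint32_t hashvalue)
{
	for (int i = 0; i < CACHE_SIZE; i++)
	{
		Entry *e = &cache[i];

		if (!e->in_use)
			continue;
		last_observed = e->role_hashvalue;
		/* This comparison is the kind of read Valgrind flags. */
		if (e->role_hashvalue == hashvalue)
			e->invalid = true;
	}
}

/* Imitates opening a session: taking a lock processes pending
 * invalidation messages, which invokes the callback re-entrantly. */
static void
open_session(uint32_t pending_inval)
{
	invalidate_callback(pending_inval);
}

/* set_hash_first = false reproduces the problematic ordering;
 * true shows one possible (assumed) remedy. */
static Entry *
create_entry(uint32_t role_hash, uint32_t pending_inval, bool set_hash_first)
{
	Entry *e = &cache[0];

	e->in_use = true; /* entry is visible to the callback from here on */
	e->invalid = false;
	e->role_hashvalue = POISON;

	if (set_hash_first)
		e->role_hashvalue = role_hash;
	open_session(pending_inval);
	if (!set_hash_first)
		e->role_hashvalue = role_hash;
	return e;
}

/* Runs both orderings and checks what the callback observed. */
static void
check_orderings(void)
{
	Entry *e = create_entry(42u, 42u, false);
	assert(last_observed == POISON && !e->invalid);

	e = create_entry(42u, 42u, true);
	assert(last_observed == 42u && e->invalid);
}
```

In the first ordering the callback compares against the poison value and misses an invalidation that targets the entry's role; in the second it sees the real hash. Whether setting the field before opening the session is the right fix for the actual code is an assumption of this sketch.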

TimescaleDB version affected

2.12.0-dev

PostgreSQL version used

15.3

What operating system did you use?

Ubuntu 22.04 x86_64

What installation method did you use?

Source

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

2023-08-07 08:21:20.330 MSK cluster_super_user [2241587] db_cagg_invalidation_dist_ht LOG:  00000: statement: GRANT CREATE ON SCHEMA public TO default_perm_user, default_perm_user_2;
2023-08-07 08:21:20.330 MSK cluster_super_user [2241587] db_cagg_invalidation_dist_ht LOCATION:  exec_simple_query, postgres.c:1050
...
==00:00:37:52.160 2241587== Conditional jump or move depends on uninitialised value(s)
==00:00:37:52.160 2241587==    at 0x112496B6: remote_connection_cache_invalidate_callback (connection_cache.c:273)
==00:00:37:52.160 2241587==    by 0x111D540C: cache_syscache_invalidate (init.c:77)
==00:00:37:52.160 2241587==    by 0x110F9A7C: cache_invalidate_syscache_callback (cache_invalidate.c:118)
==00:00:37:52.160 2241587==    by 0x6340C8: CallSyscacheCallbacks (inval.c:1593)
==00:00:37:52.160 2241587==    by 0x634118: LocalExecuteInvalidationMessage (inval.c:625)
==00:00:37:52.160 2241587==    by 0x501847: ReceiveSharedInvalidMessages (sinval.c:120)
==00:00:37:52.160 2241587==    by 0x6337C3: AcceptInvalidationMessages (inval.c:748)
==00:00:37:52.160 2241587==    by 0x5064A6: LockRelationOid (lmgr.c:137)
==00:00:37:52.160 2241587==    by 0x1E3805: relation_open (relation.c:56)
==00:00:37:52.160 2241587==    by 0x24ED46: table_open (table.c:43)
==00:00:37:52.160 2241587==    by 0x111296DF: index_scanner_open (scanner.c:90)
==00:00:37:52.160 2241587==    by 0x1112992B: ts_scanner_open (scanner.c:255)
==00:00:37:52.160 2241587==    by 0x11129A44: ts_scanner_start_scan (scanner.c:285)
==00:00:37:52.160 2241587==    by 0x11129E51: ts_scanner_scan (scanner.c:462)
==00:00:37:52.160 2241587==    by 0x111427FC: metadata_get_value_internal (metadata.c:112)
==00:00:37:52.160 2241587==    by 0x11142A4D: ts_metadata_get_value (metadata.c:123)
==00:00:37:52.160 2241587==    by 0x1124746C: remote_connection_set_peer_dist_id (connection.c:1353)
==00:00:37:52.160 2241587==    by 0x11247FBD: remote_connection_open_session (connection.c:1776)
==00:00:37:52.160 2241587==    by 0x112480AA: remote_connection_open_session_by_id (connection.c:1800)
==00:00:37:52.160 2241587==    by 0x11249381: connection_cache_create_entry (connection_cache.c:167)
==00:00:37:52.160 2241587==    by 0x110F98DC: ts_cache_fetch (cache.c:199)
==00:00:37:52.160 2241587==    by 0x1124944B: remote_connection_cache_get_connection (connection_cache.c:244)
==00:00:37:52.160 2241587==    by 0x11259844: remote_txn_store_get (txn_store.c:55)
==00:00:37:52.160 2241587==    by 0x1124D5D7: remote_dist_txn_get_connection (dist_txn.c:102)
==00:00:37:52.160 2241587==    by 0x111CFC26: data_node_get_connection (data_node.c:266)
==00:00:37:52.160 2241587==    by 0x1124DA44: ts_dist_multi_cmds_params_invoke_on_data_nodes (dist_commands.c:123)
==00:00:37:52.160 2241587==    by 0x1124DBB2: ts_dist_cmd_params_invoke_on_data_nodes (dist_commands.c:159)
==00:00:37:52.160 2241587==    by 0x1124DBF9: ts_dist_cmd_invoke_on_data_nodes (dist_commands.c:167)
==00:00:37:52.160 2241587==    by 0x1124E0BA: ts_dist_cmd_invoke_on_data_nodes_using_search_path (dist_commands.c:190)
==00:00:37:52.160 2241587==    by 0x112540CE: dist_ddl_execute (dist_ddl.c:1036)
==00:00:37:52.160 2241587==    by 0x11254214: dist_ddl_start (dist_ddl.c:1211)
==00:00:37:52.160 2241587==    by 0x111D70A4: tsl_ddl_command_start (process_utility.c:29)
==00:00:37:52.160 2241587==    by 0x11126CA8: timescaledb_ddl_command_start (process_utility.c:4539)
==00:00:37:52.160 2241587==    by 0x52326F: ProcessUtility (utility.c:526)
==00:00:37:52.160 2241587==    by 0x520B2B: PortalRunUtility (pquery.c:1158)
==00:00:37:52.160 2241587==    by 0x520D8D: PortalRunMulti (pquery.c:1315)
==00:00:37:52.160 2241587==    by 0x521070: PortalRun (pquery.c:791)
==00:00:37:52.160 2241587==    by 0x51D716: exec_simple_query (postgres.c:1250)
==00:00:37:52.160 2241587==    by 0x51F59D: PostgresMain (postgres.c:4598)
==00:00:37:52.160 2241587==    by 0x496E19: BackendRun (postmaster.c:4511)
==00:00:37:52.160 2241587==    by 0x499B9D: BackendStartup (postmaster.c:4239)

How can we reproduce the bug?

The issue can be reproduced with the following script:
psql -c "CREATE EXTENSION timescaledb;"

cat << 'EOF' | psql &
SELECT add_data_node('node1', host => 'localhost', DATABASE => 'node1');
SELECT pg_sleep(10);
GRANT CREATE ON SCHEMA public TO public;
EOF

cat << 'EOF' | psql &
SELECT pg_sleep(20);
CREATE ROLE user1;
SELECT pg_sleep(0.5);
CREATE ROLE user2;
SELECT pg_sleep(0.5);
CREATE ROLE user3;
SELECT pg_sleep(0.5);
CREATE ROLE user4;
SELECT pg_sleep(0.5);
CREATE ROLE user5;
SELECT pg_sleep(0.5);
EOF
wait

With the following parameters in postgresql.conf:
shared_preload_libraries=timescaledb
timescaledb_experimental.enable_distributed_ddl=on
max_prepared_transactions=10
wal_level='logical'

fabriziomello commented 7 months ago

In 2.13.0 we announced the deprecation of multi-node, and it will be completely removed in the upcoming 2.14.0.