timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/

[Bug]: child rel 1 not found in append_rel_array when using to_timestamp() with bigint timestamp #7078

Open chris-morris-h2o opened 3 days ago

chris-morris-h2o commented 3 days ago

What type of bug is this?

Unexpected error

What subsystems and features are affected?

Query executor

What happened?

When querying a hypertable, adding "to_timestamp(t_stamp/1000.0)" to the select list causes the following error, but only for some chunks:

ERROR:  child rel 1 not found in append_rel_array 

SQL state: XX000

An example query that produces this message when executed from pgAdmin:

select *, to_timestamp(t_stamp/1000.0) from sqlth_1_data
where tagid in (483426, 565356, 602444, 603707, 609154, 928947)
    AND t_stamp < 1719578853000
    AND t_stamp > 1704088800000

And another example that produces the error when executed from pgAdmin:

select *, to_timestamp(t_stamp/1000.0) from sqlth_1_data where tagid = 928947
    AND t_stamp < 1719578853000
    AND t_stamp > 1704088800000

I set up a Python script to check whether I had a corrupted chunk or something similar; it loops through my chunks in descending order, one day at a time:

import logging
from datetime import datetime, timedelta

import pytz

logging.basicConfig(level=logging.INFO)
# conn: an already-open psycopg2 connection (connection setup omitted)

central_tz = pytz.timezone('America/Chicago')

# Current date in Central Time
current_date = datetime.now(central_tz).replace(hour=0, minute=0, second=0, microsecond=0)
current_date = current_date - timedelta(days=10)
# Date 2 years ago in Central Time
two_years_ago = current_date - timedelta(days=2*365)

# Iterate over each day from now to 2 years ago
while current_date >= two_years_ago:
    # Calculate start and end timestamps for the current day in Central Time
    start_timestamp = int(current_date.timestamp() * 1000)
    end_date = current_date - timedelta(days=1)
    end_timestamp = int(end_date.timestamp() * 1000)

    # SQL query
    query = f"""
    SELECT tagid, intvalue, floatvalue, stringvalue, datevalue, dataintegrity, t_stamp
    FROM sqlth_1_data
    WHERE t_stamp < {start_timestamp} AND t_stamp > {end_timestamp}
        AND tagid IN (483426, 565356, 602444, 603707, 609154, 928947)
    LIMIT 1;
    """
    try:
        with conn.cursor(name='chunk_checker_cursor') as cursor:
            cursor.execute(query)
            result = cursor.fetchone()
            if result:
                logging.info(f"Date range: {end_date.strftime('%Y-%m-%d')} to {current_date.strftime('%Y-%m-%d')}: Result: {result}")
            else:
                logging.info(f"Date range: {end_date.strftime('%Y-%m-%d')} to {current_date.strftime('%Y-%m-%d')}: No results")
        conn.commit()
    except Exception as e:
        logging.error(f"Date range: {end_date.strftime('%Y-%m-%d')} to {current_date.strftime('%Y-%m-%d')}: Error: {str(e)}")
        conn.rollback()

    # Move to the previous day
    current_date = end_date

# Close the connection
conn.close()

Sample log from the code above:

2024-06-28 08:47:18,419 - INFO - Date range: 2024-06-17 to 2024-06-18: Result: (609154, None, 0.0, None, None, 192, 1718671630372)
2024-06-28 08:47:18,543 - INFO - Date range: 2024-06-16 to 2024-06-17: Result: (609154, None, 0.0, None, None, 192, 1718575240435)
2024-06-28 08:47:18,676 - INFO - Date range: 2024-06-15 to 2024-06-16: Result: (609154, None, 0.0, None, None, 192, 1718491190342)
2024-06-28 08:47:18,796 - INFO - Date range: 2024-06-14 to 2024-06-15: Result: (609154, None, 0.0, None, None, 192, 1718403885422)

If I add ", to_timestamp(t_stamp/1000.0)" after the columns:


SELECT tagid, intvalue, floatvalue, stringvalue, datevalue, dataintegrity, t_stamp, to_timestamp(t_stamp/1000.0)
    FROM sqlth_1_data
    WHERE t_stamp < {start_timestamp} AND t_stamp > {end_timestamp}
        AND tagid IN (483426, 565356, 602444, 603707, 609154, 928947)
    LIMIT 1;

I end up with the error again, not in every chunk but in a large portion of them.

2024-06-28 08:49:04,706 - INFO - Date range: 2024-04-25 to 2024-04-26: Result: (609154, None, 0.0, None, None, 192, 1714081565466, datetime.datetime(2024, 4, 25, 21, 46, 5, 466000, tzinfo=datetime.timezone.utc))
2024-06-28 08:49:04,900 - ERROR - Date range: 2024-04-24 to 2024-04-25: Error: child rel 1 not found in append_rel_array
====ERROR MESSAGE CONTINUES FOR EVERY DAY UNTIL=====
2024-06-28 08:50:41,102 - ERROR - Date range: 2023-02-13 to 2023-02-14: Error: child rel 1 not found in append_rel_array
2024-06-28 08:50:41,203 - INFO - Date range: 2023-02-12 to 2023-02-13: No results
2024-06-28 08:50:41,395 - ERROR - Date range: 2023-02-11 to 2023-02-12: Error: child rel 1 not found in append_rel_array
2024-06-28 08:50:41,544 - ERROR - Date range: 2023-02-10 to 2023-02-11: Error: child rel 1 not found in append_rel_array
2024-06-28 08:50:41,659 - INFO - Date range: 2023-02-09 to 2023-02-10: No results
2024-06-28 08:50:41,787 - INFO - Date range: 2023-02-08 to 2023-02-09: No results

Removing to_timestamp(t_stamp/1000.0) from the two example queries at the start of the issue also stops the error; they return rows normally (or no rows if none match). I also tried variations on the select list:

Using an asterisk instead: *, to_timestamp(t_stamp/1000.0)

Dropping the decimal on the 1000: *, to_timestamp(t_stamp/1000)

The to_timestamp call by itself, without the other columns, still causes the error.
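Since the queries succeed whenever the conversion is removed, one workaround sketch (not a fix) is converting the epoch-millisecond t_stamp client-side instead of in SQL; the helper name below is my own:

```python
from datetime import datetime, timezone

def ms_to_timestamptz(t_stamp_ms: int) -> datetime:
    """Convert a bigint epoch-millisecond t_stamp to an aware UTC datetime,
    mirroring what to_timestamp(t_stamp/1000.0) returns server-side."""
    return datetime.fromtimestamp(t_stamp_ms / 1000.0, tz=timezone.utc)

# The t_stamp value from the 2024-04-25 log line above:
print(ms_to_timestamptz(1714081565466))  # 2024-04-25 21:46:05.466000+00:00
```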

At first I thought it might only be happening on uncompressed chunks, but my script gets results from both compressed and uncompressed chunks:

2024-06-28 08:49:33,200 - INFO - Date range: 2024-04-25 to 2024-04-26: Result: (609154, None, 0.0, None, None, 192, 1714081565466, datetime.datetime(2024, 4, 25, 21, 46, 5, 466000, tzinfo=datetime.timezone.utc))

That chunk is compressed, according to my query listing the uncompressed chunks:

select hypertable_name, chunk_name,
       to_timestamp(range_start_integer/1000.0),
       to_timestamp(range_end_integer/1000.0)
from timescaledb_information.chunks
where is_compressed = false
  AND hypertable_name = 'sqlth_1_data'
order by range_start_integer

This outputs the following as the earliest two uncompressed chunks (the 1997 chunk exists because some data was inserted with a bad timestamp):

"sqlth_1_data"  "_hyper_282_12123_chunk"    "1997-04-09 00:00:00+00"    "1997-04-10 00:00:00+00"
"sqlth_1_data"  "_hyper_282_6312_chunk" "2024-05-18 00:00:00+00"    "2024-05-19 00:00:00+00"

Then I thought maybe the t_stamp column had somehow been removed from some chunks, but looping through them day by day with every column specified in the select statement shows this is not the case.

I also tried running some of the variations above with SET client_min_messages TO DEBUG5; but the output didn't change.

Here is my table creation script from pgAdmin for my hypertable:

-- Table: public.sqlth_1_data

-- DROP TABLE IF EXISTS public.sqlth_1_data;

CREATE TABLE IF NOT EXISTS public.sqlth_1_data
(
    tagid integer NOT NULL,
    intvalue bigint,
    floatvalue real,
    stringvalue character varying(255) COLLATE pg_catalog."default",
    datevalue timestamp without time zone,
    dataintegrity integer,
    t_stamp bigint NOT NULL,
    CONSTRAINT "PK__sqlth_1___BE126DD12A1EC133" PRIMARY KEY (tagid, t_stamp)
)

TABLESPACE pg_default;

ALTER TABLE IF EXISTS public.sqlth_1_data
    OWNER to postgres;

ALTER TABLE IF EXISTS public.sqlth_1_data
    ALTER COLUMN tagid SET STATISTICS 2000;

ALTER TABLE IF EXISTS public.sqlth_1_data
    ALTER COLUMN intvalue SET STATISTICS 2000;

ALTER TABLE IF EXISTS public.sqlth_1_data
    ALTER COLUMN floatvalue SET STATISTICS 2000;

ALTER TABLE IF EXISTS public.sqlth_1_data
    ALTER COLUMN stringvalue SET STATISTICS 1500;

ALTER TABLE IF EXISTS public.sqlth_1_data
    ALTER COLUMN datevalue SET STATISTICS 2000;

ALTER TABLE IF EXISTS public.sqlth_1_data
    ALTER COLUMN dataintegrity SET STATISTICS 1000;

ALTER TABLE IF EXISTS public.sqlth_1_data
    ALTER COLUMN t_stamp SET STATISTICS 2000;
-- Index: idx_sqlth_optimized_new

-- DROP INDEX IF EXISTS public.idx_sqlth_optimized_new;

CREATE INDEX IF NOT EXISTS idx_sqlth_optimized_new
    ON public.sqlth_1_data USING btree
    (tagid ASC NULLS LAST, t_stamp DESC NULLS FIRST)
    INCLUDE(intvalue, floatvalue, stringvalue, datevalue, dataintegrity)
    TABLESPACE pg_default;

-- Trigger: ts_cagg_invalidation_trigger

-- DROP TRIGGER IF EXISTS ts_cagg_invalidation_trigger ON public.sqlth_1_data;

CREATE OR REPLACE TRIGGER ts_cagg_invalidation_trigger
    AFTER INSERT OR DELETE OR UPDATE 
    ON public.sqlth_1_data
    FOR EACH ROW
    EXECUTE FUNCTION _timescaledb_internal.continuous_agg_invalidation_trigger('282');

-- Trigger: ts_insert_blocker

-- DROP TRIGGER IF EXISTS ts_insert_blocker ON public.sqlth_1_data;

CREATE OR REPLACE TRIGGER ts_insert_blocker
    BEFORE INSERT
    ON public.sqlth_1_data
    FOR EACH ROW
    EXECUTE FUNCTION _timescaledb_internal.insert_blocker();

We are using 24-hour chunks on our hypertable and compress after 45 days. We have around 77 terabytes uncompressed and 3.5 terabytes compressed across 3 databases running on this one self-managed Postgres server. The server is an Azure VM, Standard E48ds v5, with 48 vCPUs and 384 GB of RAM.

It is also running on top of OpenZFS. I can provide any other relevant information needed.

TimescaleDB version affected

2.11

PostgreSQL version used

14.12

What operating system did you use?

Ubuntu 22.04 x64

What installation method did you use?

Deb/Apt

What platform did you run on?

Microsoft Azure Cloud

Relevant log output and stack trace

From my postgresql log on the server itself I don't have any additional information. I can enable additional logging or provide additional information as needed.

Example of the error looking the same in the server log:

2024-06-28 14:14:30.136 UTC [2057310] postgres@history_sdb01 ERROR:  child rel 1 not found in append_rel_array
2024-06-28 14:14:30.136 UTC [2057310] postgres@history_sdb01 STATEMENT:  SET client_min_messages TO DEBUG5;
        select *, to_timestamp(t_stamp/1000.0) from sqlth_1_data where tagid = 928947
            AND t_stamp < 1719578853000
                AND t_stamp > 1704088800000
2024-06-28 14:15:17.057 UTC [2057310] postgres@history_sdb01 ERROR:  child rel 1 not found in append_rel_array
2024-06-28 14:15:17.057 UTC [2057310] postgres@history_sdb01 STATEMENT:  SET client_min_messages TO DEBUG5;
        select *, to_timestamp(t_stamp/1000.0) from sqlth_1_data where tagid = 928947
            AND t_stamp < 1719578853000
                AND t_stamp > 1704088800000

How can we reproduce the bug?

I am not sure how to reproduce this bug. I can reproduce it on my side on demand, but I have no idea why it happens on some chunks and not others, and only when to_timestamp() is called.
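A minimal setup along these lines might be a starting point for reproduction; the table name, chunk interval, and compression settings here are assumptions modeled on the schema above, and I have not confirmed that this triggers the error:

```sql
-- Hypothetical minimal reproduction sketch (untested against this bug)
CREATE TABLE repro_data (
    tagid integer NOT NULL,
    floatvalue real,
    t_stamp bigint NOT NULL,
    PRIMARY KEY (tagid, t_stamp)
);
SELECT create_hypertable('repro_data', 't_stamp',
    chunk_time_interval => 86400000);  -- 24-hour chunks in milliseconds
ALTER TABLE repro_data SET (timescaledb.compress,
    timescaledb.compress_segmentby = 'tagid');
-- Insert data spanning several chunks, compress some of them:
SELECT compress_chunk(c) FROM show_chunks('repro_data') c;
-- Then run the failing query shape:
SELECT *, to_timestamp(t_stamp/1000.0) FROM repro_data
WHERE tagid = 1 AND t_stamp < 1719578853000 AND t_stamp > 1704088800000;
```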
chris-morris-h2o commented 1 day ago

I saw another similar bug here (https://github.com/timescale/timescaledb/issues/6140) where it was suggested to run within psql with \set VERBOSITY verbose. I did this and received this output:

postgres=# \set VERBOSITY verbose
postgres=# \c history_sdb01
You are now connected to database "history_sdb01" as user "postgres".
history_sdb01=# select *, to_timestamp(t_stamp/1000.0) from sqlth_1_data where tagid in (483426,
565356,
602444,
603707,
609154,
928947)
    AND t_stamp < 1719578853000
        AND t_stamp > 1704088800000;
ERROR:  XX000: child rel 1 not found in append_rel_array
LOCATION:  find_appinfos_by_relids, appendinfo.c:730
chris-morris-h2o commented 1 day ago

I ran some more tests; it is not isolated to the to_timestamp() function. Any operation or cast on the t_stamp column causes the bug:

history_sdb01=# SELECT *, t_stamp::numeric as t_stamp_numeric
FROM sqlth_1_data
WHERE tagid IN (483426, 565356, 602444, 603707, 609154, 928947)
    AND t_stamp < 1719578853000
    AND t_stamp > 1704088800000;
ERROR:  XX000: child rel 1 not found in append_rel_array
LOCATION:  find_appinfos_by_relids, appendinfo.c:730

It does not seem to be just the t_stamp column either; adding tagid+1 to the select list also triggers the issue:

    query = f"""
    SELECT tagid, intvalue, floatvalue, stringvalue, datevalue, dataintegrity, t_stamp, tagid+1
    FROM sqlth_1_data
    WHERE t_stamp < {start_timestamp} AND t_stamp > {end_timestamp}
        AND tagid IN (483426, 565356, 602444, 603707, 609154, 928947)
    LIMIT 1;
    """

2024-07-01 09:37:04,679 - INFO - Date range: 2024-04-27 to 2024-04-28: Result: (609154, None, 0.0, None, None, 192, 1714258210519, 609155)
2024-07-01 09:37:04,767 - INFO - Date range: 2024-04-26 to 2024-04-27: Result: (609154, None, 0.0, None, None, 192, 1714173390490, 609155)
2024-07-01 09:37:04,853 - INFO - Date range: 2024-04-25 to 2024-04-26: Result: (609154, None, 0.0, None, None, 192, 1714081565466, 609155)
2024-07-01 09:37:04,982 - ERROR - Date range: 2024-04-24 to 2024-04-25: Error: child rel 1 not found in append_rel_array

2024-07-01 09:37:05,103 - ERROR - Date range: 2024-04-23 to 2024-04-24: Error: child rel 1 not found in append_rel_array

2024-07-01 09:37:05,163 - ERROR - Date range: 2024-04-22 to 2024-04-23: Error: child rel 1 not found in append_rel_array

The error also occurs when the expression is applied on top of a CTE:

history_sdb01=# WITH base_data AS (
  SELECT *
  FROM sqlth_1_data
  WHERE tagid IN (483426, 565356, 602444, 603707, 609154, 928947)
      AND t_stamp < 1719578853000
      AND t_stamp > 1704088800000
)
SELECT *, to_timestamp(t_stamp/1000.0) FROM base_data;
ERROR:  XX000: child rel 2 not found in append_rel_array
LOCATION:  find_appinfos_by_relids, appendinfo.c:730
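An untested idea that might be worth checking: PostgreSQL 12+ inlines plain CTEs by default, so the computed column in the outer select presumably still reaches the chunk-expansion planning that fails. Forcing materialization keeps the expression out of the inner scan; I do not know whether it avoids the error:

```sql
WITH base_data AS MATERIALIZED (
  SELECT *
  FROM sqlth_1_data
  WHERE tagid IN (483426, 565356, 602444, 603707, 609154, 928947)
      AND t_stamp < 1719578853000
      AND t_stamp > 1704088800000
)
SELECT *, to_timestamp(t_stamp/1000.0) FROM base_data;
```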