open-metadata / OpenMetadata

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
https://open-metadata.org
Apache License 2.0

Slow Ingestion with Partitioned Tables #18073

Open datapopcorn opened 1 month ago

datapopcorn commented 1 month ago

Affected module Ingestion

Describe the bug I recently created a schema containing a single table, and the ingestion process completed in less than 3 seconds. However, after I partitioned this table into 1000 partitions, the ingestion time increased to more than 3 minutes, even though the partition tables themselves are not ingested into OpenMetadata. I would like to understand why the ingestion pipeline slows down and how I can optimize it (perhaps by skipping the checks on partition tables).

To Reproduce

  1. Create a partitioned table using the following SQL:
    CREATE TABLE events (
        event_id SERIAL,
        event_name VARCHAR(100),
        event_date DATE NOT NULL,
        PRIMARY KEY (event_id, event_date)
    ) PARTITION BY RANGE (event_date);
  2. Create partitions for the table (example for 1000 partitions):
    DO $$
    DECLARE
        i INT;
    BEGIN
        FOR i IN 1..1000 LOOP
            EXECUTE format('
                CREATE TABLE events_partition_%s PARTITION OF events
                FOR VALUES FROM (%L) TO (%L);',
                i,
                '2023-01-01'::DATE + (i - 1) * INTERVAL '1 day',
                '2023-01-01'::DATE + i * INTERVAL '1 day');
        END LOOP;
    END $$;
  3. Ingest the schema into OpenMetadata and observe the time taken.
  4. Repeat step 3 with 2000 partition tables and compare the ingestion times.
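
To make the comparison between 1000 and 2000 partitions easier to repeat, the partition loop from step 2 could also be generated outside the database. The following is only a sketch: the helper name `partition_statements` and its parameters are made up for illustration, and it simply mirrors the daily ranges used above rather than anything OpenMetadata provides.

```python
from datetime import date, timedelta


def partition_statements(count, start=date(2023, 1, 1), parent="events"):
    """Build one CREATE TABLE ... PARTITION OF statement per daily partition.

    Hypothetical helper: it reproduces the DO block from step 2, with
    partition i covering the single day start + (i - 1).
    """
    statements = []
    for i in range(1, count + 1):
        lower = start + timedelta(days=i - 1)  # inclusive lower bound
        upper = start + timedelta(days=i)      # exclusive upper bound
        statements.append(
            f"CREATE TABLE {parent}_partition_{i} PARTITION OF {parent} "
            f"FOR VALUES FROM ('{lower.isoformat()}') TO ('{upper.isoformat()}');"
        )
    return statements


# Example: write the statements for 2000 partitions to a file for psql.
# with open("partitions_2000.sql", "w") as handle:
#     handle.write("\n".join(partition_statements(2000)))
```

Generating the statements this way keeps the partition count as the only variable between runs, so the ingestion times can be compared directly.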

Observed Behavior The ingestion time increases as the number of partitioned tables grows, despite the partition tables themselves not being ingested into OpenMetadata.


Expected behavior Ingestion time should remain relatively consistent regardless of the number of partition tables if they are not being ingested.

Version:

Additional context Slack thread: https://openmetadata.slack.com/archives/C02B6955S4S/p1727689640694269

datapopcorn commented 2 weeks ago

I found that the ingestion time increases with the number of partitions only when the include DDL option is enabled. If I disable this option, the ingestion time is the same regardless of the partition count. Can you check why including DDL would cause this issue?
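
For anyone who wants to confirm the difference, a small timing wrapper might help. This is a rough sketch under assumptions: it assumes the ingestion is started from the command line with `metadata ingest -c <config>`, and the two YAML file names in the comments are placeholders for configs that differ only in the include DDL setting.

```python
import subprocess
import time


def time_command(cmd):
    """Run a command to completion and return its wall-clock duration in seconds."""
    started = time.perf_counter()
    subprocess.run(cmd, check=True)  # raises CalledProcessError on a non-zero exit
    return time.perf_counter() - started


# Hypothetical usage; both config file names are placeholders:
# with_ddl = time_command(["metadata", "ingest", "-c", "with_ddl.yaml"])
# without_ddl = time_command(["metadata", "ingest", "-c", "without_ddl.yaml"])
# print(f"with DDL: {with_ddl:.1f}s, without DDL: {without_ddl:.1f}s")
```

Running both configs against the same schema at 1000 and 2000 partitions should show whether only the DDL-enabled run scales with the partition count.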