slingdata-io / sling-cli

Sling is a CLI tool that extracts data from a source storage/database and loads it in a target storage/database.
https://docs.slingdata.io
GNU General Public License v3.0
301 stars 16 forks source link

Consider Replication Level Pre-SQL & Post-SQL #231

Closed elielhaouzi closed 1 month ago

elielhaouzi commented 3 months ago

I noticed that when using the truncate mode for moving data between databases, there's an oversight in handling tables with foreign key constraints.

Here's an excerpt from the configuration:

defaults:
  mode: truncate
  object: "{stream_schema}.{stream_table}"

streams:
  public.store_user:
    sql: SELECT * FROM {stream_schema}.{stream_table} WHERE store_id = '{store_id}'

  public.users:

 env:
  SLING_ALLOW_EMPTY_TABLES: true
  store_id: "${store_id}"

The issue arises when the code attempts to truncate the store_user table and subsequently insert data into it, followed by a similar operation on the users table. Due to foreign key constraints, this operation fails.

Expected Behavior: Ideally, Sling should handle truncating and inserting data into tables with foreign key constraints seamlessly, ensuring data integrity is maintained throughout the process. This could involve temporarily disabling foreign key constraints, truncating tables in the correct order, or utilizing other strategies to ensure a smooth data transfer process.

I am running on PostgreSQL 15.5.

Also, when I add pre_sql: truncate {stream_schema}.{target_table} cascade; in defaults target_options, it fails on the insert of the data. insert or update on table "store_user" violates foreign key constraint "store_user_store_id_fkey"

flarco commented 3 months ago

Hey Eli, this seems like it's an order issue. Sling will respect the order in which you input your streams.

Can you try public.users first?

streams:
  public.users:

  public.store_user:
    sql: SELECT * FROM {stream_schema}.{stream_table} WHERE store_id = '{store_id}'

Now, if you're using public.*, I could see that not working because sling will order it alphabetically. The logic to determine the foreign key dependency order will have to be implemented.

elielhaouzi commented 3 months ago

Hey @flarco,

Thank you for your quick response.

it does not seem to work.

I created a simple way to recreate the case.

Database 1:

CREATE DATABASE database_1;

-- drop table user_comments;
-- DROP table users;

CREATE TABLE users
(
    id bigint NOT NULL,
    CONSTRAINT users_pkey PRIMARY KEY (id)
);

CREATE TABLE public.user_comments
(
    id bigint NOT NULL,
    user_id bigint NOT NULL,
    comment character varying(255),
    CONSTRAINT user_comment_pkey PRIMARY KEY (id),
    CONSTRAINT user_comments_user_id_fkey FOREIGN KEY (user_id)
        REFERENCES public.users (id) MATCH SIMPLE
        ON UPDATE NO ACTION
        ON DELETE NO ACTION
)

INSERT INTO users (id) VALUES
(1),
(2),
(3);

INSERT INTO user_comments (user_id, comment) VALUES
(1, 1, 'a'),
(2, 1, 'b'),
(3, 2, 'c'),
(4, 3, 'd');

Database 2:

CREATE DATABASE database_2;

-- drop table user_comments;
-- DROP table users;

CREATE TABLE users
(
    id bigint NOT NULL,
    CONSTRAINT users_pkey PRIMARY KEY (id)
);

CREATE TABLE public.user_comments
(
    id bigint NOT NULL,
    user_id bigint NOT NULL,
    comment character varying(255),
    CONSTRAINT user_comment_pkey PRIMARY KEY (id),
    CONSTRAINT user_comments_user_id_fkey FOREIGN KEY (user_id)
        REFERENCES public.users (id) MATCH SIMPLE
        ON UPDATE NO ACTION
        ON DELETE NO ACTION
)

INSERT INTO users (id) VALUES
(4),
(5),
(6);

INSERT INTO user_comments (id, user_id, comment) VALUES
(1, 4, 'a'),
(2, 4, 'b'),
(3, 5, 'c'),
(4, 6, 'd');

sling conns set PG_DATABASE_1 url="postgresql://postgres:postgres@127.0.0.1:5432/database_1?sslmode=disable" sling conns set PG_DATABASE_2 url="postgresql://postgres:postgres@127.0.0.1:5432/database_2?sslmode=disable"

and here is the replication-test.yaml

source: PG_DATABASE_1
target: PG_DATABASE_2

defaults:
  mode: truncate
  object: "{stream_schema}.{stream_table}"

streams:
  public.users:

  public.user_comments:
    sql: SELECT * FROM {stream_schema}.{stream_table}

sling run -d -r replication-test.yaml

I tried both orders and it failed.

flarco commented 3 months ago

Thanks for the details, will take a deeper look.

flarco commented 3 months ago

Grokking at the code a bit, I see what you mean. It's failing either way because it's a 2 way foreign-key reference. Yeah, sling would have to pre_sql/post_sql at replication level, instead of stream level with something like this: https://stackoverflow.com/a/49584660/2295355

🤔

Another thought, I see your tables have a PK. Why not use the incremental mode with PK only (omit the update key)? No need to truncate, it will just merge/upsert the full data from source into target table. See https://docs.slingdata.io/sling-cli/run/configuration#incremental-mode-strategies

source: PG_DATABASE_1
target: PG_DATABASE_2

defaults:
  mode: incremental
  primary_key: [id]
  object: "{stream_schema}.{stream_table}"

streams:
  public.users:

  public.user_comments:
    sql: SELECT * FROM {stream_schema}.{stream_table}
elielhaouzi commented 3 months ago

I did not use the incremental since I want the database_2 will be an exact copy of the database_1, taking into account that it is possible that I added data in database_2 which is not in database_1.

Yes, I also think the solution of a pre_sql/post_sql at a replication level will be good :-D

elielhaouzi commented 3 months ago

Much appreciation, @flarco! Sling is an incredible tool. I helps me remove my old bash scripts to a simple yaml file. It's remarkable to think that one individual is responsible for its creation. Thank you

flarco commented 3 months ago

Cool, will think about this.

For now, actually, as a workaround to emulate pre_sql/post_sql at replication level, you could just add pre_sql to the first stream and post_sql to the last stream. Only issue, is there is no guarantee the last post_sql will execute if the last replication fails. So you could set SET session_replication_role = 'replica'; or something like in 1st pre_sql and SET session_replication_role = 'origin'; in the last post_sql. However, not sure how this will handle since sling is opening/closing multiple transactions... Maybe ALTER TABLE <table> <DISABLE / ENABLE> TRIGGER ALL might work best.

source: PG_DATABASE_1
target: PG_DATABASE_2

defaults:
  mode: truncate
  object: "{stream_schema}.{stream_table}"

streams:
  public.users:
    target_options:
      pre_sql: SET session_replication_role = 'replica';

  public.user_comments:
    sql: SELECT * FROM {stream_schema}.{stream_table}
    target_options:
      post_sql: SET session_replication_role = 'origin';
elielhaouzi commented 3 months ago

Cool, I will try it.

I thought for now to run as part of a script a truncate table users_comments, users sql command before running the sling command. I think it will also work.

flarco commented 1 month ago

Closing this. After thinking about this, I think it's best to leave things the way they are at the moment in the name of simplicity. I may revisit this idea in the future.