toluaina / pgsync

Postgres to Elasticsearch/OpenSearch sync
https://pgsync.com
MIT License
1.14k stars 180 forks source link

Data is not synced with ElasticSearch when multiple schema and tables are defined in schema.json #185

Open nandeeshnaik opened 2 years ago

nandeeshnaik commented 2 years ago

PGSync version: latest

Postgres version: 13.3

Elasticsearch version: 7.14.1

Redis version: 6.2.5

Python version: 3.7

Problem Description:

I have set up the project from https://github.com/toluaina/pgsync. Given example of airbnb with users and few others tables data syncs properly with ElasticSearch. Any updates made in data are pushed to ElasticSearch. . With the same set up and configuration I have changed the postgres connection (in docker-compose) to my local postgres server. I wanted to check how PgSync will work with my multi-schema structure. Changes:

 environment:
      - PG_USER=postgres
      - PG_HOST=host.docker.internal
      - PG_PASSWORD=postgres

Now I changed the schema.json to point to my local database and table. Example schema.json

[
    {
        "database": "knowledge",
        "index": "cuddle_bauer_business_objects",
        "nodes": {
            "table": "business_objects",
            "schema":"cuddle_bauer",
            "columns": [
                "id",
                "type",
                "value",
                "label",
                "synonyms"
            ]
        }
    }
]    

With this configuration I'm able to get the data in ElasticSearch under index cuddle_bauer_business_objects. Any change in DB is reflecting in ElasticSearch.

Then I updated the shema.json to capture multiple tables in different schemas. I followed PgSync homepage (https://pgsync.com/tutorial/multiple-schemas-in-single-config/ ) for this.

[
    {
        "database": "knowledge",
        "index": "cuddle_bauer_business_objects",
        "nodes": {
            "table": "business_objects",
            "schema":"cuddle_bauer",
            "columns": [
                "id",
                "type",
                "value",
                "label",
                "synonyms"
            ]
        }
    },
    {
        "database": "knowledge",
        "index": "cuddle_dev_business_objects",
        "nodes": {
            "table": "business_objects",
            "schema":"cuddle_dev",
            "columns": [
                "id",
                "type",
                "value",
                "label",
                "synonyms"
            ]
        }
    },
    {
        "database": "knowledge",
        "index": "cuddle_lstmtesting_business_objects",
        "nodes": {
            "table": "business_objects",
            "schema":"cuddle_lstmtesting",
            "columns": [
                "id",
                "type",
                "value",
                "label",
                "synonyms"
            ]
        }
    }
]

After I make this change, initially data is loaded to ElasticSearch but change in table is not picked up by PgSync. I see the following logs in PgSync.


pgsync_1         | Syncing knowledge db: [0] => redis: [0] => elastic: [0]...
pgsync_1         | Syncing knowledge db: [0] => redis: [0] => elastic: [0]...
pgsync_1         | Syncing knowledge db: [0] => redis: [0] => elastic: [0]...
pgsync_1         | Syncing knowledge db: [0] => redis: [0] => elastic: [0]...
pgsync_1         | Syncing knowledge db: [0] => redis: [0] => elastic: [0]...
pgsync_1         | Syncing knowledge db: [0] => redis: [0] => elastic: [0]...
pgsync_1         | Syncing knowledge db: [0] => redis: [0] => elastic: [0]...
pgsync_1         | Syncing knowledge db: [0] => redis: [0] => elastic: [0]...
pgsync_1         | Syncing knowledge db: [0] => redis: [0] => elastic: [0]...
pgsync_1         | Syncing knowledge db: [0] => redis: [0] => elastic: [0]...
pgsync_1         | Syncing knowledge db: [0] => redis: [0] => elastic: [0]...
pgsync_1         | Syncing knowledge db: [0] => redis: [0] => elastic: [0]...
pgsync_1         | Syncing knowledge db: [0] => redis: [0] => elastic: [0]...
pgsync_1         | Syncing knowledge db: [0] => redis: [0] => elastic: [0]...
pgsync_1         | Syncing knowledge db: [0] => redis: [0] => elastic: [0]...
pgsync_1         | Syncing knowledge db: [0] => redis: [0] => elastic: [0]...
pgsync_1         | Syncing knowledge db: [0] => redis: [0] => elastic: [0]...
pgsync_1         | Syncing knowledge db: [0] => redis: [0] => elastic: [0]...
pgsync_1         | Syncing knowledge db: [0] => redis: [0] => elastic: [0]...
pgsync_1         | Syncing knowledge db: [0] => redis: [0] => elastic: [0]...
pgsync_1         | Syncing knowledge db: [0] => redis: [0] => elastic: [0]...
pgsync_1         | Syncing knowledge db: [0] => redis: [0] => elastic: [0]...
pgsync_1         | Syncing knowledge db: [0] => redis: [0] => elastic: [0]...
pgsync_1         | Syncing knowledge db: [0] => redis: [0] => elastic: [0]...
pgsync_1         | Syncing knowledge db: [0] => redis: [0] => elastic: [0]...
pgsync_1         | Syncing knowledge db: [0] => redis: [0] => elastic: [0]...

What is the issue here.? is there any error in the way multiple schemas are defined?

Note: Every table in a different schema has a primary Key.

Error Message (if any):

I don't see any error in PgSync, I think it is just not picking up updates when multiple tables are included. 
toluaina commented 2 years ago

Did you re-run bootstrap command after changing the schema.json? Also delete any checkpoint files starting with a dot

nandeeshnaik commented 2 years ago

Yes, I re-ran the bootstrap command after schema.json changes. New indexes are created in ElasticSearch and for the first time all the data are loaded in ElasticSearch, but the updates to database tables are not propagated to ElasticSearch. Here I have different schema names but table names are the same, could that be the reason for the issue?

enterpinamullah commented 2 years ago

same issue here!

liashchynskyi commented 2 years ago

@toluaina Any updates on this?

toluaina commented 2 years ago

Are you absolutely sure about this? Also can you confirm you re-ran bootstrap after changing the schema. i.e adding a new index/database pair. I am unable to reproduce this .

divit00 commented 2 years ago

This kind of issue happens because the checkpoint file is named as "<database name>_<index name>" so now if you are fetching multiple tables from same database and sending it to one elasticsearch index, the checkpoint file will be same for all. The first one will overwrite the checkpoint file to latest and rest of the tables wont be picked up.

There should be an option to set checkpoint file in schema.json or the format should be <database name>_<schema_name>_<table_name>_<index name>

toluaina commented 2 years ago

@divit00

I don't think this is correct. Each entry/node in the schema.json is unique and identified by its database name and index name. We index documents at the database level, not at the table level. In effect, a node's checkpoint file cannot affect another

divit00 commented 2 years ago

@toluaina

Imagine a situation where we want to sync table1 and table2 of database_masterto index_master index. Now, for both table1 and table2, there would be single checkpoint file since database and index is same.

Hence, I was suggesting option to having our own checkpoint file name or format like this <database name>_<schema_name>_<table_name>_<index name>.

Please correct me if I am wrong and checkpoint file also has some way to identify the tables.

toluaina commented 2 years ago

PGSync does not sync individual tables. Its syncs a collection of tables in a database. The schema.json only supports checkpoints at the database combined with the index. This is used in combination with the replication slot which is created with the database/index pair.

bxcodec commented 10 months ago

Hi, any update on this? It is still happening now on my side.