toluaina / pgsync

Postgres to Elasticsearch/OpenSearch sync
https://pgsync.com
MIT License
1.1k stars 172 forks source link
change-data-capture elasticsearch elasticsearch-sync etl kibana opensearch postgresql python sql

PGSync

PyPI version Build status Python versions Downloads codecov

PostgreSQL to Elasticsearch/OpenSearch sync

PGSync is a middleware for syncing data from Postgres to Elasticsearch/OpenSearch effortlessly. It allows you to keep Postgres as your source of truth and expose structured denormalized documents in Elasticsearch/OpenSearch.

Changes to nested entities are propagated to Elasticsearch/OpenSearch. PGSync's advanced query builder then generates optimized SQL queries on the fly based on your schema. PGSync's advisory model allows you to quickly move and transform large volumes of data quickly whilst maintaining relational integrity.

Simply describe your document structure or schema in JSON and PGSync will continuously capture changes in your data and load it into Elasticsearch/OpenSearch without writing any code. PGSync transforms your relational data into a structured document format.

It allows you to take advantage of the expressive power and scalability of Elasticsearch/OpenSearch directly from Postgres. You don't have to write complex queries and transformation pipelines. PGSync is lightweight, flexible and fast.

Elasticsearch/OpenSearch is more suited as as secondary denormalised search engine to accompany a more traditional normalized datastore. Moreover, you shouldn't store your primary data in Elasticsearch/OpenSearch.

So how do you then get your data into Elasticsearch/OpenSearch in the first place? Tools like Logstash and Kafka can aid this task but they still require a bit of engineering and development.

Extract Transform Load and Change data capture tools can be complex and require expensive engineering effort.

Other benefits of PGSync include:

Why?

At a high level, you have data in a Postgres database and you want to mirror it in Elasticsearch/OpenSearch.
This means every change to your data (Insert, Update, Delete and Truncate statements) needs to be replicated to Elasticsearch/OpenSearch. At first, this seems easy and then it's not. Simply add some code to copy the data to Elasticsearch/OpenSearch after updating the database (or so called dual writes). Writing SQL queries spanning multiple tables and involving multiple relationships are hard to write. Detecting changes within a nested document can also be quite hard. Of course, if your data never changed, then you could just take a snapshot in time and load it into Elasticsearch/OpenSearch as a one-off operation.

PGSync is appropriate for you if:

How it works

PGSync is written in Python (supporting version 3.8 onwards) and the stack is composed of: Redis, Elasticsearch/OpenSearch, Postgres, and SQlAlchemy.

PGSync leverages the logical decoding feature of Postgres (introduced in PostgreSQL 9.4) to capture a continuous stream of change events. This feature needs to be enabled in your Postgres configuration file by setting in the postgresql.conf file:

wal_level = logical

You can select any pivot table to be the root of your document.

PGSync's query builder builds advanced queries dynamically against your schema.

PGSync operates in an event-driven model by creating triggers for tables in your database to handle notification events.

This is the only time PGSync will ever make any changes to your database.

NOTE: If you change the structure of your PGSync's schema config, you would need to rebuild your Elasticsearch/OpenSearch indices. There are plans to support zero-downtime migrations to streamline this process.

Quickstart

There are several ways of installing and trying PGSync

Running in Docker

To startup all services with docker. Run:

$ docker-compose up

Show the content in Elasticsearch/OpenSearch

$ curl -X GET http://[Elasticsearch/OpenSearch host]:9201/reservations/_search?pretty=true
Manual configuration

Features

Key features of PGSync are:

Requirements

Example

Consider this example of a Book library database.

Book

isbn (PK) title description
9785811243570 Charlie and the chocolate factory Willy Wonka’s famous chocolate factory is opening at last!
9788374950978 Kafka on the Shore Kafka on the Shore is a 2002 novel by Japanese author Haruki Murakami.
9781471331435 1984 1984 was George Orwell’s chilling prophecy about the dystopian future.

Author

id (PK) name
1 Roald Dahl
2 Haruki Murakami
3 Philip Gabriel
4 George Orwell

BookAuthor

id (PK) book_isbn author_id
1 9785811243570 1
2 9788374950978 2
3 9788374950978 3
4 9781471331435 4

With PGSync, we can simply define this JSON schema where the book table is the pivot. A pivot table indicates the root of your document.

{
    "table": "book",
    "columns": [
        "isbn",
        "title",
        "description"
    ],
    "children": [
        {
            "table": "author",
            "columns": [
                "name"
            ]
        }
    ]
}

To get this document structure in Elasticsearch/OpenSearch

[
  {
      "isbn": "9785811243570",
      "title": "Charlie and the chocolate factory",
      "description": "Willy Wonka’s famous chocolate factory is opening at last!",
      "authors": ["Roald Dahl"]
  },
  {
      "isbn": "9788374950978",
      "title": "Kafka on the Shore",
      "description": "Kafka on the Shore is a 2002 novel by Japanese author Haruki Murakami",
      "authors": ["Haruki Murakami", "Philip Gabriel"]
  },
  {
      "isbn": "9781471331435",
      "title": "1984",
      "description": "1984 was George Orwell’s chilling prophecy about the dystopian future",
      "authors": ["George Orwell"]
  }
]

Behind the scenes, PGSync is generating advanced queries for you such as.

SELECT 
       JSON_BUILD_OBJECT(
          'isbn', book_1.isbn, 
          'title', book_1.title, 
          'description', book_1.description,
          'authors', anon_1.authors
       ) AS "JSON_BUILD_OBJECT_1",
       book_1.id
FROM book AS book_1
LEFT OUTER JOIN
  (SELECT 
          JSON_AGG(anon_2.anon) AS authors,
          book_author_1.book_isbn AS book_isbn
   FROM book_author AS book_author_1
   LEFT OUTER JOIN
     (SELECT 
             author_1.name AS anon,
             author_1.id AS id
      FROM author AS author_1) AS anon_2 ON anon_2.id = book_author_1.author_id
   GROUP BY book_author_1.book_isbn) AS anon_1 ON anon_1.book_isbn = book_1.isbn

You can also configure PGSync to rename attributes via the schema config e.g

  {
      "isbn": "9781471331435",
      "this_is_a_custom_title": "1984",
      "desc": "1984 was George Orwell’s chilling prophecy about the dystopian future",
      "contributors": ["George Orwell"]
  }

PGSync addresses the following challenges:

Benefits

Contributing

Contributions are very welcome! Check out the Contribution Guidelines for instructions.

Credits

License

This project is licensed under the terms of the MIT license. Please see LICENSE for more details.

You should have received a copy of the MIT License along with PGSync.
If not, see https://opensource.org/license/mit/.