Requirements (as listed in the DBLog paper; I believe they are also applicable to a wider scope of use cases):
trigger the full state capture at any point in time. The full state may be needed not only initially but also at any time afterwards, for instance if the database is restored from a backup, or for repairs if there is data loss or corruption downstream. There are also cases where only a subset of the data needs to be repaired, for example when a specific set of rows has been identified as corrupt downstream.
pause or resume at any time so that, for large tables, full state capture does not need to start from the beginning after the process restarts.
capture transaction log events and the full state side by side without stalling one or the other. There are use cases that require high availability of transaction log events so that the replication lag to the source is kept to a minimum.
prevent time-travel, by preserving the order of history when transmitting events to a derived datastore. This way an earlier version of a row (like the residential address of a member account) is not delivered after a later version. Hence, a solution had to combine transaction log events and the full state in a way that preserves the history of changes.
offer this as a platform. Hence it was crucial to minimize the impact on the source database; otherwise this can hinder adoption of the platform, especially for use cases that have high traffic. In that regard we want to avoid primitives such as table locks which can block application write traffic.
function across a variety of Relational Database Management Systems (RDBMS), such as MySQL, PostgreSQL, Aurora, etc., that we use in production. To achieve that, we wanted to avoid using vendor-specific features.
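The paper's answer to the side-by-side and ordering requirements is a watermark-based algorithm: log processing is briefly paused, a low watermark is written to a dedicated table (so it shows up as an ordinary log event), a chunk of rows is selected, a high watermark is written, and any chunk row that a log event modified between the two watermarks is discarded so the newer log event wins. The sketch below is a simplified in-memory illustration under those assumptions; the function and the `"_wm"` marker convention are invented for this example, not the paper's implementation.

```python
def interleave_chunk(log_events, chunk_rows, low_wm, high_wm):
    """Merge one full-state chunk into a change-event stream.

    log_events: ordered list of (key, payload) change events; watermark
                writes appear inline as ("_wm", low_wm) / ("_wm", high_wm).
    chunk_rows: dict key -> payload, selected between the two watermark writes.
    Returns the output stream with surviving chunk rows emitted at the high
    watermark; rows updated by a log event inside the window are dropped,
    so an older row version is never delivered after a newer one.
    """
    output = []
    window = None            # keys seen between low and high watermark
    pending = dict(chunk_rows)
    for key, payload in log_events:
        if key == "_wm" and payload == low_wm:
            window = set()   # start tracking in-window modifications
            continue
        if key == "_wm" and payload == high_wm:
            for k in window:             # newer log events supersede chunk rows
                pending.pop(k, None)
            output.extend(("chunk", k, v) for k, v in sorted(pending.items()))
            window = None
            continue
        if window is not None:
            window.add(key)
        output.append(("log", key, payload))
    return output
```

Because untouched chunk rows cannot have changed inside the window, emitting them at the high watermark keeps the history of every key consistent, which is exactly the anti-time-travel guarantee above.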
The original DBLog paper highlights 3 typical use cases:
Heterogeneous data replication: In order to keep track of movie productions, it is crucial to search across all data related to movies. This involves data managed by separate teams, each of which owns distinct business entities such as episodes, talents, and deals. These services use MySQL or PostgreSQL in AWS RDS to store their data. DBLog is deployed to each of the involved datastores and captures the full data set and real-time changes into an output stream. The streams are then joined and ingested into a common search index in ElasticSearch, providing search across all involved entities.
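The join step above can be sketched as merging per-service change events into one search document per movie. This is a hypothetical illustration; the field names and the shared `movie_id` key are invented, and the real pipeline writes the joined documents into the ElasticSearch index rather than returning a dict.

```python
def join_streams(events):
    """Fold change events from several per-service DBLog output streams
    into one combined document per movie.

    events: iterable of (movie_id, entity_type, payload) tuples.
    Returns a dict movie_id -> combined document; for each entity type
    the last event wins, mirroring upsert semantics in a search index.
    """
    docs = {}
    for movie_id, entity_type, payload in events:
        doc = docs.setdefault(movie_id, {})
        doc[entity_type] = payload
    return docs
```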
Database activity logging: DBLog is also used to log database activity, so that it can be used to inspect what kind of changes occur on the database. In this scenario, changed rows are captured and delivered to a stream. A stream processor then propagates the events to ElasticSearch (for short-term storage) and Hive (for long-term storage). Kibana is used on top of ElasticSearch to build activity dashboards so that teams can inspect the number of operations per table. This is used to inspect data mutation patterns and can be crucial for detecting unexpected patterns, such as a drop in inserts to a table after new service code with a bug is rolled out.
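The aggregation behind such a dashboard is a simple count of operations grouped by table. A minimal sketch, with the event shape assumed for illustration:

```python
from collections import Counter

def ops_per_table(events):
    """events: iterable of (table, operation) pairs, e.g. ("members", "insert").
    Returns a Counter keyed by (table, operation), i.e. the per-table
    activity counts a dashboard would chart over a time window."""
    return Counter((table, op) for table, op in events)
```

A sudden drop in a counter such as `("members", "insert")` between windows is the kind of anomaly the dashboards surface.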
Schema migration: A team migrates one MySQL database to another, where the second database uses a new table structure. DBLog is deployed on the old database to capture both the full state and new changes as they occur, and writes them to a stream. A Flink job then consumes that data, transforms it to the new table schema format, and writes it into the new database. This way, reads against the new database can be verified upfront by running them on the populated new schema, while writes still go to the old schema. In a follow-up step, write traffic can also be directed to the new schema and traffic on the old database can be stopped.
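The transform step of such a Flink job amounts to a per-row mapping from the old layout to the new one. The sketch below uses invented column names (a single `name` column split into `first_name`/`last_name`) purely to illustrate the shape of that mapping; the actual schemas are whatever the migrating team defines.

```python
def to_new_schema(old_row):
    """Map a row from the hypothetical old layout to the new one.

    Old schema stores the full name in one column; the new schema
    splits it into first and last name. Other columns pass through.
    """
    first, _, last = old_row["name"].partition(" ")
    return {
        "first_name": first,
        "last_name": last,
        "email": old_row["email"],
    }
```

Because the stream carries both the full state and live changes, applying this mapping to every event keeps the new database continuously populated while it is being verified.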