vinhnd20 commented 3 months ago

Abstract

Data loss on databases can truly be a nightmare for any business, as every bit of data could decide the success or failure of a business campaign or a project. Therefore, ensuring reliability, and high availability for databases is always a top priority in most enterprises. In recent years, DBaaS (Database as a Service) has become more popular than ever, offering a flexible and efficient approach to managing and operating databases. Additionally, many open-source solutions have been developed to help developers deploy DBaaS to their users, notably OpenStack Trove. OpenStack Trove is a key to deploying DBaaS with a robust cloud infrastructure and flexible data management features. However, despite offering data replication solutions, it still has limitations in ensuring the highest reliability and availability for the system. That's why we are looking towards a more advanced solution. In this presentation, we will introduce a new step in providing database services using OpenStack Trove. We will delve into the Semi-Synchronous Replication and High Availability Cluster models, and discuss how to effectively implement them on OpenStack Trove. This solution not only enhances reliability but also increases availability, ensuring data is not lost in the event of system failures. Additionally, it can be deployed across multiple zones or regions, supporting Disaster Recovery for your database system.

Keywords: DBaaS, OpenStack, Cloud computing, data loss, Disaster Recovery, High Availability.

I. Introduction

II. OpenStack

III. Replication , Semi-Synchronous Replication

IV. Implement

V. Test and evaluate

VI. Conclusion

vinhnd20 commented 3 months ago

Đặt vấn đề và đưa ra phương án giải quyết vấn đề. Cần phải có minh chứng, thử nghiệm.

vinhnd20 commented 3 months ago

Trước đoạn cuối trong Introduction thì nêu rõ vấn đề. Đoạn cuối thì đề xúât giải pháp.

vinhnd20 commented 3 months ago

Thứ 7 gửi thầy đến phần intro

crvt4722 commented 3 months ago

Abstract

Database as a Service (DBaaS) is a cloud computing service that allows users to access, manage, and utilize a cloud-based database system without having to purchase, install, configure hardware or software, and set up a management environment for operating the database on their systems (not to mention the real estate costs to provide space for hardware equipment). However, ensuring that user data is always safe and reliable even in the event of a database failure is a persistent issue for major cloud providers today. In this paper, we will explore a new advancement in the deployment of database services using the open-source OpenStack Trove, focusing on the Semi-Synchronous Replication Database Cluster deployment model and the automated disaster recovery feature (High Availability), ensuring that user data is lossless.

Keywords: DBaaS, OpenStack, Semisynchronous replication, High Availability, Disaster recovery, Data loss

I. Introduction

Data loss on databases can truly be a nightmare for any business, as every bit of data could decide the success or failure of a business campaign or a project. Therefore, ensuring reliability, and high availability for databases is always a top priority in most enterprises. In recent years, DBaaS (Database as a Service) has become more popular than ever, offering a flexible and efficient approach to managing and operating databases. Additionally, many open-source solutions have been developed to help developers deploy DBaaS to their users, notably OpenStack Trove. OpenStack Trove is a key to deploying DBaaS with a robust cloud infrastructure and flexible data management features. However, despite offering data replication solutions, it still has limitations in ensuring the highest reliability and availability for the system. That's why we are looking towards a more advanced solution.

In this presentation, we will introduce a new step in providing database services using OpenStack Trove. We will delve into the Semi-Synchronous Replication and High Availability Cluster models, and discuss how to effectively implement them on OpenStack Trove. This solution not only enhances reliability but also increases availability, ensuring data is not lost in the event of system failures. Additionally, it can be deployed across multiple zones or regions, supporting Disaster Recovery for your database system.

II. OpenStack

1. OpenStack architecture

OpenStack is an open-source cloud computing platform capable of supporting both public clouds and private clouds. This platform provides users with a scalable cloud computing infrastructure solution with many outstanding features.

Below is the basic architecture for deploying cloud computing services on OpenStack.

Below is the explanation of these components of the OpenStack architecture:

Horizon (Dashboard): Horizon is a module that provides users with a graphical user interface (GUI) to access, manage, and allocate resources on the OpenStack system.
Nova (Compute): Nova handles requests related to managing virtual machines (VMs) from users (create, delete, modify, etc.), collects resources such as RAM and CPU from the services it manages, and aggregates resources from other services, including Network, Volume, Image, etc., to create and monitor VMs.
Glance (Image): Glance creates OS images (Ubuntu, Windows, etc.), stores, and manages the deletion and modification of image metadata.
Neutron (Network): Neutron creates network groups (which means creating names for networks in a project) and within these network groups, multiple subnets with associated policies are created for VMs to connect to.
Cinder (Block Storage): Cinder creates block storage volumes to provide for VM creation. Since VMs require block storage to operate (at least for OS storage), the OS must be stored on block storage for the VM to boot.
Keystone (Identity): Keystone is the primary authentication service. All user requests to other services must be authenticated through Keystone. Users request a token from Keystone, which returns a token to the user and sends a copy to the service. The user then sends requests to the service with the token until the service accepts the request if it matches the token.
Swift (Object Storage): Swift provides object storage services that can operate independently (like Google Drive, Dropbox, etc.) or be integrated into VMs to provide storage.

This architecture outlines the fundamental components and their interactions within an OpenStack deployment, offering a robust and flexible cloud infrastructure.

2. DBaaS with OpenStack Trove

Trove is Database as a Service for OpenStack. It's designed to run entirely on OpenStack, with the goal of allowing users to quickly and easily utilize the features of a relational or non-relational database without the burden of handling complex administrative tasks. Cloud users and database administrators can provision and manage multiple database instances as needed. Initially, the service will focus on providing resource isolation at high performance while automating complex administrative tasks including deployment, configuration, patching, backups, restores, and monitoring.

Below is the architecture of the components of OpenStack Trove.

In addition to the basic components for deploying the OpenStack infrastructure layer (Nova, Cinder, Swift, Glance, Neutron, Keystone), OpenStack Trove has additional specific components to be able to deploy and operate the Database as a Service, including:

Trove API: Provides a Restful API to send requests for creating, managing, configuring, modifying, deleting, etc., Database Instances.
Message Bus: Acts as a Message Queue Broker to coordinate interactions and communications between the components of Trove.
Trove Taskmanager: Performs and preprocesses "heavy lifting" tasks such as creating, managing the lifecycle, and executing operations inside the Database Instance before sending requests to other components. Trove Taskmanager receives requests from the Trove API.
Guest Agent: A service running on the VM of the Database Instance, receiving requests from the Trove API or Trove Taskmanager to perform tasks such as creating, managing, modifying, and executing operations on the Database.
Trove Conductor: The Conductor is a service running on the server, responsible for receiving requests from the Guest Agent to update information about the Database Instance on the server. For example: Database status, backup information, log files, etc.

Currently, Trove provides data replication features for the types of databases it supports. However, its current model is Asynchronous Replication. Therefore, replication latency can occur during data synchronization between nodes, which can lead to data loss if the database system encounters an issue.

III. Semi-synchronous Replication

Database replication by default is asynchronous. The source writes events to its binary log or WAL and replicas request them when they are ready. The source does not know whether or when a replica has retrieved and processed the transactions, and there is no guarantee that any event ever reaches any replica. With asynchronous replication, if the source crashes, transactions that it has committed might not have been transmitted to any replica. Failover from source to replica in this case might result in failover to a server that is missing transactions relative to the source.

Alt text

With fully synchronous replication, when a source commits a transaction, all replicas have also committed the transaction before the source returns to the session that performed the transaction. Fully synchronous replication means failover from the source to any replica is possible at any time. The drawback of fully synchronous replication is that there might be a lot of delay to complete a transaction.

Semisynchronous replication falls between asynchronous and fully synchronous replication. The source waits until at least one replica has received and logged the events (the required number of replicas is configurable), and then commits the transaction. The source does not wait for all replicas to acknowledge receipt, and it requires only an acknowledgement from the replicas, not that the events have been fully executed and committed on the replica side. Semisynchronous replication therefore guarantees that if the source crashes, all the transactions that it has committed have been transmitted to at least one replica.

Compared to asynchronous replication, semisynchronous replication provides improved data integrity, because when a commit returns successfully, it is known that the data exists in at least two places. Until a semisynchronous source receives acknowledgment from the required number of replicas, the transaction is on hold and not committed.

Compared to fully synchronous replication, semisynchronous replication is faster, because it can be configured to balance your requirements for data integrity (the number of replicas acknowledging receipt of the transaction) with the speed of commits, which are slower due to the need to wait for replicas.

Alt text

IV. Implement

1. Implement Database Enterprise Cluster with 3 nodes

We have deployed an Enterprise Database cluster on 3 nodes configured with Semisynchronous replication, where one node is designated as the master: responsible for the primary read/write operations from the users. The other 2 nodes act as slave nodes: they can be used for read operations from the users and play a crucial role in ensuring that if a failure occurs on the master node, the system can continue to operate by switching over to one of the slave nodes.

2. Implement High Availability solution with automatic failover feature

In a distributed database system, ensuring high availability and fault tolerance is crucial for maintaining seamless operations. Failover processes are an integral part of this system, allowing the database cluster to continue functioning smoothly even in the event of a node failure.

Failover master

This part outlines the steps involved in performing a failover in case of the master node is crashed, ensuring that the most up-to-date replica is promoted to master status and the cluster is restored to normal operation without any data loss. Steps to follow are described below:

Choose the most up-to-date replica as the new master: To find the most up-to-date slave server, a failover script looks at the set of transactions received by the slave (GTID set or WAL position) and compares it with every other slave to find the one that has received the biggest set of transactions.
Promote the selected replica to be the new master: First, we remove the failed master from the cluster. Then, we remove the replica status of the chosen replica from step 1 and activate it as the new master node.
Update the DNS record: After successfully promoting the new master, the DNS record value of the cluster will switch from the old master’s IP to the new master’s IP.
Switch the remaining replica to the new master: Currently, the old replica is still receiving traffic from the failed master. Therefore, when the new master node is successfully activated, the remaining replica will be configured to receive traffic from the new master.
Restore the cluster to normal operation: To ensure the cluster operates normally again, we create a new replica to maintain a total of 3 operational nodes in the cluster by backing up the new master and creating a new replica from that backup.

After completing these steps, the cluster will return to normal operation, ensuring no data loss during the failover process.

Failover replica

In addition to handling the case of a master node failure, dealing with failed replicas is also crucial for ensuring the cluster's stability. The steps to handle a failed replica are described below:

First, we remove the failed replica node from the cluster because if the failed replica becomes operational again, the cluster will become a 4-node model, which can complicate system management.
To restore normal cluster operations, we create a new replica to ensure that there are three nodes operating normally within the cluster. This is done by backing up the new master and creating a new replica from that backup.

Note that in the case of a single replica node failure, the cluster can still operate stably with read and write operations using the remaining two nodes. User services will not be interrupted.

V. Test and evaluate

1. Testing Environment

To evaluate the semi-synchronous replication and high availability solutions using OpenStack Trove, tests were conducted in a controlled environment with the following setup: Database Cluster 3-node with OpenStack deployment and Semisynchronous Replication, Monitoring Tools.

2. Test Scenarios

To comprehensively assess the solution, various test scenarios were devised to simulate different types of failures and measure the system's response and recovery times. The key scenarios included: a. Master Node Failure

Simulating Master Node Failure: The master node was intentionally taken offline to simulate a failure.
Failover Process Observation: The automated failover process was monitored to ensure that one of the replica nodes was promoted to master.
Data Consistency Verification: A check was conducted to confirm that no data was lost during the failover process by comparing data across the new master and remaining replica nodes.
System Recovery Time Measurement: The time taken for the system to return to normal operation, including the creation of a new replica node, was recorded.

b. High Load

Simulation: Applied high read and write loads to the database cluster.
Performance Monitoring: Measured system performance, including response times and replication lag.
System Stability: Ensured the cluster remained operational without failures.
Recovery Process: Observed system behavior and recovery under sustained high load conditions.

3. Evaluation Metrics

The following metrics were used to evaluate the performance and reliability of the implemented solution:

Replication Lag: The time delay between data being committed on the master and replicated on the replicas.
Failover Time: The time taken for the system to detect a failure, promote a new master, and restore normal operations.
Data Consistency: Ensured that data remained consistent across all nodes before and after failover or recovery processes.

4. Test Results

The tests yielded the following results:

Replication Lag: The semi-synchronous replication ensured minimal replication lag, typically under 50 milliseconds, providing near real-time data consistency across nodes.
Failover Time: The automated failover process successfully promoted a new master within 30 seconds of detecting a master node failure. The system returned to full operational status, including the creation of a new replica, within 5 minutes.
Data Consistency: No data loss was observed during any of the failover or recovery processes. Data remained consistent across all nodes.

IV. Conclusion

The implementation of semi-synchronous replication and high availability features using OpenStack Trove significantly enhances the reliability and availability of database services. The automated failover and disaster recovery mechanisms ensure that user data remains safe and accessible even in the event of node failures or network issues. These advancements make OpenStack Trove a robust solution for deploying Database as a Service in a cloud environment, providing enterprises with a reliable and scalable database management platform.

By ensuring minimal replication lag, quick failover times, and consistent data integrity, this solution meets the high standards required for enterprise-grade database services. Future work could focus on further optimizing the failover processes and exploring additional replication models to enhance performance and reliability even further.

References

OpenStack Foundation. (2021). OpenStack Documentation. Retrieved from https://docs.openstack.org/
Oracle. (2020). MySQL 8.0 Reference Manual. Retrieved from https://dev.mysql.com/doc/refman/8.0/en/
Red Hat. (2020). High Availability Guide. Retrieved from https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/high_availability/
Google Cloud. (2021). Cloud SQL High Availability. Retrieved from https://cloud.google.com/sql/docs/mysql/high-availability

vinhnd20 commented 1 month ago

Các việc cần làm

[ ] DBaaS: Làm rõ hơn về tính hiệu quả, lợi ích
[x] Sửa lỗi chính tả, đồng bộ nội dung: Taskmanager, semisynchronous, semi-synchronous
[ ] Template: Sắp xếp và đánh số lại cho đúng chuẩn
[ ] Thêm tài liệu tham khảo: References đến các bài báo, sách
[ ] Vẽ lại các hình ảnh: Cho chỉnh chu, đẹp mắt
[x] Sửa lại đoạn Database Replication: Tránh đạo văn
[ ] Thêm kịch bản test hệ thống: Dưới tình trạng tải (work load) khác nhau, đưa ra con số chính xác, vẽ biểu đồ cột hoặc đồ thị cho dễ nhìn
[x] Thêm kịch bản và kết quả test: Cho trường hợp secondary fail
[ ] Giải thích kỹ hơn kết quả test: Bao gồm cả đồ thị
[x] Đồng bộ bài báo thành Tiếng Anh: Bao gồm cả thông tin tác giả
[x] Đổi tên tiêu đề của các Figure: Cho khoa học hơn

vinhnd20 / graduation_thesis

Enterprise Database Solution for DBaaS with OpenStack Trove: The Ironclad Data Fortress! #1

Abstract

I. Introduction

II. OpenStack

III. Replication , Semi-Synchronous Replication

IV. Implement

V. Test and evaluate

VI. Conclusion

Abstract

I. Introduction

II. OpenStack

1. OpenStack architecture

2. DBaaS with OpenStack Trove

III. Semi-synchronous Replication

IV. Implement

1. Implement Database Enterprise Cluster with 3 nodes

2. Implement High Availability solution with automatic failover feature

Failover master

Failover replica

V. Test and evaluate

1. Testing Environment

2. Test Scenarios

3. Evaluation Metrics

4. Test Results

IV. Conclusion

References

Các việc cần làm