nus-cs3281 / 2024


Book: Designing Data-Intensive Applications: Ch3 Storage and Retrieval #61

Open vigneshsankariyer1234567890 opened 3 months ago

vigneshsankariyer1234567890 commented 3 months ago

Book: Designing Data-Intensive Applications Chapter: 3 (Storage and Retrieval)

Summary:

This chapter deals with how databases handle storage and retrieval, including the mechanics of storing data in a database and querying the database for data.

As application developers, we need to select storage engines that are suitable for our applications. This chapter proposes that in order to squeeze out performance, we need to have a rough idea of the implementation of our selected database under the hood.

We are also introduced to two families of storage engines: log-structured storage engines and page-oriented storage engines (such as B-trees). These are framed in relation to the kinds of databases we already know well: SQL and NoSQL.

Then, Kleppmann introduces the tradeoff between writes and reads using a simple database that has only two operations: set and get.

set is fast: it completes in O(1) time, since each write is appended to the end of the file. However, get is slow: it takes O(n) time in the number of entries, since it has to scan the entire file to find the most recent value for the key.
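To make the cost difference concrete, here is a minimal Python sketch of such a database. The class name `AppendOnlyLog`, the `key,value` line format, and the file name are assumptions made for illustration (the book's own example is a pair of shell functions), and the sketch assumes keys contain no commas and values contain no newlines.

```python
import os
import tempfile


class AppendOnlyLog:
    """Toy key-value store: every set appends a line, every get scans the file."""

    def __init__(self, path):
        self.path = path
        # Create the file if it does not exist yet.
        open(self.path, "a", encoding="utf-8").close()

    def set(self, key, value):
        # O(1): append one "key,value" record to the end of the file.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(f"{key},{value}\n")

    def get(self, key):
        # O(n): scan every record. The last match wins, because older
        # records for the same key are never overwritten.
        result = None
        with open(self.path, "r", encoding="utf-8") as f:
            for line in f:
                k, _, v = line.rstrip("\n").partition(",")
                if k == key:
                    result = v
        return result


def demo_log():
    path = os.path.join(tempfile.mkdtemp(), "database.log")
    db = AppendOnlyLog(path)
    db.set("42", "San Francisco")
    db.set("7", "London")
    db.set("42", "Singapore")  # newer record shadows the older one
    return db.get("42"), db.get("7"), db.get("missing")
```

Note that the old record for key 42 stays in the file; get simply returns the latest one it sees during the scan.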

By introducing indices, he proposes that we can speed up get at the cost of set. Queries are fast since we can use an index to jump straight to the right place and retrieve the value, but set becomes slower since every write must also update the indices.
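One way to picture this tradeoff is an in-memory hash index that maps each key to the byte offset of its latest record in the log. The sketch below is illustrative only: `IndexedLog` and its record format are assumed names, and the index is not rebuilt if the process restarts.

```python
import os
import tempfile


class IndexedLog:
    """Append-only log plus an in-memory hash index of key -> byte offset."""

    def __init__(self, path):
        self.path = path
        self.index = {}
        open(self.path, "ab").close()

    def set(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()  # where this record will start
            f.write(record)
        # Extra work on every write: keep the index in step with the log.
        self.index[key] = offset

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        # Jump straight to the record instead of scanning the whole file.
        with open(self.path, "rb") as f:
            f.seek(offset)
            line = f.readline().decode("utf-8").rstrip("\n")
        _, _, value = line.partition(",")
        return value


def demo_index():
    path = os.path.join(tempfile.mkdtemp(), "database.log")
    db = IndexedLog(path)
    db.set("42", "San Francisco")
    db.set("7", "London")
    db.set("42", "Singapore")
    return db.get("42"), db.get("7"), db.get("missing"), len(db.index)
```

Here get does one seek and one read regardless of file size, while set pays for one extra dictionary update per write.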

At a high level, we are introduced to two ideas: storage engines optimised for transaction processing (OLTP), and storage engines optimised for analytic processing (OLAP). The differences between their access patterns are outlined below:

  1. OLTP systems are usually user-facing, and often see a high volume of requests. To handle the load, applications touch only a small number of records in each query. The application requests records by some sort of key, and the engine uses an index to find the data for that key. Disk seek time is often the bottleneck for these databases.
  2. OLAP systems are usually used for analytics, typically in data warehouses queried by business analysts. They handle a much smaller volume of requests, but each query may be very demanding, requiring millions of records to be scanned in a short time. Disk bandwidth (rather than seek time) is often the bottleneck, and column-oriented storage is popular in these settings.

On the side of OLTP:

In log-structured databases, random-access writes are turned into sequential writes, which allows for higher write throughput.
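A rough sketch of how writes to scattered keys can become sequential writes: buffer incoming writes in memory, then flush them as one sorted, immutable segment. This loosely follows the chapter's description of SSTables and LSM-trees, but the class `TinyLSM`, the `memtable_limit` parameter, and the use of Python lists in place of on-disk segment files are all assumptions made for illustration. Compaction and crash recovery are left out.

```python
import bisect


class TinyLSM:
    """Toy log-structured store: in-memory buffer plus sorted, immutable segments."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.memtable = {}   # recent writes, in any key order
        self.segments = []   # oldest first; each is a sorted list of (key, value)

    def set(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Stand-in for one sequential write of a sorted run to disk.
        # Existing segments are never modified.
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Search newest segment first so recent values shadow older ones.
        for segment in reversed(self.segments):
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None


def demo_lsm():
    db = TinyLSM(memtable_limit=2)
    db.set("b", "1")
    db.set("a", "2")   # buffer is full, so it is flushed as a sorted segment
    db.set("b", "3")
    db.set("c", "4")   # second flush; "b" now appears in two segments
    return db.get("b"), db.get("a"), db.get("z"), len(db.segments)
```

The writes arrive in arbitrary key order, yet each flush emits a single sorted run, which is the property that lets a real engine write sequentially.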

Kleppmann then walks through the architecture of OLAP systems such as data warehousing engines. With OLAP, the access patterns are different: a large number of records are scanned, only a few columns are read per record, and aggregate statistics are computed. Queries often require sequential scans across a large number of rows, which makes indices less relevant. Instead, it becomes more important to encode data compactly, to minimise the amount of data a query needs to read from disk. Column-oriented storage helps with this goal.
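The following sketch shows why a column layout suits this access pattern. The sales data, the column names, and the helper functions are made up for illustration, and run-length encoding is used as one simple example of compact encoding; it is not meant to represent how any particular warehouse stores data.

```python
from itertools import groupby

# Hypothetical fact table, stored row by row.
SALES_ROWS = [
    {"date": "2024-01-01", "product": "apple", "quantity": 3},
    {"date": "2024-01-01", "product": "apple", "quantity": 5},
    {"date": "2024-01-01", "product": "pear", "quantity": 2},
    {"date": "2024-01-02", "product": "pear", "quantity": 4},
    {"date": "2024-01-02", "product": "pear", "quantity": 1},
]


def to_columns(rows):
    """Pivot rows into one list per column, keeping row order."""
    columns = {name: [] for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def total_quantity(columns, product):
    """Aggregate using only the two columns the query needs."""
    return sum(
        qty
        for name, qty in zip(columns["product"], columns["quantity"])
        if name == product
    )


COLUMNS = to_columns(SALES_ROWS)
APPLE_TOTAL = total_quantity(COLUMNS, "apple")
PRODUCT_RUNS = run_length_encode(COLUMNS["product"])
```

The aggregate never touches the date column, and a column with long runs of repeated values shrinks to a handful of pairs, so less data has to be read for the same query.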