Book: Designing Data-Intensive Applications: Ch3 Storage and Retrieval

Book: Designing Data-Intensive Applications Chapter: 3 (Storage and Retrieval)

Summary:

This chapter deals with how databases handle storage and retrieval, including the mechanics of storing data in a database and querying the database for data.

As application developers, we need to select storage engines that are suitable for our applications. This chapter proposes that in order to squeeze out performance, we need to have a rough idea of the implementation of our selected database under the hood.

We are also introduced to 2 types of storage engines: log-structured storage engines and page-oriented storage engines; which are prefaced with relation to databases that we know well: SQL vs No-SQL.

Then, Klepmann introduces the tradeoff between writes and reads by comparing operations to a simple database which has 2 operations: set and get

set simply writes a pair of id and value to a .csv text file
get retrieves the value based on the id given

set is pretty fast, completed in O(1) time since writes are done to the end of the file. However, get is slow and completed in O(n) time in relation to the number of entries, since one has to go through the entire file to get the value.

By introducing indices, he proposes that we can speed up get at the cost of set. Queries are fast since we can get use indices to quickly get to the right place to retrieve the value, but set is affected since every write needs to then update the indices.

On a high level, we introduced to 2 ideas: storage engines optimised for transaction processing (OLTP), and storage engines optimised for analytics processing (OLAP). The differences between the access patterns are outlined as such:

OLTP are usually user-facing, and often have a high volume of requests. In order to handle the load, applications touch a small number of records in each query. The application requests records on some sort of key, and the engine uses indexing to find data for the key. These DBs are bound by disk seek time.
OLAP are usually for analytics and used in Data warehouses as they are used by business analysts. They handle a much smaller volume of requests, but each query may be very demanding, requiring millions of records to be scanned in a short time. Disk Bandwidth is not the bottleneck, and column-oriented storage is popularly used in these settings

On the side of OLTP:

Log-structured databases only permit appending to files and deleting obsolete files, but written files are not updated (through a process called compaction and merging). Examples of these are databases that use logs-structured merge trees (LSM Trees), such as Cassandra, Bitcask, Lucene and HBase
Page-oriented databases treat the disk as a set of fixed-size pages that can be overwritten, and B-Trees are the main examples of the data structures used.

In Log-structured databases, random-access writes are turned into sequential-writes, which allows for higher write throughput.

Kleppman then walks through the architecture of OLAP like Data Warehousing engines. With OLAP, the access patterns are different as a large number of records are scanned with only a few columns per record read, and aggregate statistics computed. Queries also often require sequential scanning across large number of rows, which make indices less relevant. Data compaction is more preferred, to minimize amount of data query needs to read from disk. Column-oriented storage helps with this goal.

nus-cs3281 / 2024

Book: Designing Data-Intensive Applications: Ch3 Storage and Retrieval #61

Summary: