microsoft / ALEX

A library for building an in-memory, Adaptive Learned indEX
MIT License
656 stars 112 forks source link

Introduction

ALEX is a ML-enhanced range index, similar in functionality to a B+ Tree. Our implementation is a near drop-in replacement for std::map or std::multimap. You can learn more about ALEX in our SIGMOD 2020 paper.

Table of Contents

Getting Started
Design Overview
API Documentation
Contributing

Getting Started

ALEX can be used as a header-only library. All relevant header files are found in src/core. You will need to compile your program with at least the C++14 standard (e.g., -std=c++14).

In this repository, we include three programs that you can compile and run:

On Windows, simply load this repository into Visual Studio as a CMake project. On Linux/Mac, use the following commands:

# Build using CMake, which creates a new build directory
./build.sh

# Build using CMake in Debug Mode, which creates a new build_debug directory and defines the macro ‘DEBUG'
./build.sh debug

# Run example program
./build/example

# Run unit tests
./build/test_alex

To run the benchmark on a synthetic dataset with 1000 normally-distributed keys:

./build/benchmark \
--keys_file=resources/sample_keys.bin \
--keys_file_type=binary \
--init_num_keys=500 \
--total_num_keys=1000 \
--batch_size=1000 \
--insert_frac=0.5

However, to observe the true performance of ALEX, we must run on a much larger dataset. You can download a 200M-key (1.6GB) dataset from Google Drive. To run one example workload on this dataset:

./build/benchmark \
--keys_file=[download location] \
--keys_file_type=binary \
--init_num_keys=10000000 \
--total_num_keys=20000000 \
--batch_size=1000000 \
--insert_frac=0.5 \
--lookup_distribution=zipf \
--print_batch_stats

You can also run this benchmark on your own dataset. Your keys will need to be in either binary format or text format (one key per line). If the data type of your keys is not double, you will need to modify #define KEY_TYPE double to #define KEY_TYPE [your data type] in src/benchmark/main.cpp.

Datasets

The four datasets we used in our SIGMOD 2020 paper are publicly available (all in binary format):

Design Overview

Like the B+ Tree, ALEX is a data structure that indexes sorted data and supports workloads that contain a mix of point lookups, short range queries, inserts, updates, and deletes. Internally, ALEX uses a collection of linear regressions, organized hierarchically into a tree, to model the distribution of keys. ALEX uses this model to efficiently search for data records by their key. ALEX also automatically adapts its internal models and tree structure to efficiently support writes.

ALEX is inspired by the original learned index from Kraska et al.. However, that work only supports reads (i.e., point lookups and range queries), while ALEX also efficiently supports write (i.e., inserts, updates, and deletes).

In our paper, we show that ALEX outperforms alternatives in both speed and size:

You can find many more details about ALEX here.

Limitations and Future Research Directions

ALEX currently operates in memory, single threaded, and on numerical keys. We are considering ways to add support for persistence, concurrency, and string keys to ALEX.

In terms of performance, ALEX has a couple of known limitations:

API Documentation

We provide three user-facing implementations of ALEX:

  1. AlexMap is a near drop-in replacement for std::map.
  2. AlexMultiMap is a near drop-in replacement for std::multimap.
  3. Alex is the internal implementation that supports both AlexMap and AlexMultimap. It exposes slightly more functionality.

ALEX has a few important differences compared to its standard library equivalents:

Detailed API documentation can be found in our wiki.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project follows the Google C++ Style Guide.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.