ALEX is a ML-enhanced range index, similar in functionality to a B+ Tree. Our implementation is a near drop-in replacement for std::map or std::multimap. You can learn more about ALEX in our SIGMOD 2020 paper.
Getting Started
Design Overview
API Documentation
Contributing
ALEX can be used as a header-only library.
All relevant header files are found in src/core.
You will need to compile your program with at least the C++14 standard (e.g., -std=c++14
).
In this repository, we include three programs that you can compile and run:
On Windows, simply load this repository into Visual Studio as a CMake project. On Linux/Mac, use the following commands:
# Build using CMake, which creates a new build directory
./build.sh
# Build using CMake in Debug Mode, which creates a new build_debug directory and defines the macro ‘DEBUG'
./build.sh debug
# Run example program
./build/example
# Run unit tests
./build/test_alex
To run the benchmark on a synthetic dataset with 1000 normally-distributed keys:
./build/benchmark \
--keys_file=resources/sample_keys.bin \
--keys_file_type=binary \
--init_num_keys=500 \
--total_num_keys=1000 \
--batch_size=1000 \
--insert_frac=0.5
However, to observe the true performance of ALEX, we must run on a much larger dataset. You can download a 200M-key (1.6GB) dataset from Google Drive. To run one example workload on this dataset:
./build/benchmark \
--keys_file=[download location] \
--keys_file_type=binary \
--init_num_keys=10000000 \
--total_num_keys=20000000 \
--batch_size=1000000 \
--insert_frac=0.5 \
--lookup_distribution=zipf \
--print_batch_stats
You can also run this benchmark on your own dataset.
Your keys will need to be in either binary format or text format (one key per line).
If the data type of your keys is not double
, you will need to modify #define KEY_TYPE double
to
#define KEY_TYPE [your data type]
in src/benchmark/main.cpp.
The four datasets we used in our SIGMOD 2020 paper are publicly available (all in binary format):
Like the B+ Tree, ALEX is a data structure that indexes sorted data and supports workloads that contain a mix of point lookups, short range queries, inserts, updates, and deletes. Internally, ALEX uses a collection of linear regressions, organized hierarchically into a tree, to model the distribution of keys. ALEX uses this model to efficiently search for data records by their key. ALEX also automatically adapts its internal models and tree structure to efficiently support writes.
ALEX is inspired by the original learned index from Kraska et al.. However, that work only supports reads (i.e., point lookups and range queries), while ALEX also efficiently supports write (i.e., inserts, updates, and deletes).
In our paper, we show that ALEX outperforms alternatives in both speed and size:
You can find many more details about ALEX here.
ALEX currently operates in memory, single threaded, and on numerical keys. We are considering ways to add support for persistence, concurrency, and string keys to ALEX.
In terms of performance, ALEX has a couple of known limitations:
We provide three user-facing implementations of ALEX:
ALEX has a few important differences compared to its standard library equivalents:
Detailed API documentation can be found in our wiki.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project follows the Google C++ Style Guide.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.