readers never block
all data stored in a memory mapped file
structure is completely crash proof
extremely compact object code
runs on just about everything
raspberry pi, linux, android, macos, ios, solaris, windows, mainframes
allows both multiprocess and multithreaded access to a single database
the concurrency model is a single writer and multiple readers
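as a toy illustration (not lmdb's actual implementation), the single-writer / many-reader model can be sketched in python: writers serialize on one mutex, while readers take no locks at all and simply dereference the current immutable snapshot.

```python
import threading

class SingleWriterStore:
    """Sketch of single-writer / many-reader concurrency:
    writers serialize on one mutex; readers take no locks,
    they just dereference the current immutable snapshot."""

    def __init__(self):
        self._write_lock = threading.Lock()  # only writers contend here
        self._snapshot = {}                  # replaced wholesale, never mutated

    def get(self, key):
        # lock-free read: grab the current snapshot reference and use it
        return self._snapshot.get(key)

    def put(self, key, value):
        with self._write_lock:               # one writer at a time -> no deadlocks
            new = dict(self._snapshot)       # copy-on-write: never touch live data
            new[key] = value
            self._snapshot = new             # reference swap publishes the commit

store = SingleWriterStore()
store.put("k", 1)
print(store.get("k"))  # -> 1
```

because readers never take the mutex, any number of them can run concurrently with the single writer; with only one writer there is nothing to deadlock against.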
since reads don't take any locks,
readers scale perfectly linearly to arbitrarily many cpus
because it has a single writer design, it has no deadlocks
it provides serializable isolation
supports nested transactions
supports batched writes (a batch is basically just another transaction)
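the nested-transaction idea can be sketched like this (names and structure are illustrative only, not lmdb's api): a child transaction buffers its writes, commit folds them into the parent, and abandoning the child discards them.

```python
class Txn:
    """Toy sketch of nested transactions: a child buffers writes;
    commit folds them into the parent, abandonment discards them."""

    def __init__(self, parent_data=None):
        self._parent = None
        self._data = dict(parent_data or {})  # private working copy

    def child(self):
        t = Txn(self._data)
        t._parent = self
        return t

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def commit(self):
        if self._parent is not None:
            self._parent._data = self._data   # fold child writes into parent

root = Txn()
root.put("a", 1)
sub = root.child()
sub.put("b", 2)
sub.commit()           # b becomes visible to the parent
print(root.get("b"))   # -> 2
```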
because we never overwrite live data
the structure on disk can never be corrupted by a system crash
if you crash, you come right back up
and everything is ready to run
concept of single level store - you treat memory and disk as if they are the same
by using a memory mapped file
when you fetch data from the database
reads come directly out of the mmap
it's completely zero copy
optionally, we can do writes directly to the memory map
but that's not the default behaviour
when you do write directly to the memory map
that also means there's zero copy for your writes
there's no memory buffering in the application
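the zero-copy read path can be sketched with python's stdlib mmap module (a stand-in for lmdb's C-level behaviour, not its actual api): a memoryview over the mapping is just an offset and a length into the mapped pages, so no application buffer is filled until you explicitly materialize bytes.

```python
import mmap
import os
import tempfile

# build a small data file to map
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"hello, single level store")

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mm)       # zero-copy window into the mapped pages
    record = view[0:5]          # still zero-copy: just an offset + length
    first_five = bytes(record)  # the only copy happens here, on demand
    record.release()            # release exported views before closing
    view.release()
    mm.close()

print(first_five)               # -> b'hello'
```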
because we are using the memory map:
1. everything relies on the file system cache provided by the os
2. you can actually store live objects (programming language objects) directly in the database and use them directly
--
one thing we do have to configure with lmdb is a max database size
that just sets the size of the mmap that we're going to use
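in the C api this is mdb_env_set_mapsize(); conceptually it just fixes how large a mapping is requested. a stdlib sketch of why a big reservation is cheap: on most filesystems the backing file can be sparse, so only touched pages actually consume disk.

```python
import mmap
import os
import tempfile

# the "max database size" knob just decides how big a mapping we ask
# the os for; the file can be sparse, so reserving a large map does
# not consume that much disk up front.
MAP_SIZE = 64 * 1024 * 1024          # 64 MiB address-space reservation

path = os.path.join(tempfile.mkdtemp(), "db.bin")
with open(path, "wb") as f:
    f.truncate(MAP_SIZE)             # sparse file: size without allocation

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), MAP_SIZE)  # map the whole reserved range
    mm[0:5] = b"root!"               # only the touched pages get backed
    mm.flush()
    mm.close()

print(os.path.getsize(path))         # -> 67108864: the full reserved size
```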
-
this is a test using 100-byte records
lmdb is several times faster for small records
this is in memory
pay attention to the size of the domain objects
the left side is sequential reads
and the right side is random reads
sequential reads are much faster than random reads
this is the same test using 100KiB of values
y axis is ops/s
this is directly a result of our zero-copy reads:
because we do zero copy, we can do very fast operations, independent of the data size,
while all of the other databases slow down as your record sizes increase
audience: so you're saying that a 100KiB read that lands in the memory of a process on the same node, can be done in essentially zero time by mapping? yes
because if i turn that into bytes/second moving off the machine, it wouldn't be possible.
different filesystems have different impacts on your database performance
obviously, the type of storage medium (HDD vs SSD) will have a huge impact
for sequential writes, which is the most difficult workload for us,
the fastest is JFS,
second to JFS is actually ext2,
and ntfs is pretty bad
data fetches are satisfied by direct reference to the memory map; there is no intermediate page or buffer cache
the single level store approach only works if all your data fits in your address space.
in lmdb, we use a read-only memory map by default
one of the reasons to do that is because
it will protect you from bugs in your code
if you have a writable memory map
and all of your datastructures are living in this thing
and someone does a stray write over memory,
your database will be corrupted
and you won't even know when or where it happened
you can of course use a writable map
i suggest you only do this after your app has been well tested
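the protection can be demonstrated with a stdlib read-only mapping (python raises TypeError where stray C code would take a segfault, but the principle is the same: the os-level mapping rejects the write instead of silently corrupting the database).

```python
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "db.bin")
with open(path, "wb") as f:
    f.write(b"precious data")

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        mm[0:1] = b"X"               # a stray write into the mapping...
        corrupted = True
    except TypeError:                # ...is rejected by the read-only map
        corrupted = False
    mm.close()

print(corrupted)  # -> False: the stray write was caught, not absorbed
```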
we do copy on write
so that we never modify anything that's in use
also, readers always see a completely isolated consistent snapshot
imagine lmdb actually writing one transaction in its b+ tree: it builds a new tree containing the changes, bottom-up, while readers are still being directed through the previous root node. if the transaction completes successfully, it will have constructed a new root node, and from that moment on readers go through this new root node into whatever part of the tree they wish to read.
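the root-swap idea can be sketched with a path-copying binary search tree (a deliberate simplification of lmdb's b+ tree, with hypothetical names): an insert rebuilds only the nodes on the path to the change and returns a brand-new root, while readers keep using whatever root they pinned.

```python
class Node:
    __slots__ = ("key", "value", "left", "right")
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value, self.left, self.right = key, value, left, right

def insert(node, key, value):
    """Copy-on-write insert: rebuild only the path from the changed
    leaf up to a new root; every untouched subtree is shared."""
    if node is None:
        return Node(key, value)
    if key < node.key:
        return Node(node.key, node.value, insert(node.left, key, value), node.right)
    if key > node.key:
        return Node(node.key, node.value, node.left, insert(node.right, key, value))
    return Node(key, value, node.left, node.right)

def get(node, key):
    while node is not None:
        if key == node.key:
            return node.value
        node = node.left if key < node.key else node.right
    return None

root = None
for k in (5, 2, 8):
    root = insert(root, k, str(k))

reader_root = root             # a reader pins the current root...
root = insert(root, 2, "new")  # ...while the writer publishes a new one

print(get(reader_root, 2))     # -> '2'   (old snapshot, untouched)
print(get(root, 2))            # -> 'new' (new root sees the change)
```

nothing reachable from the old root is ever modified, which is exactly why a crash mid-write leaves the previous committed tree intact.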