sidnt / lmdz


hyc @ cmu #31

Open sidnt opened 4 years ago

sidnt commented 4 years ago

readers never block. all data is stored in a memory-mapped file. the on-disk structure is completely crash-proof. extremely compact object code

runs on just about everything: raspberry pi, linux, android, macos, ios, solaris, windows, mainframes

allows both multiprocess and multithreaded access to a single database. the concurrency model is a single writer and multiple readers

since reads don't take any locks, readers scale to arbitrarily many cpus — readers scale perfectly linearly with available cpus

because it has a single-writer design, it has no deadlocks, and it provides serializable isolation

supports nested transactions. supports batched writes (basically just another transaction)


because we never overwrite live data, the structure on disk can never be corrupted by a system crash. you crash, you come right back up, and everything is ready to run


concept of a single-level store — you treat memory and disk as if they were the same by using a memory-mapped file. when you fetch data from the database, reads come directly out of the mmap; it's completely zero-copy

optionally, we can do writes directly to the memory map. that's not the default behaviour, but when you do write directly to the memory map, there's zero copy for your writes too — there's no memory buffering in the application

because we are using the memory map: 1. everything relies on the file system cache provided by the os; 2. (!) you can actually store live objects — programming-language objects — directly in the database and use them directly

--

one thing we do have to configure with lmdb is a max database size — that just sets the size of the mmap that we're gonna use

- image

this is a test using .. 100-byte records. several times faster for small records. this is in-memory

pay attention to the size of domain objects

the left side is sequential reads and the right side is random reads

sequential reads are much faster than random reads


image

this is the same test using 100KiB values; y axis is ops/s

now this is a direct result of our zero-copy reads. because we do zero copy, we can do very fast operations independent of the data size — and all of the other databases slow down as your record sizes increase

audience: so you're saying that a 100KiB read that lands in the memory of a process on the same node can be done in essentially zero time by mapping? because if i turned that into bytes/second moving off the machine, it wouldn't be possible. — yes.


different filesystems have different impacts on your database performance. obv, the type of storage medium will have a huge impact: HDD vs SSD

for sequential writes, which is the most difficult workload for us, the fastest is JFS; ntfs is pretty bad; second to JFS is actually ext2


data fetches are satisfied by direct reference to the memory map; there is no intermediate page or buffer cache

the single-level store only works if all your data fits in your address space.


in lmdb, we use a readonly memory map by default. one of the reasons to do that is that it will protect you from bugs in your code

if you have a writable memory map, and all of your data structures are living in this thing, and someone does a stray write over memory, your database will be corrupted — and you won't even know when or where it happened

you can of course use a writable map; i suggest you only do this after your app has been well tested


we do copy-on-write so that we never modify anything that's in use

also, readers always see a completely isolated consistent snapshot



imagine lmdb actually writing one transaction in its b+ tree. it will be making a new tree with the changes, bottom-up, while the readers are still being directed into the b+ tree through the previous root node. if this transaction completes successfully, it will have constructed a new root node, and from that moment on, readers go through this new root node into whatever part of the tree they wish to read.