servalproject / serval-dna

The Serval Project's core daemon that implements Distributed Numbering Architecture (DNA), MDP, VoMP, Rhizome, MeshMS, etc.
http://servalproject.org

rhizome.db growing without shrinking through meshms flood #106

Open gh0st42 opened 8 years ago

gh0st42 commented 8 years ago

While testing the new version of servald we encountered a big bug regarding sqlite and meshms.

We flooded one machine from another with 32,000 messages. After just a few messages we can only send one message per second, and a new file entry for the message is created in the database while the old journal is not removed, so each new entry is as big as the old one plus the new message. Once rhizome.db reaches a certain size, blobs are written to the filesystem instead.

A simple script to flood another host with messages can be found here: https://github.com/umr-ds/serval-tests/blob/master/meshms-flood

Even after a few hundred messages you can see that every few seconds the database grows by another megabyte, even though the messages are only 53 bytes long.

Using the RESTful API or the command line doesn't make a difference.

lakeman commented 8 years ago

Journal bundles use fewer network bytes to transfer, but we currently re-write the whole blob on the filesystem of each node, mostly so we can re-hash the payload bytes and commit the new version atomically (ish). Changing that is not simple.

We could impose a MeshMS ply size limit and advance the tail of the bundle, a feature of journals that we've designed and discussed but haven't built into the client API yet.

We've also planned for multiple rhizome bundles to be created for the same file hash, so we run a garbage collection process every 30 minutes (ish) to clean out any orphans: https://github.com/servalproject/serval-dna/blob/ebb7500119d9efab331ddd9eb1817ea08b23c5ab/server.c#L586 This might need to run on some other trigger, and probably isn't being tested very well.

We've also wanted to build a new storage layer for some time, with a number of technical improvements. In other words, a better git object store.

I keep wanting to start this, but we haven't had a pressing need or the budget to do this yet.


gh0st42 commented 8 years ago

Thanks for your explanation so far. The problem is that at the moment we have something like this in the database:

  1. msg1
  2. msg1 + msg2
  3. msg1 + msg2 + msg3 ...

And all versions are kept, wasting so much disk space (for 3 messages, 6 messages' worth are kept on disk). A few kilobytes of text can end up as several tens or even hundreds of megabytes on disk. In our test a 95 KB conversation produced a 111 MB rhizome.db :D
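For a rough sense of scale, here's a back-of-envelope sketch; the figures are my own arithmetic, not measured from servald:

```c
#include <stdio.h>

/* Back-of-envelope sketch: if every journal version is kept, n messages of
 * m bytes each cost roughly m * n * (n + 1) / 2 bytes of payload storage,
 * versus m * n bytes of actual message text. Numbers are illustrative only. */
int main(void)
{
    const unsigned long long m = 53; /* bytes per message, as in our flood test */
    for (unsigned long long n = 10; n <= 10000; n *= 10) {
        unsigned long long actual = m * n;
        unsigned long long stored = m * n * (n + 1) / 2;
        printf("%6llu msgs: %10llu bytes of text, ~%12llu bytes kept on disk\n",
               n, actual, stored);
    }
    return 0;
}
```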

This keeps growing and growing and doesn't really scale well. At least for our paper it makes MeshMS hard to sell, because large local communities using it as a daily communication system will run into a big storage problem on their nodes in a short amount of time. A few hundred messages per conversation is not really much if you look at usage statistics from current messengers. Especially during an emergency, people will write even more messages in a shorter time because of panic, eyewitness reports, contacting family, etc. Taking the small flash on routers and the precious storage on mobile devices into account, this is a big problem! And we're not even talking about malicious individuals or hackers here..

SQLite might cause some problems and make things slow, but here the problem is more the concept itself; storing the same data directly on disk doesn't help either. I understand what you wanted to achieve here, but the trade-off of network bytes vs. disk space imho is not working. If I have to store roughly 2 KB to transmit 7 messages of 53 bytes each (= 371 bytes, plus overhead = 483 bytes of real data), and this gets even worse over time, I might be better off transmitting the whole 371 bytes over the wire/air, at least on Bluetooth or WiFi links. Sure, modern computers have lots of disk space (our TP-Link routers don't :( ), but while testing I had 35 GB of blobs in my database just for one long conversation.

Also, I'm not sure the garbage collection will help that much; timing problems aside, getting rid of the oldest entries means getting rid of the smallest ones. Even if we only keep the 3 newest per conversation, once the database has reached a significant size (long-term use) the historic copies will also be quite large. But short term we could reduce a 100 MB database back to 1 MB or something like that, which would be good for the moment.

Getting back to finding a solution: would it be possible to have the journal as a kind of meta-file in rhizome, actually a linked list where data gets appended and individual entries can be sent? From the outside there would be just one file in the database for the conversation, but it would consist of several small encrypted portions (single messages) in the correct order. This way one could bulk-request the last 3 entries, but we wouldn't need to keep every combination. I've probably missed something here; a rough sketch of what I mean is below.
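Something like this hypothetical on-disk layout is what I have in mind (purely an illustration of the idea, not anything rhizome does today):

```c
#include <stdint.h>

/* Hypothetical layout for the "one journal file, many small entries" idea.
 * Each entry is appended to the end of the conversation file; nothing is
 * ever rewritten. This only illustrates the proposal. */
struct ply_entry_header {
    uint64_t prev_offset; /* byte offset of the previous entry, 0 for the first one */
    uint32_t cipher_len;  /* length of the encrypted message body that follows */
    uint8_t  type;        /* e.g. message or ack */
};

/* A reader wanting the last 3 messages starts at the end of the file and
 * follows prev_offset three times; the store never needs to keep the
 * msg1, msg1+msg2, msg1+msg2+msg3 combinations as separate payloads. */
```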

I don't like complaining to someone else without offering real solutions myself, but this problem is probably not one that can be fixed with a few lines of code, and it has big implications for long-term use by larger communities and for selling it in our disaster scenarios.

gh0st42 commented 8 years ago

Also, wouldn't some kind of ACK from the conversation parties be enough to discard messages both parties have already received? There's no need for all nodes in the network to keep the history for eternity. If a new node comes by that has old messages, it can discard them as soon as it also sees the ACK, and middle nodes only need to keep a journal of what hasn't been ACK'd so far. It would be a bit more management and meta-information that needs to be distributed, but it could help in the long run to keep the network clean and in a working state...

lakeman commented 8 years ago

I'm not disagreeing with you. Running;

$ servald rhizome clean

will hopefully tidy everything up. This is what we try to run every 30 minutes; clearly for your test case that isn't often enough. Triggering a cleanup based on some kind of used/free ratio somewhere in; https://github.com/servalproject/serval-dna/blob/development/rhizome_store.c#L197 will probably help a lot.
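As a rough sketch of that trigger (the names below are made up, not the real servald internals):

```c
#include <stdint.h>

/* Sketch only: the three functions below are hypothetical stand-ins for
 * whatever the store layer can report about its own size, and for the work
 * that "servald rhizome clean" already does. */
extern uint64_t store_live_bytes(void);    /* bytes still referenced by current manifests */
extern uint64_t store_total_bytes(void);   /* current size of rhizome.db plus external blobs */
extern void     rhizome_cleanup_now(void); /* the existing cleanup, run on demand */

static void maybe_trigger_cleanup(void)
{
    uint64_t total = store_total_bytes();
    if (total == 0)
        return;
    /* If less than half of what we're storing is still live, don't wait for
     * the 30-minute housekeeping alarm; clean up immediately. */
    if (store_live_bytes() * 2 < total)
        rhizome_cleanup_now();
}
```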

Deleting orphan payloads shortly after the manifest is replaced will probably help too. But there's another reason we delay removing old payloads. If one of our neighbours is fetching the current version, and a new version arrives, we want to ensure that we can complete the delivery of the current version. If you have a bundle like this that might be rapidly changing, it's better to complete each transfer than to abort it because you have a newer version.

Solving all of these issues without creating new ones is complicated. I would much rather nuke it from orbit and start again (Plan A).

Anyway, on to Plan B. We teach the rhizome store layer to handle journal bundles differently;

1) Just before we finish writing journal payloads; https://github.com/servalproject/serval-dna/blob/development/rhizome_store.c#L722 Save the hash state "somewhere" with the payload. (Probably need to be careful about library versions, CPU endianness & struct field alignment)

2) When we open the journal again; https://github.com/servalproject/serval-dna/blob/development/rhizome_store.c#L1551 if advance_by == 0 && copy_length > 0, try to load the previous hash state. If that works, try to hard-link the existing payload file to the new temporary filename and seek to the file offset of the previous manifest.

If we can't create a new link on this filesystem (errno == EXDEV?), there's still a way to save space, but things are a bit more complicated. For other errors, just fall back to the current code path.
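Roughly, the hard-link part of step 2 would look something like this (path names are illustrative and the hash-state handling is left out):

```c
#include <errno.h>
#include <unistd.h>

/* Try to reuse the previous payload file by hard-linking it to the new
 * temporary name, so we only append the new bytes instead of copying the
 * whole journal again. Returns 0 on success, 1 if the link failed because
 * the two paths are on different filesystems (EXDEV), -1 otherwise. */
static int reuse_previous_payload(const char *old_payload, const char *new_tmp)
{
    if (link(old_payload, new_tmp) == 0)
        return 0;  /* success: open new_tmp, seek to the old end offset, append */
    if (errno == EXDEV)
        return 1;  /* different filesystem: space can still be saved, but it's more work */
    return -1;     /* anything else: fall back to the current re-write path */
}
```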

I think that should do it. Of course we'll need some test cases...

lakeman commented 8 years ago

Also note that using the meshms command line at the same time that rhizome synchronisation is occurring will cause delay and perhaps failures due to database locking.

Using curl to send messages via the RESTful API may also be adding a 1-second delay. By default curl sends an "Expect: 100-continue" header and waits for a "100 Continue" response. You can avoid this delay by adding a '-H "Expect:"' argument. I've just added this to our existing test cases.
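For what it's worth, a client driving the REST API through libcurl rather than the curl command line can suppress the header the same way; a minimal sketch:

```c
#include <curl/curl.h>

/* Suppress curl's automatic "Expect: 100-continue" handshake, which is what
 * adds the roughly 1 second pause per POST. Passing the header name with an
 * empty value removes it from the request. */
static struct curl_slist *disable_expect_header(CURL *curl)
{
    struct curl_slist *headers = curl_slist_append(NULL, "Expect:");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    /* Keep 'headers' alive until the transfer finishes, then release it
     * with curl_slist_free_all(). */
    return headers;
}
```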