oetiker / rrdtool-2.x

RRDtool 2.x - The Time Series Database

Minimize storage I/O #13

Open jfesler opened 11 years ago

jfesler commented 11 years ago

In my experience, the single biggest cost of deploying RRD has been the I/O cost of a single .rrd file: seek, read header, seek, write header, and one or more (seek to RRA + offset, write) operations.
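
For illustration, a rough C sketch of that per-update syscall pattern (this is not rrdtool's actual code; the struct, offsets, and function name are invented):

  /* Rough sketch of the per-update I/O pattern described above.
   * Not rrdtool code: live_head, head_off and rra_off are invented. */
  #include <sys/types.h>
  #include <fcntl.h>
  #include <unistd.h>

  struct live_head { double last_value; long long last_update; };

  int update_one_rrd(const char *path, off_t head_off, off_t rra_off, double v)
  {
      int fd = open(path, O_RDWR);
      if (fd < 0)
          return -1;

      struct live_head h;
      pread(fd, &h, sizeof h, head_off);   /* seek + read header */

      h.last_value   = v;
      h.last_update += 60;
      pwrite(fd, &h, sizeof h, head_off);  /* seek + write header */

      pwrite(fd, &v, sizeof v, rra_off);   /* seek + write one RRA slot,
                                              repeated for every RRA touched */
      close(fd);
      return 0;
  }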

Smashing more DS's into a single RRD helps, to an extent, but it trades away flexibility - particularly when the logical breakdown of what goes into a single .rrd needs to add or remove DS's to accommodate changes in application instrumentation.

Likewise, reducing the number of RRA's helps with some of the cost. A single RRA of 1-minute data for 1 year reduces the write cost, but it carries a penalty when graphing long periods of time later.

Throwing hardware at it only goes so far. RAID10 with fast drives helps, but even with that, some of our deployments still require 20 or more RRD servers. Distributing the data across that many servers then creates a follow-up problem: bringing the data back to one place for a single combined graph.

While these problems exist no matter what, some emphasis on the I/O cost of a single update may alleviate the problem outright for a number of customers, and at least lower the capex costs for the rest.

oetiker commented 11 years ago

any thoughts on how to optimize? one thing I am thinking of looking at is to somehow work more with 4k blocks ... one idea would be to use the first 4k of the rrd for the headers plus a small 'journal' where data is accumulated, to reduce the number of times the multiple blocks further into the rrd file have to be written ...
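
For concreteness, one possible shape of such a first block, purely as a sketch; the field names and sizes below are assumptions, not an agreed 2.x format:

  /* Hypothetical layout for the "first 4k = header + journal" idea.
   * Nothing here is an agreed rrdtool 2.x format. */
  #include <stdint.h>

  #define RRD_BLOCK 4096

  struct journal_entry {              /* one deferred update */
      uint64_t timestamp;
      uint32_t ds_index;
      uint32_t pad;
      double   value;
  };

  #define JOURNAL_SLOTS ((RRD_BLOCK - 256) / sizeof(struct journal_entry))

  struct rrd_head_block {             /* the first 4k of the file */
      char     cookie[8];             /* magic + version */
      uint64_t ds_count;
      uint64_t rra_count;
      uint64_t journal_used;          /* deferred updates currently queued */
      uint8_t  reserved[256 - 8 - 3 * 8];
      struct journal_entry journal[JOURNAL_SLOTS];
  };

Under this sketch, updates would append to journal[] until it fills (or a timer fires), and only then would the RRA blocks further into the file be rewritten.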

jfesler commented 11 years ago

As I was writing this earlier, the thought of journaling did come to mind. At the least it would reduce most updates to seek, read, seek, write - with the "occasional" (by some definition) RRA aggregation. I would imagine this would add quite a bit of complexity to fetch and graph, unless they took the simple approach of flushing the journal before fetching data. Journaling in this manner would also benefit from being NFS-accessible, to handle multi-node RRD clusters.

If RRD became a network service (issue #14 just filed separately to keep #13 brief), it could incorporate a high level of write deferment pretty easily. I see a lot of other value with #14 as well, from a graphing and UI perspective.

oetiker commented 11 years ago

I would imagine that, with a clearly separated data storage layer as stipulated in another issue, the complexity of the journal handling could be nicely packed away ...

luqasz commented 10 years ago

How about reading the whole rrd file into memory? Then you could make changes in memory, which almost doesn't suffer from any i/o hell. After all operations, just write the whole file back to the filesystem. This approach may even reduce i/o hell in the existing implementation of rrd. Nowadays memory is very cheap.

jfesler commented 10 years ago

On Mon, Dec 9, 2013 at 12:48 PM, luqasz notifications@github.com wrote:

How about reading the whole rrd file into memory? [...]

I've done this with ramdisk-based file systems, without any need for RRD itself to try to do this. The use case was a small group of files (small enough to fit in RAM, compared to the usual set) that were updated once a second instead of once a minute, particularly while trying to debug some specific system performance issues.

Do we really need RRD to try and do this, instead of letting OS features do this?


oetiker commented 10 years ago

There is a feature idea for rrdtool to NOT write out the rrd updates immediately but rather cache them in memory, similar to what rrdcached does today, except that it would not cache the input but rather the output before it is written to disk ... this feature would require rrdtool to operate as a daemon, but the cool thing would be that when asked to provide data it could draw on its memory cache on top of the data read from disk, and would thus not be required to flush everything before being able to satisfy fetch requests.
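
A loose sketch of that fetch path, with invented types (this is neither rrdtool nor rrdcached code): the daemon keeps the not-yet-written rows per file in memory and overlays them on whatever it reads from disk, so nothing has to be flushed before answering.

  /* Sketch only: overlay an in-memory output cache on data read from disk
   * when serving a fetch. Types and names are invented for illustration. */
  #include <stddef.h>
  #include <stdint.h>

  struct pending_row {                /* one consolidated row not yet on disk */
      uint64_t slot;                  /* RRA row index */
      double   value;
  };

  struct write_cache {                /* the daemon's per-file output cache */
      struct pending_row *rows;
      size_t              count;
  };

  /* Fill out[0..n) for RRA rows [first, first+n) from disk, then overlay
   * any cached rows so the caller sees the newest data without a flush. */
  void fetch_with_cache(const struct write_cache *wc,
                        double *out, uint64_t first, size_t n,
                        void (*read_rows_from_disk)(double *, uint64_t, size_t))
  {
      read_rows_from_disk(out, first, n);        /* on-disk state */
      for (size_t i = 0; i < wc->count; i++) {   /* newer, unwritten state */
          uint64_t s = wc->rows[i].slot;
          if (s >= first && s < first + n)
              out[s - first] = wc->rows[i].value;
      }
  }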

luqasz commented 10 years ago

now that would be a really nice feature. my previous proposal was just a loose thought; i haven't done any benchmarks. i just thought that since disks are optimised for bigger chunks of data, rrdtool could take advantage of that.

rkubica commented 10 years ago

On Mon, Dec 9, 2013 at 12:48 PM, luqasz notifications@github.com wrote: How about reading the whole rrd file into memory? [...]

jfesler commented: I've done this when ramdisk based file systems - without any need for RRD itself to try and do this. [...] Do we really need RRD to try and do this, instead of letting OS features do this?

Linux does this quite efficiently without requiring a ramdisk; just tune the vm sysctls:

  vm.dirty_ratio = 60
  vm.dirty_background_ratio = 50
  vm.dirty_writeback_centisecs = 3000
  vm.dirty_expire_centisecs = 720000

which will keep dirty pages buffered for up to 2 hours; ie, 1 disk write covers all the updates that land in a page within 2 hours, or 4096/8 = 512 updates (a 4k page of 8-byte values), whichever comes first.

this reduces the dirty-memory requirement to the 4k header plus 4k per RRA. granted, the system needs the free memory to do this, but memory is cheap and this is a huge gain.

luqasz commented 10 years ago

thx for this information. it would be really nice to have it on the rrdtool wiki/docs somewhere. i will read more on these settings.

pabigot commented 10 years ago

If rrdtool-2.x imposes a requirement that the host system support POSIX mmap(), then #29, #13, and all the caching issues of #14 become nearly trivial: open each database as a shared memory-mapped file and interact with it as a data object in memory.

If that requirement is not present then you need to worry about journaling, inter-process cache management, and a bunch of other stuff that the OS is almost certainly going to do already and better.

If rrdtool-2.x needs to work on systems that don't provide mmap, then what are the expectations for the host environment for rrdtool-2.x? Because they're probably going to have architectural impact on other parts of the system too.
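
For reference, a minimal POSIX sketch of the "shared memory-mapped file" interaction described above; the offset parameter is invented and this is not rrdtool's on-disk layout:

  /* Minimal mmap() sketch: map the file shared, update a value as a plain
   * memory write, and let the kernel (or an explicit msync) do the I/O.
   * value_off is an invented offset, not part of any rrd format. */
  #include <fcntl.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int update_mapped(const char *path, size_t value_off, double v)
  {
      int fd = open(path, O_RDWR);
      if (fd < 0)
          return -1;

      struct stat st;
      if (fstat(fd, &st) < 0) { close(fd); return -1; }
      size_t len = (size_t)st.st_size;

      void *base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      close(fd);                                 /* the mapping stays valid */
      if (base == MAP_FAILED)
          return -1;

      memcpy((char *)base + value_off, &v, sizeof v);  /* "just memory" */

      msync(base, len, MS_ASYNC);   /* optional: nudge writeback now */
      munmap(base, len);
      return 0;
  }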

oetiker commented 10 years ago

the whole storage layer will be much better structured, so it should be possible to do both ... have mmap as well as normal i/o in a clean implementation.

pabigot commented 10 years ago

But that's my point: The requirement in this issue ("reduced storage I/O") is satisfied using mmap(), with the responsibility for the hard parts delegated to the host environment. What underlying requirement forces you to also support a different solution to the same problem, where rrdtool-2.x becomes responsible for the hard parts?

rkubica commented 10 years ago

rrdtool does use mmap() when a file is accessed.

its issue with mmap() is that it does not keep the file open and mapped and then just use msync(); that is because it is the file that gets shared between processes, not the in-memory mapping of the file. it can't do so for various reasons, or at least can't do so in an efficient manner yet ( 2.x ! )

in other words, there are a few dataformat changes that need to be made so it is simpler for the application to let the OS manage the file via mmap(), and likewise still 'work ok' (or as well) in a non-mmap world.

pabigot commented 10 years ago

I don't understand much of that last comment except the point that rrdtool 2.x would have to call msync() to make updates visible, which doesn't seem any different from how rrdcached handles flushing now. I'm still unclear what's motivating the desire/plan to implement the storage layer in two different ways, but really it's not my problem to solve.