vergoh / vnstat

vnStat - a network traffic monitor for Linux and BSD
GNU General Public License v2.0
1.45k stars 123 forks source link

Database size growing at alarming rate. Best ways to curb it? #244

Closed rpodric closed 1 year ago

rpodric commented 1 year ago

Coming from the world of 1.x, where the files were a few KB and seemed to stay that way forever, there's more of a learning curve, it seems, with Sqlite, where the database size is growing by leaps and bounds over just the course of one week. That may be irrelevant on a large system, but not on an embedded one like a router, which is where I use it (OpenWRT, with Vnstat 2.9-1). And since it may never be compacted (or "vacuumed," as I think it's called in the Sqlite world), this is of more interest.

I have not made any changes to vnstat.conf yet. This would have obvious effects on size:

;5MinuteHours 48 ;HourlyDays 4 ;DailyDays 62 ;MonthlyMonths 25 ;YearlyYears -1 ;TopDayEntries 20

Though I would have to severely curtail it at the rate I'm going (no more than a few weeks).

So I'm wondering if there's anything else that would have an effect, like:

how often (in minutes) data is saved to database ;SaveInterval 5

create database entries even when there is no traffic (1 = enabled, 0 = disabled) ;TrafficlessEntries 1

use SQLite Write-Ahead Logging mode (1 = enabled, 0 = disabled) ;DatabaseWriteAheadLogging 0

change the setting of the SQLite "synchronous" flag (-1 = auto, 0 = off, 1, = normal, 2 = full, 3 = extra) ;DatabaseSynchronous -1

vergoh commented 1 year ago

The database size in versions 1.x was static regardless of the amount of data in it. It was also static from configuration perspective since it could contain only specific numbers of hours, days and month but nothing more and those numbers couldn't be configured at all.

With versions 2.x, the database is indeed a real database provided by sqlite. The nature in sqlite is that it will overallocate the necessary database size to have some buffer if there's more data coming in and then repeat this process once the existing size limit is reached. Another difference to note is that the database contains the data for all interfaces while versions 1.x had separate database files + backups for each interface. Versions 2.x also have the new 5 minute resolution entries by default for the last two days, which results in 576 entries per interface. That's easily the majority of the database size from database row count perspective since the other resolutions don't store so many rows being active. As a result, if you really want to minimized the database size then you can set 5MinuteHours 0, assuming you have no use for that resolution.

However, do note that the growth of the database isn't linear by time. It grows the most during the first 48 hours due to the 5 minute resolution, then less during the first 4 days and even less during the first 62 days. After that it will grow far less during the next 25 months and the yearly growth isn't likely to be of any concern since it's only one extra row per year after that (which obviously can also be disable if not needed).

You didn't mention which sort of file size you see and what you'd consider too big. As an example, in one system where vnStat has been running since 2010 with one interface with the yearly data since 2017 and default data retention settings, the database file is currently 79872 bytes. Looking from backups, the filesize has also been exactly the same at least for the last 2 months. You can multiply that by the number of interfaces you are monitoring for getting some idea what the database filesize will eventually be.

rpodric commented 1 year ago

Great background.

I'm 2 to 3 days in on two different routers. The one recording 2 interfaces is sitting at 172KB, while the one with 4 is at 236KB (imported history in each case just dates to sometime last year), and until your message I didn't have any expectation that would be slowing anytime soon. It sounds like that could be the extent of it though, but we'll see. It's interesting that your much older one held lower, though that was one interface.

Do they ever decrease? For example, if someone decides to discontinue the 5 minute resolution, as I might, but they do so only after 48 hours, is that space ever recovered without deleting and starting over?

Thanks

vergoh commented 1 year ago

I did some quick testing regarding how the database filesize is affected by the number of rows in it:

However, looking at the implementation side, there appears to be a little bit of a problem, bug or lack of implementation depending on how you want to see it. There's currently no logic for calling vacuum if any of the data resolutions is disabled and the existing data will not get automatically discarded either if some data existed when the configuration was set to 0. As a result, your options for reducing the database filesize if you opt to change set the configuration of any of the items to 0 are one of the following:

  1. Wait for the next release, I'll fix this clear issue
  2. Set the data rentention to 1 instead of 0 as still will still trigger the data cleanup but you'll then still need to vacuum manually for getting the filesize minimized
  3. Use 0 and discard the data manually followed by vacuum
  4. Start with a fresh database (most likely the least practical solution)

For "manual" option, you'd need to modify the database with the sqlite3 cli command.

rpodric commented 1 year ago

Very informative!

Will the changes you're making to the next release require vacuum to actually be available? Because I don't think it is in some embedded vnstat2 scenarios, unless it really is part of libsqlite3-0 (which is installed along with vnstat2) in some subtle way. If it will, then a new dependency, such as sqlite3-cli (or whatever is appropriate), would need to be accounted for by those who maintain the packages.

Thanks

vergoh commented 1 year ago

https://www.sqlite.org/lang_vacuum.html gets executed via sqlite3_prepare_v2() (https://www.sqlite.org/c3ref/prepare.html) + sqlite3_step() (https://www.sqlite.org/c3ref/step.html). Those are part of libsqlite without dependency to sqlite cli.

rpodric commented 1 year ago

Hi there. I thought I'd check back in and mention these results, which are from one week and having started fresh with no history and two interfaces. I also made sure to uncomment and set "5MinuteHours" to 0 right away.

DB size is 368KB, so I think I might have misunderstood the way to minimize this (in the existing version).

Thanks

vergoh commented 1 year ago

That's indeed larger than predicted. I wonder if this has something to do with the OpenWRT setup, the used architecture or musl being used instead of glibc. I'll have to see if I can come up with some sort of test environment for this.

rpodric commented 1 year ago

Anything I can do to help. as it seems to reliably grow about 50K a day (476K now, for example). I should clarify that the version does lag a bit behind current: https://openwrt.org/packages/pkgdata/vnstat2

vergoh commented 1 year ago

I haven't so far found anything that would explain the database size you see and the apparent growth associated to it. Looking at those systems that I have available, those previous quick test numbers don't appear to be that accurate as it feels like the sqlite version used to originally create the database will also affect it a little bit as older versions had a page size of 1024 bytes while currently releases use 4096 bytes. However, this shouldn't really matter once there's more data in it but it's still odd that the largest longer running database I have is only 264K and that's with 4 interfaces.

I checked the configuration for OpenWRT from https://github.com/openwrt/packages/tree/openwrt-22.03/libs/sqlite3 and there's nothing in there either which would, for example, select some features that could explain the database file size. With an arm based raspberry pi and running Alpine in a container for getting musl I did get 224K for two interfaces, but that's still short of the 476K you see (unless there's also some filesystem catch with OpenWRT I'm not aware of).

Would it be possible for you to make that database file downloadable or to send it to me via email? I'm somewhat running out of ideas other than to look what's inside and see if the size can be explained by the number of rows or some other content.

rpodric commented 1 year ago

224K after how long though? If you mean after 3 or 4 days, that's on track.

I bet something will jump out at you when looking at the db.

vergoh commented 1 year ago

That 224K was after few years.

Looking at the content of your database, it shows 6266 rows of 5 minute data, which would explain the size and the growth. Please check that you indeed have 5MinuteHours in the configuration set as 0 since the data would suggest it's currently -1 or some large number.

rpodric commented 1 year ago

It was (and is) set, so I'm not sure how that can be unless some other line in vnstat.conf is relevant. And at the time I ran this, which was probably necessary to make it take effect: /etc/init.d/vnstat restart

5MinuteHours 0 ;HourlyDays 4 ;DailyDays 62 ;MonthlyMonths 25 ;YearlyYears -1 ;TopDayEntries 20

rpodric commented 1 year ago

I thought of something that might explain this, and if true would be due to my misunderstanding. Do you think that 5MinuteHours should be on 0 before the database even exists? Because that I didn't do; I set it a minute or so afterwards. If true, I wonder if this is something that the upcoming change you're making would effectively compensate for, because it would essentially be vacuumed to near-nothing.

vergoh commented 1 year ago

The database you sent has 5 minute data starting from 2023-03-11 17:30 and ending 2023-03-22 15:30 so it's behaving as if the configuration 5MinuteHours had been set to -1 or some large number. I'll have to test this locally if there's some bug somewhere since just by looking at the code path, the data input happens in https://github.com/vergoh/vnstat/blob/master/src/dbsql.c#L848 followed with condition https://github.com/vergoh/vnstat/blob/master/src/dbsql.c#L888 which clearly isn't being followed. If you changed the configuration after starting the daemon and not restarting it then the previous configuration should still be in effect and this shouldn't be happening either (unless that previous configuration had -1).

You are correct, the configuration should be changed before the database exists and before the daemon is started if you'd want to make sure that the database doesn't contain the data set as 0. That's the limitation of the current (up to and including 2.10) releases.

vergoh commented 1 year ago

Ok, found a bug which makes all current (up to and including 2.10) releases misbehave if any of the data retentions is set to 0. The behaviour ends up being same as if -1 had been used. The issue is with https://github.com/vergoh/vnstat/blob/master/src/dbsql.c#L888 where

if (featurecfg[i] == 0) {

should have been

if (*featurecfg[i] == 0) {

for making the correct comparison. For some reason, I had missed writing a test case for this condition and that's why this wasn't found earlier. Thanks for reporting and persisting with the issue!

The closest workaround for minimizing the database size in your case would be to set 5MinuteHours with value 1. That will result in some data getting logged for the 5 minute resolution data but not that much that it would make a big difference in the database size. You'll however again need to start with a new database due to the vacuum command not being executed in version 2.9 for reclaiming the freed database space.

rpodric commented 1 year ago

Thanks again for all the attention on this. You're going to get a kick out of this, and I saw it before but thought it was a problem with using 1, but I'm finding that in this version on this platform, if 5MinuteHours is less than 48, then the Vnstat GUI breaks: https://i.imgur.com/cHilB1R.png

I think that's why I went to 0.

I see this in the release notes for v2.10: "Image output of 5 minute graph wasn't possible if data retention of the 5 minute time period was configured as unlimited," which sounds related but it doesn't break for me when unlimited, which is interesting since I'm using the version before v2.10. So, somehow I'm not seeing the problem that was fixed in the next version but am seeing one that wasn't.

rpodric commented 1 year ago

If you have any ideas on the display issue when it's set anywhere from 1-47, I'll delete and start over on 1, but it's a little hard to do that before then. Perhaps another setting also should change at the same time?

Checking in on the two routers, One is at 552KB, so it has slowed its pace. The other, which started at the same time but has 4 interfaces, is one that I haven't been watching that closely but is at 1.3MB now.

vergoh commented 1 year ago

I'm guessing the OpenWRT dashboard(?) you have in that screenshot is trying to produce the images using the vnstati command by using the -o - parameter which will directly dump the image to stdout which the dashboard then redirects to the image request for your browser. The problem in this case is that the configuration file is getting validated and this print is being show if set 5MinuteHours 1:

Config: Value "1" for 5MinuteHours is too small for 5MinuteGraphResultCount with value "576" (smallest supported value: 288). Value for 5MinuteHours needs to be at least 24 if 5MinuteGraphResultCount is set to 288.

Things break because this config validation print isn't aware which parameters have been given to vnstati and will display this information regardless of the parameters. This applies to all output types but more critically also to -o - which results now in broken images since every image "file" will have that line above on the first line. The reason why 288 is the minimum for 5MinuteGraphResultCount is because with smaller values the graph becomes too narrow to fit all necessary image elements and 288 is 24 hours worth of 5 minute data.

Due to the configuration validation, versions starting from 2.7 are affected and the smallest usable value for 5MinuteHours is 24 and 5MinuteGraphResultCount needs to be set to 288 if image output is used with the -o - parameter. Another issue to get fixed for version 2.11.