jim-easterbrook / pywws

Python software for USB Wireless WeatherStations
https://pywws.readthedocs.io/
GNU General Public License v2.0
204 stars 62 forks source link

SQLite storage optimisation #93

Closed mbirth closed 3 years ago

mbirth commented 4 years ago

I've noticed my text-based data only takes 71 MiB but after transferring everything into a SQLite database, the database is at 105 MiB.

This is probably due to floating-point numbers always being stored in 8 Bytes. However, I've also noticed my weather station (WH1080) only has 1 digit after the decimal point. So by multiplying all fp-numbers with 10, we could store them in an integer column and thus a value like e.g. 25.3 would be stored as 253 and only take 1 Byte instead of 8. A 33.4 would be stored as 334 and take 2 Bytes. So these 2 numbers alone would save 13 Bytes already.

Maybe add a config value precision that allows to set the factor by which fp-numbers are multiplied. In case other weather stations can deliver more precise numbers.

jim-easterbrook commented 4 years ago

Surely the only reason to use the SQL backend is to allow other programs to access the data, in which case it really needs to be in normal numbers.

mbirth commented 4 years ago

For testing, I manually multiplied all fp-values in the raw and calib tables by 10 in my database of 735790 records and did a VACUUM. The database is now at 67 MiB (from 105). (By also multiplying all remaining _min and _max values in the other tables - i.e., everything that's not calculated but taken 1:1 from the weather station, - I've only saved another 3 MiB. So it's not really worth it.)

So only changing those 2 tables would already improve things a lot. If it's documented, I don't see a reason why 3rd party apps couldn't do a SELECT temp_in/10.0 AS temp_in FROM calib or divide by 10 after selecting. If they're not using the pywws class to access the data in the first place, that is.

Running that thought somewhat further, you could also drop the monthly/daily/hourly tables and use the aggregate functions of SQLite for that.

A big advantage of using SQLite as the storage backend is also that you can easily backup ALL data by copying one single file.

jim-easterbrook commented 4 years ago

If this can be done without affecting normal "filestore" use (i.e. no change for the majority of users) then I suppose it might be worth it. I don't use SQL so I'd need the opinions of others who do: @jarvisms is the only one whose GitHub ID I know. Richard Turan and Ian Sutherland have recently been talking about SQL on the pywws mailing list - I'd like to see this discussed there first.

jim-easterbrook commented 4 years ago

PS My data is now around 200 MByte. I'd rather not back that up as a single file - daily backups of the few files that have changed is very quick.

jarvisms commented 4 years ago

The sqlite3data module aims to be as transparent to pywws as possible for full compatibility and 100% equivalence to filedata, but also so its easy to interrogate the data directly via sql queries with readable results for external apps, and lastly be reasonably portable so it could form the basis of another SQL based backend module (i.e. for MySQL or whatever).

I did originally store floats as x10 integers to save space by catching the WSFloat type and converting it on the fly, but this lost precision with the calculated tables (such as wind_ave). Of course this can be changed with some deeper modifications to pywws or more elaborate converters but, for me, it wasn't worth the effort (disk space is cheap afterall). Third party apps must then also do these conversions so the last goal above isn't met, even though its easy and can be documented. Currently, the only conversions in play are for the status bits (which most people won't care about anyway) and the timestamp where standardised format is used instead of plain text.

Aggregate functions were also considered via SQL views for the various summary tables (daily, monthly etc.) but it turned out not to be straight forward to support the day end hour and pressure offset config options. It's also difficult to calculate all of the calculated parameters directly in a way that's equivalent to the original python code - think about average wind direction with exponentially smoothed vectors! This was certainly beyond my ability/patience!

I'd be interested to see this module expanded if you want to fork the repo - particularly if the original objectives above can still be met. As Jim says, it would be good to get some feedback from the others in the mailing list forum who look like they may be using sqlite3 for third party use outside of pywws. These posts can be copy & pasted there to everyone is up to speed.