spine-tools / Spine-Toolbox

Spine Toolbox is an open source Python package to manage data, scenarios and workflows for modelling and simulation. You can have your local workflow, but work as a team through version control and SQL databases.
https://www.tools-for-energy-system-modelling.org/
GNU Lesser General Public License v3.0

SpineDBManager: What to do if DB doesn't fit in memory? #1816

Open manuelma opened 2 years ago

manuelma commented 2 years ago

The SpineDBManager class controls all communication with Spine DBs within Spine Toolbox. It provides a cache where we store the results of all our SELECT queries: essentially a nested Python dictionary mapping db handle, to table name, to row id, to full row. UI components that need to show data from a database are given only the db handle, table name, and id, and they request the full row from SpineDBManager's cache as needed.

The cache is also used to run our ad-hoc integrity checks whenever the user adds or modifies data in the database. For example, when the user adds a new row to a table, we need to check that it doesn't conflict with any existing row and that it has all the necessary references in other tables.

For visualization, the cache is populated lazily as UI components request data. In other words, we don't load all the data from the DB into the cache immediately after we create the DB connection; we do it little by little as the user 'expands' the UI.

But for integrity checking, we absolutely need to have the entire contents of the database in the cache for - I would say - obvious reasons.

The question is: what will happen if a DB is so large that it doesn't fit into the memory allocated to the Spine Toolbox process? Will we at some point see Python's MemoryError followed by a crash? And if so, how do we remediate it?

We know from experience that if we ditch the cache approach and load data directly from the DB every time we need it, performance decays dramatically. Do we need some sort of hybrid approach (fill the cache whenever possible and otherwise read from the DB)? Could we do something special for SQLite (use an in-memory database, for instance)?

jkiviluo commented 8 months ago

But for integrity checking, we absolutely need to have the entire contents of the database in the cache

Big data in Spine DBs is almost always in parameter value fields. If we are not integrity checking inside parameter values, we should not need to read parameter value fields at all. Should we have another function, 'get_parameter_value_info' or something like that, where the value column is simply not fetched?

We are planning to have new parameter types where integrity checking should be done inside parameter value fields, but those would hopefully not be big data and could be handled separately, since they would have their own data type. (So, fetch parameter value fields only for specific data types.)

jkiviluo commented 8 months ago

Just realized this is way old - why did it come up in the recently updated issues... Anyway, maybe the issue still stands.

manuelma commented 8 months ago

Yeah, the details are old but the concept still applies.

I feel get_parameter_value_items itself should not bring the value; only when the user accesses the value field (like item["value"]) would we query and fill it in. It would be like an additional level of lazy loading.