iiasa / ixmp4

A data warehouse for high-powered scenario analysis in the domain of integrated assessment of climate change and energy systems modeling
https://docs.ece.iiasa.ac.at/ixmp4
MIT License

WIP: Update `parameter.add_data()` functionality #112

Open glatterf42 opened 3 weeks ago

glatterf42 commented 3 weeks ago

As discovered in #108, we need a different implementation of `parameter.add_data()`. Ideally, we would like to make use of the JSONB type in-DB to avoid parsing the JSON field. However, with my current implementation, even that approach (which is only possible on postgres in the first place) is not at all faster than the pandas implementation (i.e. parsing the field into Python, using pandas, and writing it back). I've tested this with a parameter with 1 indexset with 1000 values and with 60 indexsets with 1000 values each (though the parameter was limited to 10000 values and units there): pandas is faster. Some numbers from my most recent test run:
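For context, the pandas round-trip works roughly like this (a minimal sketch, not the actual ixmp4 code; the column names `values`/`units` and the index column layout are assumptions for illustration):

```python
import pandas as pd

def add_data(existing: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Merge new parameter data into existing data in memory.

    Rows whose index-column combination matches an existing row replace it
    (so their values/units win); all other rows are appended.
    """
    # Everything that isn't values/units is treated as an index column here.
    index_cols = [c for c in existing.columns if c not in ("values", "units")]
    combined = pd.concat([existing, new], ignore_index=True)
    # Keep the last occurrence per index combination, so the new rows win.
    return combined.drop_duplicates(subset=index_cols, keep="last").reset_index(
        drop=True
    )

existing = pd.DataFrame(
    {"region": ["a", "b"], "values": [1.0, 2.0], "units": ["kg", "kg"]}
)
new = pd.DataFrame({"region": ["b", "c"], "values": [3.0, 4.0], "units": ["t", "t"]})
result = add_data(existing, new)
```

The full workflow additionally has to deserialize the stored JSON field into the `existing` frame and serialize `result` back, which is exactly the overhead the JSONB variant was supposed to avoid.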

--------------------------------------------------------------------------------------- benchmark: 3 tests ---------------------------------------------------------------------------------------
Name (time in ms)                   Min                 Max                Mean            StdDev              Median               IQR            Outliers      OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_add_data[sqlite]           79.8390 (1.0)       79.8390 (1.0)       79.8390 (1.0)      0.0000 (1.0)       79.8390 (1.0)      0.0000 (1.0)           0;0  12.5252 (1.0)           1           1
test_add_data[postgres]        128.1381 (1.60)     128.1381 (1.60)     128.1381 (1.60)     0.0000 (1.0)      128.1381 (1.60)     0.0000 (1.0)           0;0   7.8041 (0.62)          1           1
test_add_data_json[sqlite]     276.1958 (3.46)     276.1958 (3.46)     276.1958 (3.46)     0.0000 (1.0)      276.1958 (3.46)     0.0000 (1.0)           0;0   3.6206 (0.29)          1           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

You'll notice that test_add_data_json[postgres] is missing: this test currently fails because it struggles to read in the column names at one point. Hardcoding them does work, so it's doable and you can test it locally: it's still slower than pandas.

In general, the `_json` suffix indicates the variants working directly in-DB, while those without a suffix are the pandas versions.

We might want to consider something like this if we want to use temporary tables in postgres.
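The temporary-table pattern would look something like the sketch below. It is illustrated here with the stdlib `sqlite3` driver and an in-memory database purely so it runs anywhere; the table and column names are made up, and on postgres the same shape applies (with, e.g., `CREATE TEMPORARY TABLE ... ON COMMIT DROP` and the server-side upsert doing the merge instead of pandas):

```python
import sqlite3

# In-memory SQLite stands in for postgres here; the pattern is the same.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE parameter_data (region TEXT PRIMARY KEY, value REAL)")
con.executemany("INSERT INTO parameter_data VALUES (?, ?)", [("a", 1.0), ("b", 2.0)])

# 1. Stage the incoming rows in a temporary table...
con.execute("CREATE TEMPORARY TABLE staging (region TEXT, value REAL)")
con.executemany("INSERT INTO staging VALUES (?, ?)", [("b", 3.0), ("c", 4.0)])

# 2. ...then merge them into the target in a single set-based upsert,
#    so conflicting rows are updated and new rows inserted in-DB.
con.execute(
    "INSERT INTO parameter_data "
    "SELECT region, value FROM staging WHERE true "
    "ON CONFLICT(region) DO UPDATE SET value = excluded.value"
)

rows = con.execute(
    "SELECT region, value FROM parameter_data ORDER BY region"
).fetchall()
```

Whether this beats the pandas round-trip presumably depends on how much data has to travel between client and server, which is exactly what the benchmarks above are probing.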

Finally, for the 10000-values case, the in-DB workflow was not just slower: it did not complete at all. My system kept working, but the test case hadn't finished after more than 30 minutes. Pandas, on the other hand, took 2.8 seconds at most (mean of 2.5). So please, @danielhuppmann, let me know if using pandas is okay or if we should find a way to utilize JSONB (which would be nice for sure). And if we go for JSONB, @meksor, please help me figure out the proper way to use it, i.e. one that's actually faster than pandas.

danielhuppmann commented 3 weeks ago

Thanks @glatterf42! I didn't have a chance to look at the code in detail, but if pulling all data and doing the comparison in memory is faster, that sounds like a good approach to me.