As discovered in #108, we need a different implementation of `parameter.add_data()`. Ideally, we would like to make use of the `JSONB` type in-DB to avoid parsing the JSON field. However, with my current implementation, even that (which is only possible on postgres in the first place) is not at all faster than the pandas implementation (i.e. parsing the field to Python, using pandas, and writing it back).
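For concreteness, the pandas-based workflow could look roughly like this (a minimal sketch only: the `add_data_pandas` helper, the `values`/`units` column names, and the overwrite-on-duplicate semantics are illustrative assumptions, not the actual ixmp4 code):

```python
import json

import pandas as pd


def add_data_pandas(stored_json: str, new_data: dict[str, list]) -> str:
    """Parse the stored JSON field, merge new data with pandas, serialize back.

    Hypothetical sketch: treating "values"/"units" as the non-key columns
    and letting new rows overwrite existing ones are assumptions here.
    """
    existing = pd.DataFrame(json.loads(stored_json))
    incoming = pd.DataFrame(new_data)
    combined = pd.concat([existing, incoming], ignore_index=True)
    # Rows sharing the same indexset columns count as duplicates; keep the
    # incoming row so new values overwrite existing ones.
    key_columns = [c for c in combined.columns if c not in ("values", "units")]
    combined = combined.drop_duplicates(subset=key_columns, keep="last")
    return json.dumps(combined.to_dict(orient="list"))
```

The whole merge happens in memory, so the cost is one JSON parse, one pandas merge, and one serialization per call.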
I've tested this with a parameter with 1 indexset with 1000 values and with 60 indexsets with 1000 values (though the parameter was limited to 10000 values and units there): pandas is faster. Some numbers from my most recent test run:
You'll notice that `test_add_data_json[postgres]` is missing: this test currently fails because it struggles to read the `column.name`s at one point. Hardcoding them does work, so it's doable, and you can test locally: it's still slower than pandas.
In general, the `_json` suffix indicates the variants working directly in-DB, while those without a suffix are the pandas versions.
We might want to consider something like this if we want to use temporary tables in postgres.
Finally, for the 10000-values case, the in-DB workflow was not just slower: it didn't finish at all. My system remained responsive, but the test case wouldn't complete in more than 30 minutes. Pandas, on the other hand, took 2.8 seconds at most (mean of 2.5).
So please, @danielhuppmann, let me know if using pandas is okay or if we should find a way to utilize `JSONB` (which would be nice for sure).
And if we go for `JSONB`, @meksor, please help me figure out the proper way to use it, i.e. one that's actually faster than pandas.
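For what it's worth, the kind of in-DB `JSONB` expansion I have in mind could be sketched with SQLAlchemy like this (the `parameter` table, `data` column, and `values` key are placeholders, not the actual schema; the statement is only compiled here, not executed):

```python
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql

metadata = sa.MetaData()
parameter = sa.Table(
    "parameter",
    metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("data", postgresql.JSONB),
)

# jsonb_array_elements() unnests a JSONB array into one row per element,
# so filtering and merging can stay inside the database.
elem = sa.func.jsonb_array_elements(parameter.c.data["values"]).table_valued("value")
stmt = sa.select(parameter.c.id, elem.c.value).select_from(parameter, elem)

# Render the Postgres SQL without needing a live connection.
sql = str(stmt.compile(dialect=postgresql.dialect()))
```

Whether this beats pandas presumably depends on getting the unnest/aggregate steps right, which is exactly where I'd appreciate input.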
Thanks @glatterf42! I didn't have a chance to look at the code in detail, but if pulling all data and doing the comparison in memory is faster, that sounds like a good approach to me.