utwente-energy / alpg

Artificial Load Profile Generator for DSM
GNU General Public License v3.0

Performance of generation of profiles #2

Closed GuusK closed 1 year ago

GuusK commented 3 years ago

I've been looking to use ALPG to generate profiles for my thesis. While running the generator, I've noticed that it takes quite a while to run. I've looked into it and I think it can be improved by doing 2 things: generating data for multiple households at the same time and writing the data to disk in a different way. For both of these issues, I've done some quick tests and both seem feasible. Separately, they have each shown a significant execution time reduction (20% or more on my laptop).

Generating data for multiple households

Currently the generator generates the data for a single household at a time, writes it to disk, and then starts on the next household. If I understand the code well, there is no normalisation or relation of any kind between the data of different households. Generation could thus be parallelized, at the cost of a bit more randomness in the output: the random function is no longer called in the same order for the same house during data generation.
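The independence argument above can be sketched with a process pool. This is a hypothetical illustration, not ALPG's actual code: `generate_household` stands in for the per-house generation logic, and the 96-value profile (one day at 15-minute resolution) is an assumed placeholder.

```python
from multiprocessing import Pool
import random

def generate_household(house_id):
    # Stand-in for ALPG's per-house profile generation (hypothetical).
    # Each house gets its own RNG, so houses stay independent of each other.
    rng = random.Random(house_id)
    return house_id, [rng.random() for _ in range(96)]

def generate_all(num_houses, workers=4):
    # Because households share no state, a process pool can generate
    # them concurrently; results are collected into a dict per house.
    with Pool(workers) as pool:
        return dict(pool.map(generate_household, range(num_houses)))
```

Since each worker owns its RNG, parallel execution changes *which* random numbers a house sees compared to the sequential tool, but not the statistical properties of the output.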

Writing data to disk in a different way

The generator currently generates the data for one household, then writes it to disk. Because of the format it is stored in, it has to look for the end of each line and append to it, which is quite time consuming. There are different ways to solve this, but I propose two.

The first would be a change in output format. Instead of splitting the data over multiple files, it could be that a single file contains all the data of one household. This way, the file can be written separately, without having to look anything up and as one big batch of data.

While I do think that this is a nice output format and it would actually help me a lot (since I'm now working on a lot of code to split and rearrange the data back into households), I don't think it is the right way to go. The reason is that it would significantly change the output format and require a rewrite of any program that takes ALPG output as input.

The second solution would be to generate data for multiple houses and write them out at the same time. This is faster as it reduces waiting for IO. The downside is that it requires more RAM, since the data of multiple houses has to be held in memory at the same time.
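The batched-writing idea can be sketched as follows. This is a minimal, hypothetical example (the row-per-house CSV layout and function name are assumptions, not ALPG's real format): households are buffered in memory and flushed to disk in groups, trading RAM for fewer IO round-trips.

```python
import csv

def write_households_batched(filename, households, batch_size=10):
    # Buffer up to batch_size households, then append them in one write.
    # households: iterable of (house_id, profile) pairs (hypothetical layout:
    # one CSV row per house, house id first, then the profile values).
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        batch = []
        for house_id, profile in households:
            batch.append([house_id] + list(profile))
            if len(batch) >= batch_size:
                writer.writerows(batch)  # one IO operation per batch
                batch = []
        if batch:
            writer.writerows(batch)      # flush the final partial batch
```

A configurable `batch_size` gives the RAM/speed trade-off mentioned above: `batch_size=1` behaves like the current per-house writing, larger values reduce IO waits at the cost of memory.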

Proposal

I propose to implement concurrent data generation for households and batched writing to csv files, both configurable within the config. This way, one should be able to run ALPG as originally written. However, if one chooses to enable them, data generation will be faster, which can be quite significant when generating a large number of houses.

I know this is quite out of the blue, so please let me know if you're even looking for something like this. Any thoughts and comments are much appreciated.

GENETX commented 3 years ago

Thanks for your message! Your analysis seems to be right and I do believe that your proposal is a nice method to keep support for legacy systems while improving the generation speed significantly. I wouldn't go for the second solution as it may indeed occupy too much RAM at some point (one of the models I have on my disk is roughly 5GB in total). I think that a different data structure is not bad and, if kept largely similar, it should be possible to incorporate it in other tools such as our own Smart Grid simulation tool DEMKit.

One thing I would add is a random seed for each house generation process, such that models can be regenerated and easily shared among researchers as input datasets (e.g. for comparison of optimization methods).
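The per-house seed suggestion could look something like the sketch below (hypothetical helper, not part of ALPG): deriving each house's RNG from a model-level seed plus the house id makes a model reproducible regardless of the order or process in which houses are generated.

```python
import random

def house_rng(model_seed, house_id):
    # Derive a deterministic per-house RNG from a model-level seed.
    # Seeding with a string combining both values is a simple, portable
    # way to get an independent, repeatable stream per house (assumption:
    # this helper is illustrative, not ALPG's actual seeding scheme).
    return random.Random(f"{model_seed}-{house_id}")
```

With this, regenerating a model with the same `model_seed` yields identical profiles even under parallel generation, which is what makes shared input datasets for comparing optimization methods feasible.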

Furthermore, you can also have a look at one of the forks that implemented multiprocessing: https://github.com/MartinSchmidt/alpg

To answer the question whether we are looking for something like this: I am not actively looking for anything on this matter. As you may have seen, the development is not really active, except for some bugfixes. In my own opinion, a complete rewrite of the tool would be best in the end. But for now it does satisfy our demands, albeit very slowly indeed ;) . Since it is not our core focus, time and resources are limited. However, you are free to fork it or to propose a pull request to enhance the current tool and I am happy to incorporate it!

Furthermore, good luck with your thesis and feel free to contact me on my University of Twente email address if you would like to know more about the ALPG or our research.