Improve performance of data collection

vvmruder commented 2 months ago

Intro

Coming back to #1544, we can see that many redundant steps are executed. This was mainly introduced to have a configurable server that can be altered at runtime, meaning the data integrator can change content in the database without restarting the server. It turned out that this use case is a rare one. Probably it is not used at all. Instead, new data is provided by a regular deploy, which means the underlying data and the server are completely set up again.

Initialising things once

One of the most interesting points, which was not touched in the recent refactorings and performance improvements, is the initialisation of the processor. It is initialised every time an ÖREB-related endpoint is called. If we agree on the statement made in the intro, one of the most effective performance gains would be to refactor pyramid_oereb so that the processor is initialised only once, at boot time. This would cut out all the initialisation that is currently done at these places:

https://github.com/openoereb/pyramid_oereb/blob/master/pyramid_oereb/core/views/webservice.py#L198
https://github.com/openoereb/pyramid_oereb/blob/master/pyramid_oereb/core/views/webservice.py#L217
https://github.com/openoereb/pyramid_oereb/blob/master/pyramid_oereb/core/views/webservice.py#L254
https://github.com/openoereb/pyramid_oereb/blob/master/pyramid_oereb/core/views/webservice.py#L270

The first three pointers have already been improved so that they initialise only the real estate source and omit the rest. This was done to improve performance. We could improve it in general if we do the processor initialisation once on server startup.

=> This has to be discussed, as it is an organisational decision to make pyramid_oereb recognise the configuration and data sources only at boot time and not on every request.
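
To make the idea concrete, here is a minimal sketch of "build once, reuse on every request". It is not how pyramid_oereb is wired today; `ProcessorHolder` and `build_processor` are hypothetical names, and the dictionary only stands in for the real processor. The `reset` method is an assumption on my side, to show that a re-initialisation without a restart would still be possible if we want it.

```python
import threading


class ProcessorHolder:
    """Builds the (expensive) processor once and hands out the same instance."""

    def __init__(self, factory):
        self._factory = factory  # hypothetical callable that builds the processor
        self._processor = None
        self._lock = threading.Lock()

    def get(self):
        # Double-checked locking: concurrent first requests build it only once.
        if self._processor is None:
            with self._lock:
                if self._processor is None:
                    self._processor = self._factory()
        return self._processor

    def reset(self):
        # Drops the cached instance so the next get() builds a fresh one.
        with self._lock:
            self._processor = None


build_calls = []


def build_processor():
    # Stand-in for the real, costly initialisation (readers, sources, config).
    build_calls.append(1)
    return {"real_estate_reader": object(), "plr_sources": []}


holder = ProcessorHolder(build_processor)
first = holder.get()   # would happen at boot time
second = holder.get()  # every later request gets the same object
```

In a Pyramid application the holder could live on the registry and be filled during application setup, but where exactly it belongs is part of the discussion.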

Parallelisation

I see one potential place where we could hook in to take proper advantage of parallel processing: https://github.com/openoereb/pyramid_oereb/blob/master/pyramid_oereb/core/readers/extract.py#L51-L104

Here all iterative querying of the sources is bundled, and here we could take action. However, we should discuss which technique we want to use and whether this should be configurable.

asyncio

Since this project builds on recent Python versions, we are able to use asyncio in combination with SQLAlchemy. This is probably the best solution in terms of a future-proof setup. However, it comes with some downsides. Asyncio is not fully supported across the Python stack and the libraries we may depend on. So a major task would be to research where we might be blocked from using it. The biggest upside of this solution is its scalability and its economical use of resources.
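
A rough sketch of what the asyncio variant could look like, using only the standard library. The source names are made up, `asyncio.sleep` stands in for an awaited database query, and the `max_concurrent` parameter is an assumption to show how the load on the database could be capped.

```python
import asyncio


async def read_source(name, delay, limiter):
    # The semaphore caps how many "queries" run at the same time.
    async with limiter:
        await asyncio.sleep(delay)  # stand-in for an awaited DB query
        return name, f"records for {name}"


async def read_all(sources, max_concurrent=4):
    limiter = asyncio.Semaphore(max_concurrent)
    tasks = [read_source(name, delay, limiter) for name, delay in sources]
    # gather() returns results in the order the tasks were passed in.
    return dict(await asyncio.gather(*tasks))


results = asyncio.run(
    read_all([("source_a", 0.02), ("source_b", 0.01)], max_concurrent=2)
)
```

Whether the real readers can be awaited like this depends on the research mentioned above.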

multiprocessing / threading

A well-known way of implementing iterative parallel tasks. We could easily implement that. The main disadvantage here is the forking. Threads in one solution, or processes in the other, may put much more load onto the bare-metal server in the end. So we should discuss how we can avoid brute-forcing either the database or our servers. In my opinion we could avoid that with some additional configuration where one can set the number of threads or processes allowed.
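
As an illustration of the configurable limit, a minimal sketch with `concurrent.futures` from the standard library. `read_theme` and the theme codes are hypothetical placeholders for the per-source queries, and `max_workers` is the value that would come from the configuration.

```python
from concurrent.futures import ThreadPoolExecutor


def read_theme(theme_code):
    # Stand-in for querying one source; returns the theme and its records.
    return theme_code, [f"{theme_code}-restriction"]


def read_themes(theme_codes, max_workers=4):
    # max_workers bounds the number of threads, and with it the number of
    # queries that can hit the database at the same time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, regardless of finish order.
        return dict(pool.map(read_theme, theme_codes))


plr_by_theme = read_themes(["theme_a", "theme_b", "theme_c"], max_workers=2)
```

Swapping in `ProcessPoolExecutor` would give the multiprocessing variant with the same interface, at the cost of the forking overhead mentioned above.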

SQLAlchemy Session Management

A thing we also need to research is the way we currently implement our session sharing: https://github.com/openoereb/pyramid_oereb/blob/master/pyramid_oereb/core/adapter.py#L12-L73

It is a home-made way to keep long-running servers from accumulating too many open DB sessions. Currently I am not aware of the influence that would have in a parallel context, neither for threading, nor for multiprocessing, nor for asyncio.

voisardf commented 2 months ago

@vvmruder Thanks for the work, we will study and discuss the point in the PSC @michmuel @svamaa

voisardf commented 1 month ago

@vvmruder After discussion in the PSC, could you provide a time estimate for the changes necessary to realise the 'Initialising things once' part above? On our side we will check with the user group whether everybody uses only the standard and interlis source configurations.

michmuel commented 1 month ago

@vvmruder, @svamaa, @voisardf We had some more discussion in the PSC concerning the task "initialising things once". It is important for us that routine operations such as updating data of particular themes or updating real estate information can be performed without a server restart. However, changes in configuration such as the change of the data source of a topic (database/database schema) may require a server restart.
