Possibly poor performance compared to pyramid_oereb 1.8.0

peterschaer commented 2 years ago

In our tests we are observing a noticeable performance difference between our current productive environment (based on pyramid_oereb 1.8.0) and the new version based on pyramid_oereb 2.0.0b14. One may observe the difference here:

pyramid_oereb 1.8: https://www.oereb-test.apps.be.ch/extract/reduced/json/CH173510804618 pyramid_oereb 2.0: https://www.oereb2-test.apps.be.ch/extract/json?egrid=CH173510804618 (the DNS entry is not public yet, you can use https://159.144.217.124/extract/json?egrid=CH173510804618 instead but you have to ignore the security warnings about the certificate)

Both instances are running in the very same kubernetes cluster in our data center. They are connecting to the same PostgreSQL 12 database (but different schemas because of the changes made to the transferstruktur data model). The data is the same (the old transferstruktur was transformed into the new one). With pyramid_oereb 1.8 the JSON extract is created in under one second (700-800ms), with pyramid_oereb 2.0 the JSON extract is created in 3.5-4 seconds!

I did some researching and profiling (with debug-toolbar und print-Statements). I have the impression that the difference has to do with the database queries. E.g. just creating the processor https://github.com/openoereb/pyramid_oereb/blob/e291d8a5adccf6e7aa61b58d60f66f4d1e655d5c/pyramid_oereb/core/processor.py#L321 takes about 2.5 seconds. The slow part ist querying the availabilities https://github.com/openoereb/pyramid_oereb/blob/e291d8a5adccf6e7aa61b58d60f66f4d1e655d5c/pyramid_oereb/contrib/data_sources/standard/sources/plr.py#L121 But this is a very simple query returning 338 rows in our case. I don't see anything to optimize here. This query takes about 0.07s. But it is done 31 times = 2.1s. But this part of the code hasn't changed from pyramid_oereb 1.8 and 2.0. The old version is also querying the availabilities of all 31 topics. But it does it much quicker. I suppose there's some overhead on each query and with more than 200 queries for an extract this adds up. But I can't narrow this down. May be there is some change is one of the used libraries or it has to do with the python version (pyramid_oereb 1.8 is running on Python 3.6, pyramid_oereb 2.0 is running on Python 3.7.

I think I can exclude the network as the source of error. Because when running both the application and the database as docker containers on my development environment I get the same results. In this setup there is no network between application and database.

I'm using the standard model for all 31 topics we have.

I cannot explain this difference therefore I'm posting this here. Has anybody during their tests observed something similar?

vvmruder commented 2 years ago

@peterschaer I will have a look. We changed behavior a little regarding server start up and other things. This might be related.

vvmruder commented 2 years ago

From my fist check I see: You are right we did not change this for long. We initialize availability on every request because than a administrator is able to switch of a municipality on the fly at runtime. We could load the availability on server start. We do it now for several things (Glossary, Disclaimers, etc.). But this would mean availability cant be changed without server restart. Otherwise I do not see any reason why this might be slower. Still investigating.... @michmuel @voisardf Do you experience similar performance decrease?

peterschaer commented 2 years ago

Just an update from my side: I upgraded our application to pyramid_oereb 2.0.0rc1 and Python 3.8 but unfortunately there was no improvement in performance. So the issue is still bothering me. Has anyone encountered a similar performance decrease with pyramid_oereb v2 compared to v1.x?

michmuel commented 2 years ago

We run Oereb V2 in another environment (OpenShift) than Oereb V1. Therefore, we do not know exactly the difference in performance between the two versions that is related to pyramid_oereb. We are optimizing the OpenShift settings in terms of performance and stability in these days. At the moment Oereb V2 is certainly not running faster than V1.

@vvmruder and others: Is it possible that using a newer version of sqlalchemy had an effect on performance?

jwkaltz commented 2 years ago

@michmuel it's not impossible that library upgrades have an influence, but I think it's much more likely that it's due to the V2 code changes. Profiling is needed to determine that.

vvmruder commented 2 years ago

@peterschaer I had a deeper look on this. But still can't find the real bottle neck. Would it be possible to get some date from you to have a real world use case?

vvmruder commented 2 years ago

@michmuel & @jwkaltz I think there is a issue in pyramid_oereb. Can't find it right now. I would suspect external libraries as last option too :)

peterschaer commented 2 years ago

@peterschaer I had a deeper look on this. But still can't find the real bottle neck. Would it be possible to get some date from you to have a real world use case?

Of course, in which form/way should I transmit it to you? You can get our application here (branch v3): https://github.com/AGIBE/oereb_server/tree/v3 or as a docker image here docker pull ghcr.io/agibe/oereb_server:3.0.0rc1

peterschaer commented 2 years ago

I did some profiling on my own. I follow the instructions in the SQLAlchemy Docs and profiled the availability query in https://github.com/openoereb/pyramid_oereb/blob/af0e124f1fa20d6cca5df8e505da2346f58c8df0/pyramid_oereb/contrib/data_sources/standard/sources/plr.py#L122

this query is executed 31 times (once for each topic). On my development machine it takes 0.063s. 0.063s x 31 = 1.953s. That means this initial step alone takes two seconds to execute. All the extract processing and querying is still to follow.

this is the profile output of one single availability query. I don't include the others, they all look the same: availability_profile_2_0_0rc1.txt at the end is the summary: 25713 function calls (24306 primitive calls) in 0.063 seconds

I did the same thing on the same machine with the old version of our application (based on pyramid_oereb 1.8.0). I profile the availability query in: https://github.com/openoereb/pyramid_oereb/blob/6666f02c7825a172761d872b8d1017d54087b40a/pyramid_oereb/standard/sources/plr.py#L74

this query is also executed 31 times (once for each topic). But here it takes only 0.005s. 0.005x x 31 = 0.155s which is acceptable.

This is the profile output of one single availability query. availability_profile_1_8_0.txt the summary: 851 function calls (826 primitive calls) in 0.005 seconds => much quicker, much less function calls

I still have no idea why this is like it is. But maybe somebody has an idea.

vvmruder commented 2 years ago

Great @peterschaer Thank you :) This would mean-there is a change in SQL-Alchemy which decreases performance indeed (as @michmuel suggested).

We could work around this as already discussed by quering the availabilities only on server start up.

But now I'am concerned how this SQL-Alchemy thing affects pyramid_oereb.

peterschaer commented 2 years ago

One possible lead may be the Caching mechanism in SQLAlchemy (introduced in v1.4). This cache does not cache the data but only the SQLAlchemy structures belonging to a query. Whether a statement was cached or not can be analysed in the logs (and in the debugtoolbar) when activating SQLAlchemy logging in https://github.com/openoereb/pyramid_oereb/blob/af0e124f1fa20d6cca5df8e505da2346f58c8df0/pyramid_oereb/core/adapter.py#L32 by setting echo=True

for an extract in our application 130 SQL statements are issued to the database. There are:

49 cached statements ([cached since ...]) => that's ok. Those statements are using the cache. 13 of them are SELECT 1 statements from the health_check.
13 uncached statemens ([no key ...]) => these are mostly statements using a PostGIS-function (especially ST_GeomFromWKB). They probably don't implement the caching mechanism.
68 generated statements ([generated in ...]) => those statements generate an entry in the cache. But this entry is never used. On each new request those statements are reentered in the cache. They never use the cache. The SQL statements for the availability fall into this category. That seems weird to as these statements are far from being complex.

This is the full SQLAlchemy log for a single extract: sqlalchemy_cache_log.pdf Maybe this can help.

peterschaer commented 2 years ago

I also did a profiling of the SELECT 1 statements in the health check in https://github.com/openoereb/pyramid_oereb/blob/af0e124f1fa20d6cca5df8e505da2346f58c8df0/pyramid_oereb/core/sources/__init__.py#L83

the first execution of this statement has a summary of 2827 function calls (2751 primitive calls) in 0.064 seconds. This execution was surely uncached as it was the first after server startup. All subsequent executions of SELECT 1 have a summary of 246 function calls in 0.006 seconds. Those executions will have used the cache. Using the cache is certainly beneficial or: not using the cache is bad for the performance.

vvmruder commented 2 years ago

@peterschaer could you maybe share some numbers here again to have a comparison what is the gain of improvement with the changes in #1540?

jwkaltz commented 2 years ago

@peterschaer additional question / idea, You mentioned your production system is running pyramid_oereb 1.8.0, could you try to compare that a newer pyramid_oereb 1 version? Because the upgrade from sqlalchemy 1.3 to 1.4 happened just after pyramid_oereb 1.8.0. If you already experience the performance deterioration in the V1 system, then we can pretty sure it is really sqlalchemy (or the way pyramid_oereb uses it) the culprit, and not any code changes in V2.

peterschaer commented 2 years ago

@vvmruder I will share some performance numbers tomorrow (today I had to take care of other issues). And after that I will try the idea raised by @jwkaltz and measure the performance with pyramid_oereb 1.9.2.

peterschaer commented 2 years ago

@jwkaltz Would it be possible to make a new release (2.0.1 or so) after merging #1544 ? This way it would be much easier for me to make a new release on our infrastructure and to measure reliably the performance gains from #1544 and #1540

jwkaltz commented 2 years ago

@peterschaer a release 2.0.1 is now available https://pypi.org/project/pyramid-oereb/2.0.1/ There is a problem with the documentation generation, so this part is not up-to-date (we have a dedicated ticket for that). But you can see the changes if you look directly in the file CHANGES.rst

peterschaer commented 2 years ago

In my original post I mentionned that creating a JSON extract for my reference parcel (EGRID CH173510804618) takes 3.5-4 seconds with pyramid_oereb 2.0.0b14 and 0.7-0.8 seconds with pyramid_oereb 1.8.0. After all the improvements (especially #1544 and #1540) the processing time is now down to about 1.8s with pyramid_oereb 2.0.1! This is certainly a big improvement.

There is still a certain performance gap between 1.8.0 and 2.0.1 and that leaves the question whether this issue should be closed or continued. Maybe we can discuss this next week on the occasion of our user group meeting?

jwkaltz commented 2 years ago

@peterschaer could you try on your existing system with 1.8.1 or later, which includes the sqlachemy update? (See comment from April 6th for more details). If you already experience the performance degradation with 1.8.1, then we know the cause is the sqlalchemy update.

peterschaer commented 2 years ago

Yes, I was able to do this. I had to take version 1.9.2, because the update to SQLAlchemy 1.4. was made only there. Before that, 1.3. was always used. With this constellation I do not experience any performance degradation!! The processing time is the same as with 1.8.0.

In an earlier post I examined the SQLAlchemy Cache mechanism by looking at the SQLAlchemy Logs (https://github.com/openoereb/pyramid_oereb/issues/1508#issuecomment-1077588821). For 2.0.0 I got:

49 cached statements ([cached since ...])
13 uncached statemens ([no key ...])
68 generated statements ([generated in ...])

I did the same for 1.9.2 and got the following results:
155 cached statements ([cached since ...])
13 uncached statemens ([no key ...])
0 (!) generated statements ([generated in ...])

I redid the whole thing for 2.0.1 and got:

43 cached statements ([cached since ...])
13 uncached statemens ([no key ...])
38 generated statements ([generated in ...])

sqlalchemy_cache_log_192.pdf sqlalchemy_cache_log_201.pdf

So let's ignore the 13 ST_GeomFromWKB-Statements as they show the same behaviour in all versions. In 1.9.2 all other statements are using the cache. In 2.0.0 there are 68 statements that are not using the cache, in 2.0.1 this number has decreased to 38x thanks to #1540 and #1544). But there are still 38 uncached statements. My current guess is that if we can cause these statements to use the cache the performance would be about the same as in 1.9.2. But that's just a guess.

remyguillaume commented 2 years ago

Hello,

I think there is a performance problem due to the default lazy loading of some objects. Too many SQL queries are made, when one unique big query with joins would be a lot more efficient.

I've prepared a pull request for this : https://github.com/openoereb/pyramid_oereb/pull/1574

Could you please review it, and merge it if everything looks good for you? Thanks.

vvmruder commented 1 year ago

@svamaa please ask in the poll

lopo977 commented 1 year ago

We have still performance issues in the new implementation (2.0). Previous version in extrating (xml/json) was faster in 1.8.x. (tessin)

peterschaer commented 1 year ago

From our point of view, the performance of smaller plots is somewhat slower than in v1.8, while larger plots are significantly faster than before. Therefore, the performance is acceptable for us. However, if someone still has an idea to improve the performance, we are happy to help (testing, measuring etc.).

michmuel commented 1 year ago

Performance is ok for us (BL). However, good to hear if someone has an idea to increase the performance with little effort.

peterschaer commented 1 year ago

The performance of a GetEGRID-request may be improved (see #1680)

voisardf commented 11 months ago

@svamaa Get latest situation update from user group and create a new ticket if needed

openoereb / pyramid_oereb

Possibly poor performance compared to pyramid_oereb 1.8.0 #1508