ude-soco / CourseMapper-webserver

A collaborative course annotation and analytics platform
https://coursemapper.de
MIT License

Poor PostgreSQL performance in production #1212

Closed rawaa123 closed 3 weeks ago

rawaa123 commented 4 weeks ago

@ralf-berger The worker for concept-map in production sc-coursemapper-webserver-coursemapper-kg-concept-map-86d9zffth receives the job for KG generation but is still processing it since morning. Apparently there are no errors, but the job does not takes this long usually 2 min. Its kind of stuck and not proceeding any further. When I tried the same on edge, its working fine and efficiently.

I sent the job to generate the KG this morning at 10:43. I tried with multiple requests afterwards, but the new requests are not received by the worker because it is still processing the first one, even though, as I can see in the logs, nothing is actually going on. (screenshot)

This is how it is working on edge: (screenshot)

Can you please figure out whether something is wrong with the production environment that hinders KG generation? Moreover, could you please restart the service sc-coursemapper-webserver-coursemapper-kg-concept-map-86d9zffth? One more thing: is it possible for me to restart the service, or should it always be done by you? If I can do it, could you please direct me on how?

ralf-berger commented 4 weeks ago

Not sure what you're using for the worker queues, but you should probably check how to configure a reasonable timeout duration so that jobs don't get stuck forever. E.g. for Celery: https://docs.celeryq.dev/en/stable/userguide/workers.html#time-limits
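For example, if it does turn out to be Celery, a minimal sketch (the app name and broker URL are made up, not the project's actual configuration):

```python
# Sketch only: global time limits for a Celery worker. The app name and broker
# URL are illustrative, not the project's real configuration.
from celery import Celery

app = Celery("concept_map_worker", broker="redis://redis:6379/0")

# Hard-kill any task that runs longer than 10 minutes; raise SoftTimeLimitExceeded
# inside the task at 8 minutes so it can log its state and clean up first.
app.conf.task_time_limit = 600
app.conf.task_soft_time_limit = 480
```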

Ideally, failed jobs should be retried only a couple of times. And critical issues (out of memory, db connection lost) should still terminate the worker process, so that it can be restarted.
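Continuing the hypothetical Celery sketch, bounded retries could look roughly like this (task name and exception choices are assumptions):

```python
# Sketch: retry transient failures a few times; let critical errors propagate so
# the worker process dies and is restarted. Exception choices are illustrative.
from celery import shared_task
from celery.exceptions import SoftTimeLimitExceeded

@shared_task(bind=True, max_retries=3)
def generate_concept_map(self, material_id):
    try:
        ...  # extract concepts, query the Wikipedia dump, build the KG
    except (ConnectionError, SoftTimeLimitExceeded) as exc:
        # Transient: retry after 30 s, at most max_retries times.
        raise self.retry(exc=exc, countdown=30)
    except MemoryError:
        # Critical: re-raise so the process terminates and gets restarted.
        raise
```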

Also, if that doesn't increase memory usage significantly, you might want to check if you can increase the number of jobs handled in parallel.

One more thing: is it possible for me to restart the service, or should it always be done by you?

Well, services are restarted automatically whenever the container image or the configuration is updated, or whenever the process terminates itself for whatever reason or is killed, e.g. because a memory leak makes it hit its memory limit or its health check fails. So there shouldn't really be any reason to ever terminate any processes manually.

rawaa123 commented 4 weeks ago

Ideally, failed jobs should be retried only a couple of times. And critical issues (out of memory, db connection lost) should still terminate the worker process, so that it can be restarted.

You are right, this is implemented like this.

What we are not able to figure out is why it gets stuck at one point. It works completely fine on edge, so how is production different from edge that it does not handle exactly the same task in the same way? Is there any inconsistency or difference in configuration, etc.? What could be the possible reason for this, and how can we handle it?

ralf-berger commented 4 weeks ago

What we are not able to figure out is why it gets stuck at one point.

If the timeout fires and the job fails with a stack trace, you should be able to get some idea of what's happening by looking at the logs.

What could be the possible reason for this, and how can we handle it?

I think we should make the service somewhat resilient (timeouts, traces, health checks, …) and find out whether this was an edge case (related to connectivity, input data, etc.) or whether it happens regularly.

Is there any inconsistency or difference in configuration, etc.?

Not in any way that seems relevant. The difference in the running Pods is literally just image versions, labels, timestamps and unique ids.

Diff

```diff
1,2c1,2
< Name:            sc-coursemapper-webserver-coursemapper-kg-concept-map-f598q9gth
< Namespace:       cmw-edge
---
> Name:            sc-coursemapper-webserver-coursemapper-kg-concept-map-86d9zffth
> Namespace:       cmw-prod
6c6
< Start Time:      Wed, 23 Oct 2024 16:29:33 +0200
---
> Start Time:      Thu, 17 Oct 2024 14:45:23 +0200
9d8
<                  branch=edge
11,12c10,11
<                  pod-template-hash=f598b8545
<                  stage=preview
---
>                  pod-template-hash=86d986f954
>                  stage=prod
15c14
< IP:              10.42.0.109
---
> IP:              10.42.0.67
17,18c16,17
<   IP:            10.42.0.109
< Controlled By:   ReplicaSet/sc-coursemapper-webserver-coursemapper-kg-concept-map-f598b8545
---
>   IP:            10.42.0.67
> Controlled By:   ReplicaSet/sc-coursemapper-webserver-coursemapper-kg-concept-map-86d986f954
21,23c20,22
<     Container ID:  containerd://a8401648c48bf8a61609c7f7f190bec7c62e2e19ea3503d97d32221195270213
<     Image:         ghcr.io/ude-soco/coursemapper-webserver-coursemapper-kg-concept-map@sha256:08531dbbaeb8134b85bbdd57f03187994b42657da5b81e0e62192649fac90d2b
<     Image ID:      ghcr.io/ude-soco/coursemapper-webserver-coursemapper-kg-concept-map@sha256:08531dbbaeb8134b85bbdd57f03187994b42657da5b81e0e62192649fac90d2b
---
>     Container ID:  containerd://5b69209acf273708dc7e4c25bf50f0a0647d8ccae4686275f6472e7c68885a79
>     Image:         ghcr.io/ude-soco/coursemapper-webserver-coursemapper-kg-concept-map:4.0.0
>     Image ID:      ghcr.io/ude-soco/coursemapper-webserver-coursemapper-kg-concept-map@sha256:2c17f5f0d6b00a35201fd3454e326e4a266862613f2c850ecf402651395ab09b
27c26
<       Started:     Wed, 23 Oct 2024 16:29:33 +0200
---
>       Started:     Thu, 17 Oct 2024 14:51:12 +0200
48c47
<       /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lfmw5 (ro)
---
>       /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vgw75 (ro)
56c55
<   kube-api-access-lfmw5:
---
>   kube-api-access-vgw75:
```

rawaa123 commented 4 weeks ago

@jeanqussa Could you please look at this issue?

rawaa123 commented 4 weeks ago

@ralf-berger I can see that the process is working in production, but it is taking hours to complete, while in the edge environment the same task takes only minutes. We have proper error handling in place; if a timeout occurs, it will be logged and we will attempt to reassign the task. If it fails again, this will also be recorded in the logs. Currently, the main issue is the time taken to extract the concepts: hours in production versus minutes on edge.

Why is the production environment so slow while the edge environment performs much faster? Do you have any suggestions? (screenshot)

rawaa123 commented 4 weeks ago

@ralf-berger Following up on the above: the process doesn't get stuck in production, but it takes significantly longer, hours instead of minutes, unlike in the edge environment. It regularly sends notifications each time it extracts concepts from a slide, along with the number of extracted concepts, as you can see in the logs. This is the same task processed on prod; it takes around 3.25 hours: (screenshot) The same task processed on edge takes around 3 minutes: (screenshot)

Is there a difference in the resources used between the edge server and the production server?

ralf-berger commented 4 weeks ago

All workloads are actually running on a single machine and both instances have the same resource limits available. For some reason, coursemapper-kg-wp-pg in production is maxing out its current compute limit of 1 CPU core though. Seems like PostgreSQL is very busy doing … something.

rawaa123 commented 4 weeks ago

My understanding is the following: since the application is fully utilizing the available CPU, any additional tasks or demands will experience delays or may get queued, resulting in slower processing times (e.g., concept extraction taking hours). Could we consider the following suggestion: increase the CPU allocation by scaling up the resources available to coursemapper-kg-wp-pg in production and allocating more CPU cores?

Seems like PostgreSQL is very busy doing … something.

What is this "… something"? How can we figure out what it is doing?

ralf-berger commented 4 weeks ago

Each service has its own CPU limit. PostgreSQL seemed stuck using the equivalent of one CPU to 100 %, while the rest of the application appeared more or less idle. Are you running anything particularly complex (WHERE on unindexed columns, sub-queries, …) that would require a lot of computation from the database engine?

I would not give PostgreSQL more CPU, unless you're sure that it actually does need it.

How can we figure out what [PostgreSQL] is doing?

Maybe logging (slow) query timing on the side of the db and/or shorter timeouts on the worker side?
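For example, something along these lines could show what the db is busy with and turn on slow-query logging (a rough sketch; the DSN is made up, and ALTER SYSTEM needs superuser privileges):

```python
# Rough sketch: inspect active queries and enable slow-query logging.
# The DSN is illustrative; ALTER SYSTEM requires superuser privileges.
import psycopg

with psycopg.connect(
    "postgresql://postgres@coursemapper-kg-wp-pg:5432/postgres", autocommit=True
) as conn:
    # What is the server busy with right now?
    for row in conn.execute(
        "SELECT pid, now() - query_start AS runtime, state, query "
        "FROM pg_stat_activity WHERE state <> 'idle' "
        "ORDER BY runtime DESC NULLS LAST"
    ):
        print(row)

    # Log every statement that takes longer than one second.
    conn.execute("ALTER SYSTEM SET log_min_duration_statement = '1s'")
    conn.execute("SELECT pg_reload_conf()")
```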

Did you observe the behaviour just this once? I'll try to adjust the health checks to check availability more often and terminate PostgreSQL sooner. Could you please create some load by processing multiple jobs?

rawaa123 commented 4 weeks ago

I am wondering how the same image, using the same resources and configuration, behaves differently on a single machine. I have no experience with this; could you please explain it to me, if you have an idea?

Each service has its own CPU limit. PostgreSQL seemed stuck using the equivalent of one CPU to 100 %, while the rest of the application appeared more or less idle. Are you running anything particularly complex (WHERE on unindexed columns, sub-queries, …) that would require a lot of computation from the database engine?

Currently, we are facing an issue while trying to connect to the wikipedia_service to extract concepts. Retrieving the concepts from wikipedia_service takes hours on the live environment, whereas on the edge environment, it only takes minutes.

Previously, we encountered an issue with the wikipedia_service where the port was closed and terminated by the admin... etc.

We are running the same queries on the database in both the edge and live environments. However, the performance discrepancy persists. Could this be related to a port issue? Perhaps changing the port could help? (just a suggestion)

Also, are we using PostgreSQL only in the live environment but not in the edge environment?

Did you observe the behaviour just this once? I'll try to adjust the health checks to check availability more often and terminate PostgreSQL sooner. Could you please create some load by processing multiple jobs?

Yes, we observed this just this once, and only in production, not on edge or locally. I will try to request the creation of some KGs, because without them we can't request recommendations.

As a follow-up: it received the job correctly, but the connection to the wiki service to annotate the concepts is taking a very long time.

ralf-berger commented 4 weeks ago

I am wondering how the same image, using the same resources and configuration, behaves differently on a single machine.

Things I can think of:

  • Different driver version on the client side
  • Different input data
  • Different db data (e. g. indices missing)

Do you think it makes sense to try removing the downloaded Wikipedia data and see if a re-import helps?

Previously, we encountered an issue with the wikipedia_service where the port was closed and terminated by the admin... etc.

That sounds like a client error saying that the connection was dropped? That should simply cause the worker to terminate and retry the task once the DB is back.

Also, are we using PostgreSQL only in the live environment but not in the edge environment?

No, the workloads are identical. coursemapper-kg-wp-pg consists of an init container for data import and once it's finished, the data is served by a stock postgres:16 container.

rawaa123 commented 4 weeks ago

Things I can think of:

  • Different driver version on the client side
  • Different input data
  • Different db data (e. g. indices missing)

1. Do we have two driver versions, one for edge and one for production? If yes, could we use the same version as on edge?
2. If by input data you mean the materials that we annotate, then yes, they are identical.
3. There is only one db wiki dump, and we connect to it through an API. Since we are connecting smoothly on edge, I don't think the problem is in the dump, but rather in the connection to this dump, and only in production.

One more thing: could we give PostgreSQL more CPU and see how it behaves? Maybe this will speed up the process, because we really need to generate the KG in production.

Do you think it makes sense to try removing the downloaded Wikipedia data and see if a re-import helps?

Will this speed up the process? If so, then we can try it.

ralf-berger commented 4 weeks ago

  1. Is psycopg the driver? I see version 3.2.1 in main and the same in v4.0.0, so that's probably not it.
  2. Yes, but is there a possibility that, besides importing the data, PostgreSQL has to do e.g. some asynchronous indexing that might be interrupted when the init container quits (see the sketch below for a quick check)? I think we should give wiping the data a try.
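For example, a rough sketch to look for indexes left invalid by an interrupted build (the DSN is made up, not the actual service configuration):

```python
# Sketch: list indexes marked invalid, e.g. left behind by an interrupted
# CREATE INDEX CONCURRENTLY during the import. The DSN is illustrative.
import psycopg

with psycopg.connect("postgresql://postgres@coursemapper-kg-wp-pg:5432/postgres") as conn:
    rows = conn.execute(
        "SELECT n.nspname, c.relname "
        "FROM pg_index i "
        "JOIN pg_class c ON c.oid = i.indexrelid "
        "JOIN pg_namespace n ON n.oid = c.relnamespace "
        "WHERE NOT i.indisvalid"
    ).fetchall()
    print(rows or "no invalid indexes found")
```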

I increased it to two full CPU cores yesterday and the process is now continuously using 100% of that. 🤯

rawaa123 commented 4 weeks ago

I increased it to two full CPU cores yesterday and the process is now continuously using 100% of that.

The same problem: it is still taking hours to extract the concepts from the wiki.

I was able to generate the KG for a material after a very long time and tried the recommendation; it works as expected.

2. I think we should give wiping the data a try.

If you mean the dump, please go ahead. I hope this solves the issue.

ralf-berger commented 4 weeks ago

Done, please check if anything changed.

ralf-berger commented 4 weeks ago

Looks the same.

rawaa123 commented 4 weeks ago

Unfortunately, it is not working.

ralf-berger commented 4 weeks ago

I'm giving it 5 CPU cores, hoping this will not compromise the reliability of any of the other apps.

rawaa123 commented 4 weeks ago

Thank you, but it is still the same. 😢

rawaa123 commented 3 weeks ago

I have received this message in sc-coursemapper-webserver-coursemapper-kg-concept-map-86d9jnj62:

(screenshot)

Does this mean we are not able to generate the KG for now and need to wait some time?

rawaa123 commented 3 weeks ago

@ralf-berger I would like to ask you to swap the connections, please. Keep the connection that we are using to connect from the concept-map service to the dump on edge for production, and vice versa, for now; I will look into this issue later. Would this be possible?

ralf-berger commented 3 weeks ago

@ralf-berger I would like to ask you to swap the connections, please. Keep the connection that we are using to connect from the concept-map service to the dump on edge for production, and vice versa, for now; I will look into this issue later. Would this be possible?

I don't understand?

rawaa123 commented 3 weeks ago

Could you confirm if the same database connection configuration "The connection to server at 10.43.171.86, port 5432" is being used in both the edge and production environments to access the Wikipedia dump?

ralf-berger commented 3 weeks ago

Could you confirm if the same database connection configuration "The connection to server at 10.43.171.86, port 5432" is being used in both the edge and production environments to access the Wikipedia dump?

No, those are two independent environments running separate workloads, available under separate Services in different Namespaces.

rawaa123 commented 3 weeks ago

I understand that edge and production are separate, independent environments, each with its own database connection configuration. Is it possible to switch the database connection configurations between these environments? Specifically, we could use the current edge connection at 10.43.171.86, port 5432 in production and move the existing production configuration to edge. This may serve as a potential solution to ensure functionality in the production environment. Does this approach make sense?

ralf-berger commented 3 weeks ago

I'm sorry, but that doesn't work at all. Those service addresses aren't even static; they are resolved at runtime, depending on the namespace.

ralf-berger commented 3 weeks ago

Manually created missing index:

```sql
CREATE INDEX page_categories_page_title ON page_categories (page_title);
```
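A quick sketch to verify the planner actually picks it up (the sample query and DSN are illustrative, not the service's real code):

```python
# Sketch: confirm the new index is used instead of a sequential scan.
# The DSN and the sample title are illustrative.
import psycopg

with psycopg.connect("postgresql://postgres@coursemapper-kg-wp-pg:5432/postgres") as conn:
    plan = conn.execute(
        "EXPLAIN ANALYZE SELECT * FROM page_categories WHERE page_title = %s",
        ("Machine_learning",),
    ).fetchall()
    for (line,) in plan:
        print(line)  # expect an index scan on page_categories_page_title
```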

@rawaa123 Is there any noticeable difference in query performance?

rawaa123 commented 3 weeks ago

Is there any noticeable difference in query performance?

Will check and get back to you.

rawaa123 commented 3 weeks ago

This is what I got. I am not able to annotate. (screenshot)

ralf-berger commented 3 weeks ago

That seems to be an unrelated issue, dbpedia-spotlight-en getting killed due to lack of memory. Just a sec …

ralf-berger commented 3 weeks ago

I don't see the worker retrying the previously failed job.

rawaa123 commented 3 weeks ago

I don't see the worker retrying the previously failed job.

Now you can check.

rawaa123 commented 3 weeks ago

Same as before.

(screenshot)

rawaa123 commented 3 weeks ago

@ralf-berger Congratulations, I can see that it is working right now. Thank you so much for your help. Is this a temporary solution?

ralf-berger commented 3 weeks ago

It is a temporary fix for the current production environment, in the sense that we don't know why the index creation failed and whether it might occur again in the future, e.g. when the dump is updated with new data. I also didn't attempt to create all missing db indices (see #1224), just the one that was causing the performance issue.