Not sure what you're using for the worker queues, but you should probably check how to configure a reasonable timeout duration so that jobs don't get stuck forever. E.g. for Celery: https://docs.celeryq.dev/en/stable/userguide/workers.html#time-limits
Ideally, failed jobs should be retried only a couple of times. And critical issues (out of memory, db connection lost) should still terminate the worker process, so that it can be restarted.
Also, if that doesn't increase memory usage significantly, you might want to check if you can increase the number of jobs handled in parallel.
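For illustration, a minimal sketch of what that could look like if the workers are Celery-based (the task name, the limits and the `run_extraction` helper below are made up for the example, not taken from the actual codebase):

```python
from celery import Celery

app = Celery("coursemapper_kg")        # hypothetical app/broker setup
app.conf.worker_concurrency = 4        # more parallel jobs, if memory allows

class TransientError(Exception):
    """Placeholder for errors worth retrying (e.g. dropped connections)."""

@app.task(
    autoretry_for=(TransientError,),   # retry only transient failures
    max_retries=3,                     # give up after a few attempts
    retry_backoff=True,                # exponential backoff between retries
    soft_time_limit=300,               # SoftTimeLimitExceeded after 5 minutes
    time_limit=360,                    # hard kill after 6 minutes
)
def extract_concepts(material_id):
    return run_extraction(material_id)  # placeholder for the actual work
```

With hard time limits in place, a hung job only kills that one task instead of blocking the worker forever, and the bounded retry count keeps a poisoned job from looping indefinitely.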
One more thing, is it possible that I restart the service or should it always be done by you?
Well, services are restarted automatically whenever the container image or the configuration is updated, or whenever the process terminates itself for whatever reason or is killed, e.g. due to a memory leak making it hit its memory limit or its healthcheck failing. So there shouldn't really be any reason to ever terminate any processes manually.
Ideally, failed jobs should be retried only a couple of times. And critical issues (out of memory, db connection lost) should still terminate the worker process, so that it can be restarted.
You are right, this is how it is implemented.
What we are not able to figure out is why it gets stuck at one point. The thing is, it's working completely fine on edge, so how is production different from edge such that it does not handle exactly the same thing in the same way as on edge? Is there any inconsistency or difference in configuration, etc.? What could be the possible reason for this and how can we handle it?
What we are not able to figure out is why it gets stuck at one point.
If a timeout causes the job to fail with a stack trace, you should be able to get some idea of what's happening by looking at the logs.
What could be the possible reason for this and how can we handle it?
I think we should make the service somewhat resilient (timeouts, traces, health checks, …) and find out if this was an edge case, related to connectivity, input data, etc., or if it happens regularly.
Is there any inconsistency or difference in configuration, etc.?
Not in any way that seems relevant. The difference in the running Pods is literally just image versions, labels, timestamps and unique ids.
```diff
1,2c1,2
< Name: sc-coursemapper-webserver-coursemapper-kg-concept-map-f598q9gth
< Namespace: cmw-edge
---
> Name: sc-coursemapper-webserver-coursemapper-kg-concept-map-86d9zffth
> Namespace: cmw-prod
6c6
< Start Time: Wed, 23 Oct 2024 16:29:33 +0200
---
> Start Time: Thu, 17 Oct 2024 14:45:23 +0200
9d8
< branch=edge
11,12c10,11
< pod-template-hash=f598b8545
< stage=preview
---
> pod-template-hash=86d986f954
> stage=prod
15c14
< IP: 10.42.0.109
---
> IP: 10.42.0.67
17,18c16,17
< IP: 10.42.0.109
< Controlled By: ReplicaSet/sc-coursemapper-webserver-coursemapper-kg-concept-map-f598b8545
---
> IP: 10.42.0.67
> Controlled By: ReplicaSet/sc-coursemapper-webserver-coursemapper-kg-concept-map-86d986f954
21,23c20,22
< Container ID: containerd://a8401648c48bf8a61609c7f7f190bec7c62e2e19ea3503d97d32221195270213
< Image: ghcr.io/ude-soco/coursemapper-webserver-coursemapper-kg-concept-map@sha256:08531dbbaeb8134b85bbdd57f03187994b42657da5b81e0e62192649fac90d2b
< Image ID: ghcr.io/ude-soco/coursemapper-webserver-coursemapper-kg-concept-map@sha256:08531dbbaeb8134b85bbdd57f03187994b42657da5b81e0e62192649fac90d2b
---
> Container ID: containerd://5b69209acf273708dc7e4c25bf50f0a0647d8ccae4686275f6472e7c68885a79
> Image: ghcr.io/ude-soco/coursemapper-webserver-coursemapper-kg-concept-map:4.0.0
> Image ID: ghcr.io/ude-soco/coursemapper-webserver-coursemapper-kg-concept-map@sha256:2c17f5f0d6b00a35201fd3454e326e4a266862613f2c850ecf402651395ab09b
27c26
< Started: Wed, 23 Oct 2024 16:29:33 +0200
---
> Started: Thu, 17 Oct 2024 14:51:12 +0200
48c47
< /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lfmw5 (ro)
---
> /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vgw75 (ro)
56c55
< kube-api-access-lfmw5:
---
> kube-api-access-vgw75:
```
@jeanqussa Could you please look into this issue?
@ralf-berger I can see that the process is working in production, but it is taking hours to complete, while in the edge environment the same task takes only minutes. We have proper error handling in place; if a timeout occurs, it will be logged, and we will attempt to reassign the task. If it fails again, this will also be recorded in the logs. Currently, the main issue is the time taken to extract the concepts: hours in production versus minutes on edge.
Why is the production environment so slow while the edge environment performs much faster? Do you have any suggestions?
@ralf-berger Following up on the above, the process doesn't get stuck in production but takes significantly longer: hours instead of minutes, as it does in the edge environment. It regularly sends notifications each time it extracts concepts from a slide, along with the number of extracted concepts, as you can see in the logs. This is the same task processed on prod; it takes around 3.25 hours. The same task processed on edge takes around 3 minutes.
Is there a difference in the resources used between the edge server and the production server?
All workloads are actually running on a single machine and both instances have the same resource limits available. For some reason, coursemapper-kg-wp-pg in production is maxing out its current compute limit of 1 CPU core though. Seems like PostgreSQL is very busy doing … something.
I understood the following: since the application is fully utilizing the available CPU, any additional tasks or demands will experience delays or may get queued, resulting in slower processing times (e.g., concept extraction taking hours). Could we consider the following suggestion: increase the CPU allocation by scaling up the resources available to coursemapper-kg-wp-pg in production and allocating more CPU cores?
Seems like PostgreSQL is very busy doing … something.
What is this? How can we figure out what it is doing, this "… something"?
Each service has its own CPU limit. PostgreSQL seemed stuck using the equivalent of one CPU at 100 %, while the rest of the application appeared more or less idle. Are you running anything particularly complex (`WHERE` on unindexed columns, sub-queries, …) that would require a lot of computation from the database engine?
I would not give PostgreSQL more CPU, unless you're sure that it actually does need it.
How can we figure out what [PostgreSQL] is doing?
Maybe logging (slow) query timing on the side of the db and/or shorter timeouts on the worker side?
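As a rough sketch of what that could look like on the PostgreSQL side (the one-second threshold is just an example, not a value used here):

```sql
-- Inspect what the busy backend is currently executing
SELECT pid, state, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC;

-- Log every statement taking longer than one second (takes effect after reload)
ALTER SYSTEM SET log_min_duration_statement = '1s';
SELECT pg_reload_conf();
```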
Did you observe the behaviour just this once? I'll try to adjust the health checks to check availability more often and terminate PostgreSQL sooner. Could you please create some load by processing multiple jobs?
I am wondering how the same image, using the same resources and configuration, behaves differently on a single machine. I have no experience with this; could you please explain this to me, if you have an idea?
Each service has its own CPU limit. PostgreSQL seemed stuck using the equivalent of one CPU at 100 %, while the rest of the application appeared more or less idle. Are you running anything particularly complex (`WHERE` on unindexed columns, sub-queries, …) that would require a lot of computation from the database engine?
Currently, we are facing an issue while trying to connect to the wikipedia_service to extract concepts. Retrieving the concepts from wikipedia_service takes hours on the live environment, whereas on the edge environment, it only takes minutes.
Previously, we encountered an issue with the wikipedia_service where the port was closed and terminated by the admin... etc.
We are running the same queries on the database in both the edge and live environments. However, the performance discrepancy persists. Could this be related to a port issue? Perhaps changing the port could help? (just a suggestion)
Also, are we using PostgreSQL only in the live environment but not in the edge environment?
Did you observe the behaviour just this once? I'll try to adjust the health checks to check availability more often and terminate PostgreSQL sooner. Could you please create some load by processing multiple jobs?
Yes, we observed this just once, and only in production, not on edge or locally. I will try to request the creation of some KGs, because without them we can't request recommendations.
As a follow-up, it received the job correctly, but the connection to the wiki service to annotate the concepts is taking a very long time.
I am wondering how the same image, using the same resources and configuration, behaves differently on a single machine.
Things I can think of:
- Different driver version on the client side
- Different input data
- Different db data (e. g. indices missing)
Do you think it makes sense to try removing the downloaded Wikipedia data and see if a re-import helps?
Previously, we encountered an issue with the wikipedia_service where the port was closed and terminated by the admin... etc.
That sounds like a client error saying that the connection was dropped? That should simply cause the worker to terminate and retry the task once the DB is back.
Also, are we using PostgreSQL only in the live environment but not in the edge environment?
No, the workloads are identical. `coursemapper-kg-wp-pg` consists of an init container for data import, and once it's finished, the data is served by a stock `postgres:16` container.
Things I can think of:
- Different driver version on the client side
- Different input data
- Different db data (e. g. indices missing)
1. Do we have two driver versions for edge and production? If yes, could we use the same version as on edge?
2. If you mean the input data, i.e. the materials that we annotate, then yes, they are identical.
3. There is only one db wiki dump, and we connect to it through an API. If we are connecting smoothly on edge, I don't think the problem is in the dump but rather in the connection to this dump, and only in production.

One more thing: could we give PostgreSQL more CPU and see how it behaves? Maybe this will speed up the process, because we need to generate the KG in production.
Do you think it makes sense to try removing the downloaded Wikipedia data and see if a re-import helps?
Will this speed up the process? If so, then we can try it.
I increased it to two full CPU cores yesterday and the process is now continuously using 100% of that. 🤯
I increased it to two full CPU cores yesterday and the process is now continuously using 100% of that.
The same problem: it is still taking hours to extract concepts from the wiki.
I was able to generate the KG for a material after a very long time and tried the recommendation; it works as expected.
2. I think we should give wiping the data a try.
If you mean the dump, please go ahead. Hope this solves the issue.
Done, please check if anything changed.
Looks the same.
Unfortunately, it is not working.
I'm giving it 5 CPU cores, hoping this will not compromise the reliability of any of the other apps.
Thank you, but still the same😢
I have received this message in sc-coursemapper-webserver-coursemapper-kg-concept-map-86d9jnj62.
Does this mean we are not able to generate the KG for now and we need to wait some time?
@ralf-berger I would like to ask you to swap the connections, please: keep the connection that we are using from the concept map service to the dump on edge for production, and vice versa, for now; I will look into this issue later. Would this be possible?
@ralf-berger I would like to ask you to swap the connections, please: keep the connection that we are using from the concept map service to the dump on edge for production, and vice versa, for now; I will look into this issue later. Would this be possible?
I don't understand?
Could you confirm if the same database connection configuration "The connection to server at 10.43.171.86, port 5432" is being used in both the edge and production environments to access the Wikipedia dump?
Could you confirm if the same database connection configuration "The connection to server at 10.43.171.86, port 5432" is being used in both the edge and production environments to access the Wikipedia dump?
No, those are two independent environments running separate workloads, available under separate Services in different Namespaces.
I understand that edge and production are separate, independent environments, each with its own database connection configuration. Is it possible to switch the database connection configurations between these environments? Specifically, we could use the current edge connection at 10.43.171.86, port 5432 in production and move the existing production configuration to edge. This may serve as a potential solution to ensure functionality in the production environment. Does this approach make sense?
I'm sorry, but that doesn't work at all. Those service addresses aren't even static, but resolved during runtime, depending on the namespace.
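To illustrate: the clients reach the database through the Service's DNS name, which resolves to a different ClusterIP in each namespace, so there is no fixed address that could be swapped. Roughly like this (the service, user and database names below are assumptions, not the real configuration):

```python
# Hypothetical sketch: the same configuration template in both environments,
# but the Service name resolves to a different ClusterIP per namespace.
EDGE_DSN = "postgresql://coursemapper@coursemapper-kg-wp-pg.cmw-edge.svc:5432/wikipedia"
PROD_DSN = "postgresql://coursemapper@coursemapper-kg-wp-pg.cmw-prod.svc:5432/wikipedia"
```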
Manually created missing index:
```sql
CREATE INDEX page_categories_page_title ON page_categories (page_title)
```
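One way to verify the new index is actually picked up (the exact lookup query issued by the concept-map service is an assumption here):

```sql
-- Hypothetical check: if the index is used, the plan should show an
-- Index Scan using page_categories_page_title instead of a Seq Scan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM page_categories
WHERE page_title = 'Machine_learning';
```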
@rawaa123 Is there any noticeable difference in query performance?
Is there any noticeable difference in query performance?
Will check and get back to you.
This is what I got. I am not able to annotate.
That seems to be an unrelated issue, dbpedia-spotlight-en getting killed due to lack of memory. Just a sec …
I don't see the worker retrying the previously failed job.
I don't see the worker retrying the previously failed job.
Now you can check
same as before
@ralf-berger Congratulations, I can see right now that it is working. Thank you so much for your help. Is this a temporary solution?
It is a temporary fix for the current production environment, in the sense that we don't know why the index creation failed and if it might occur again in the future, e.g. when the dump is updated with new data. I also didn't attempt to create all missing db indices (see #1224), just the one that was causing the performance issue.
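If it helps when the dump is updated, the indices that actually exist after an import could be double-checked with something like this (a sketch; there may be more dump tables than `page_categories`):

```sql
-- List the indices present on the imported tables
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public'
ORDER BY tablename, indexname;
```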
@ralf-berger The worker for concept-map in production, `sc-coursemapper-webserver-coursemapper-kg-concept-map-86d9zffth`, received the job for KG generation but has been processing it since this morning. Apparently there are no errors, but the job does not usually take this long (normally around 2 minutes). It is kind of stuck and not proceeding any further. When I tried the same on edge, it works fine and efficiently. I sent the job this morning at 10:43 to generate the KG and tried multiple requests afterwards, but the new requests are not received by the worker, as it is still processing the first one, while actually nothing is going on, as I can see in the logs.
This is how it is working on edge:
Can you please figure out if something is wrong with the production environment that hinders KG generation? Moreover, could you please restart the service `sc-coursemapper-webserver-coursemapper-kg-concept-map-86d9zffth`? One more thing, is it possible that I restart the service or should it always be done by you? If I can do it, then how? Could you please direct me?