packit / deployment

Ansible playbooks and scripts for deploying packit-service to OpenShift
MIT License

Customize shm size in postgres pod #567

Closed majamassarini closed 7 months ago

majamassarini commented 7 months ago

By default the shm size is 64MB, while the dashboard usage page needs around 100MB.

I can reproduce the error described in packit/packit-service#2385 locally using podman compose, and I can fix it by setting shm_size: 128Mi in docker-compose.yml.
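For context, the compose-level fix is just a one-line setting on the postgres service (a minimal sketch; only shm_size comes from this thread, the service name and image are illustrative assumptions):

```yaml
# docker-compose.yml sketch -- only shm_size is taken from this discussion;
# the service name and image are placeholders.
services:
  postgres:
    image: postgres:15        # placeholder image
    shm_size: 128mb           # this thread used "128Mi" with podman compose
```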

I tried deploying these OpenShift changes on my local OpenShift cluster, but even though the new volume is created and mapped, the size didn't change.

df -h /dev/shm always reports 64MB.
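For reference, the OpenShift-side change is roughly of this shape: a memory-backed emptyDir mounted over /dev/shm (a sketch; resource and volume names are assumptions, only the 128Mi size and the /dev/shm target come from this discussion):

```yaml
# Sketch of the OpenShift/k8s change: replace the container's default
# 64MB /dev/shm with a memory-backed emptyDir. Names are illustrative.
spec:
  template:
    spec:
      containers:
        - name: postgres
          volumeMounts:
            - name: postgres-shm      # without this mount the volume has no effect
              mountPath: /dev/shm
      volumes:
        - name: postgres-shm
          emptyDir:
            medium: Memory
            sizeLimit: 128Mi          # only enforced when SizeMemoryBackedVolumes is enabled
```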

I would like to test this on stg, to be sure the problem is not related to my local OpenShift cluster.

But, as far as I understand, there is a SizeMemoryBackedVolumes k8s feature gate which is not enabled in the cluster by default (probably because this is a dangerous feature, see https://github.com/kubernetes/kubernetes/issues/119611). If it is not enabled in the cluster, the resizing cannot take effect.

If this PR does not work on stg, I think we have 2 solutions:

softwarefactory-project-zuul[bot] commented 7 months ago

Build failed. https://softwarefactory-project.io/zuul/t/packit-service/buildset/f1cc474cece943c4bdfd2ea358448fad

✔️ pre-commit SUCCESS in 1m 44s
❌ deployment-tests RETRY_LIMIT in 7m 27s

softwarefactory-project-zuul[bot] commented 7 months ago

Build failed. https://softwarefactory-project.io/zuul/t/packit-service/buildset/e8de33af4b3f469e9ea99bc7984e4b27

✔️ pre-commit SUCCESS in 1m 46s
❌ deployment-tests RETRY_LIMIT in 7m 29s

majamassarini commented 7 months ago

I would probably recommend backing up the stage database and dumping the production one in its place, because I have doubts about the reproducibility of this issue on the "small-scale" contents of the stage database.

Luckily it is really easy to check. I just need to enter the postgres pod with rsh and run df -h /dev/shm. If the size is still 64MB, then we still have the problem, the PR is useless, and we should decide on one of the other two solutions.

majamassarini commented 7 months ago

I would say that one of the options would be increasing the memory (though we don't have much left… and we're just running 2/2 workers; we have one more worker than usual for short-running, but it's still less than what I set up for the redis upgrade a month ago or so).

The problem does not seem to be related to memory in general. My local pods had plenty of memory, and as long as the shared memory was 64MB the exception kept occurring. If we are not able to increase the shared memory, I fear we cannot solve it.

  • redesign db queries in usage page

that would probably be ideal, as it is the only API endpoint causing issues; IMO it would probably be better to go with raw SQL queries instead of the ORM, which adds complexity because of the abstractions involved…

I remember having already worked on raw queries once; at the time we decided to stay with the ORM. But if we have no other solution, this will probably be the easiest way. And for the usage pages we could probably also use views...

majamassarini commented 7 months ago

This PR was not working because the deploy target was not mounting the created volume. If you mount the volume manually, the shm is resized.
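In other words, defining the volume alone is not enough; the container spec also needs the corresponding volumeMounts entry (a sketch; names are assumptions):

```yaml
# The piece the deploy target was missing (names are illustrative):
# the memory-backed volume has to be mounted over /dev/shm in the container spec.
containers:
  - name: postgres
    volumeMounts:
      - name: postgres-shm
        mountPath: /dev/shm
# After redeploying, verify from inside the pod:
#   oc rsh <postgres-pod>
#   df -h /dev/shm    # should now report 128M instead of 64M
```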