digdilem-work opened 1 month ago
Postgres logs can be found inside the container (so after running mgrctl term) in /var/lib/pgsql/data/log/.
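For reference, getting at them looks roughly like this (a sketch; the exact log file names depend on the logging settings, so adjust as needed):

```bash
# on the host: open a shell inside the uyuni-server container
mgrctl term
# inside the container: list the postgres logs, newest first, and tail the latest one
ls -lt /var/lib/pgsql/data/log/
tail -n 100 "$(ls -t /var/lib/pgsql/data/log/* | head -n 1)"
```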
@mbussolotto Thank you. Those logs do have some errors - notably
2024-10-01 07:11:29.980 BST [unknown] [unknown] [29465]LOG: could not accept SSL connection: unknown protocol
2024-10-01 07:11:29.980 BST [unknown] [unknown] [29465]HINT: This may indicate that the client does not support any SSL protocol version between TLSv1.2 and TLSv1.3.
However, those timestamps don't match the issue I'm seeing, which happens when I manually click on any of the Schedules menu items. The Uyuni webui goes unresponsive for many minutes (at least six) until it eventually fails.
While this is happening, the CPU is at 100% for postgres.
So it looks like postgres is simply taking too long to answer this query. The postgres database directory /var/lib/pgsql is 40GB, which feels very large, and clearly these tables have a LOT of historical information.
This works okay pre-migration to podman; responses via the webui take a few seconds (less than 20), so something seems wrong.
I would like to prune that information - how do I do that, please?
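In case it helps narrow things down, something like this should show which tables are holding all that history (a sketch, run inside the container; the database name susemanager is an assumption, so adjust if yours differs):

```bash
# inside the container (mgrctl term); the database name is an assumption
su - postgres -c 'psql susemanager' <<'SQL'
SELECT relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
  FROM pg_catalog.pg_statio_user_tables
 ORDER BY pg_total_relation_size(relid) DESC
 LIMIT 10;
SQL
```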
I restarted the VM, since postgres was stuck at high load, and tried again.
Webui -> Schedules (Expand) -> Failed Actions
After exactly 60 seconds, the webui fails with
Postgres continues using 100% CPU for one core for a long time.
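The statement postgres is stuck on can be seen from inside the container via the standard pg_stat_activity view (a sketch):

```bash
# inside the container (mgrctl term); list the longest-running active statements
su - postgres -c 'psql' <<'SQL'
SELECT pid, now() - query_start AS runtime, state, left(query, 120) AS query
  FROM pg_stat_activity
 WHERE state <> 'idle'
 ORDER BY runtime DESC;
SQL
```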
The postgres database directory /var/lib/pgsql is 40Gb which feels very large
This might be because of data-pg14: the uyuni container uses postgresql 16, so during migration the db is upgraded as well (data-pg14 is the backup of the original data).
Can you please provide /var/lib/pgsql/data/postgresql.conf of the container (so after mgrctl term)?
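(The non-comment lines are enough; something like this prints just the active settings:)

```bash
# inside the container: show only the uncommented settings
grep -vE '^[[:space:]]*(#|$)' /var/lib/pgsql/data/postgresql.conf
```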
Thanks. I'm now fully post-migration to the container and this is our live system.
Here are those contents, and a du of the db files, which are actually 5GB smaller than on the old rpm-based system.
uyuni-server:/ # cat /var/lib/pgsql/data/postgresql.conf
max_connections = 400
shared_buffers = 3840MB
dynamic_shared_memory_type = posix
max_wal_size = 4096MB
min_wal_size = 1024MB
log_destination = 'stderr'
logging_collector = on
log_line_prefix = '%m %d %u [%p]'
log_timezone = 'Europe/London'
datestyle = 'iso, mdy'
timezone = 'Europe/London'
lc_messages = 'en_US.UTF-8'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
default_text_search_config = 'pg_catalog.english'
effective_cache_size = 11264MB
work_mem = 10MB
maintenance_work_mem = 960MB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
constraint_exclusion = off
wal_level = archive
max_wal_senders = 5
wal_keep_size = 1024
archive_mode = on
archive_command = '/bin/true'
max_locks_per_transaction = 100
standard_conforming_strings = 'off'
bytea_output = 'escape'
password_encryption = scram-sha-256
ssl = on
ssl_cert_file = '/etc/pki/tls/certs/spacewalk.crt'
ssl_key_file = '/etc/pki/tls/private/pg-spacewalk.key'
listen_addresses = '*'
random_page_cost = 4
effective_io_concurrency = 2
uyuni-server:/var/lib/pgsql # du -shc *
4.0K analyze_new_cluster.sh
20G data
20G data-pg13
1.5G data-pg14
4.0K delete_old_cluster.sh
4.0K initlog
40G total
Thanks, maybe that explains the issue: we recently found out that postgresql.conf is not migrated during migration (I'm still looking into it; I'm not sure yet which mgradm version introduced the problem). If you still have your old system, you can manually migrate your previous postgresql.conf file and see if it solves the issue. Otherwise you can try changing these values:
max_connections = 200
shared_buffers = 7680MB
effective_cache_size = 22528MB
work_mem = 40MB
maintenance_work_mem = 1024MB
EDIT:
If you have a data-pg14 folder in /var/lib/pgsql, you can copy the postgresql.conf from there (this is the original one from your source system).
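Roughly like this (a sketch; back up the current file first, settings such as shared_buffers and max_connections only take effect after a restart, and the postgresql service name inside the container is an assumption on my part):

```bash
# inside the container (mgrctl term)
cd /var/lib/pgsql
# compare the migrated config with the original one from the source system
diff <(grep -vE '^[[:space:]]*(#|$)' data-pg14/postgresql.conf | sort) \
     <(grep -vE '^[[:space:]]*(#|$)' data/postgresql.conf | sort)
# keep a backup of the migrated file, then restore the original
cp data/postgresql.conf data/postgresql.conf.migrated.bak
cp data-pg14/postgresql.conf data/postgresql.conf
# restart postgres so the restored settings take effect; if it refuses to start
# because a setting was removed between postgres 14 and 16, comment that line out
systemctl restart postgresql
```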
Thank you for your time in trying to help me.
You were right here in that the conf hadn't transferred; the values were entirely different to the legacy server's.
Unfortunately, changing those values (to both your suggested ones and the legacy ones) didn't help. (I did reboot and check that they had persisted.)
The Pending/Failed/Completed schedules still cause things to timeout.
Strangely, the Archived table /does/ work, and contains many tens of thousands of entries. I'm currently looping through that in the webui and deleting 10k of these old schedules at a time. They go back several years.
Maybe once they're gone, postgres will have more luck displaying the others.
OK, so that's worked for me and I now have access to the scheduled tasks again.
I deleted hundreds of thousands of Archived Actions, 10k at a time. The database was surprisingly quick considering the numbers, and once the first 100k had gone, it got noticeably more responsive. These dated back to early 2020, which is when I guess we moved to Uyuni from Spacewalk.
Once they were all gone, I could then open Failed Actions. There were another 7,810 in there, which I could "Move to archive" and then delete.
Finally, I started with "Completed Actions". Uyuni was now able to show this page without timing out, and I've started the long process of archiving these, after which I'll delete them from Archived Actions. I've archived 200k Completeds so far.
Once everything is totally done, I'll try a vacuum to recover some db space.
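Probably something along these lines, run inside the container (a sketch; vacuumdb is the stock postgres tool, a plain vacuum with analyze frees dead space for reuse, while a full vacuum actually shrinks the files but takes table locks while it runs):

```bash
# inside the container (mgrctl term), once the old actions are gone
su - postgres -c 'vacuumdb --all --analyze --verbose'
# only if the disk space needs to come back immediately and table locks are acceptable:
# su - postgres -c 'vacuumdb --all --full'
```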
So - in short, I have this working again. Thanks for your time.
Learning points:
@admd @mcalmer @rjmateus @cbosdo Besides the migration problem itself, I'm not aware of a mechanism to prune really old actions... maybe it is something interesting that we could implement?
@mbussolotto @digdilem-work I'm not aware of any mechanism to archive and delete those actions. But that is a good point and we could consider developing a clean-up procedure for this kind of stuff. We already have one in place to clean older user sessions from the database.
Thank you.
I think it is worth doing. I estimate we had around a million scheduled events accrued over four years, so a scheduled job to prune them from the database feels like a sensible housekeeping feature.
If it's useful, our ideal would be to keep events for 30 days, but I imagine others have different needs, so a configurable expiry-days value would be the best solution.
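To give an idea of scale, a count of actions older than such a cutoff can be pulled straight from the database (a sketch; the database name susemanager and the rhnAction table with its created column are assumptions, so adjust to your schema; I wouldn't delete via raw SQL given the foreign keys, and stuck to the webui for that):

```bash
# inside the container (mgrctl term); database and table names are assumptions
su - postgres -c 'psql susemanager' <<'SQL'
SELECT date_trunc('year', created) AS year, count(*)
  FROM rhnAction
 WHERE created < now() - interval '30 days'
 GROUP BY 1
 ORDER BY 1;
SQL
```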
Additional: if a developer is poking around in scheduled events anyway, can I raise the question of whether Archived Events are even needed?
Having spent several hours tidying up this mess on our system, having to "Archive" each event (in batches of up to 10,000), then move across to the Archived interface and select the same events again to "Delete" them, doubled my work.
Each event is already sorted into Pending, Completed and Failed, which is an effective archive anyway. I'm not sure what the benefit is of moving them to another list before getting rid of them.
Ideally (from my perspective), the workflow might go:
Then the event remains available, both in those lists and in the per-machine Event History, for $variable_days, which reduces the chance of them mounting up and affecting performance.
Problem description
I’ve now test migrated Uyuni 2024-08 from RPM to Podman
One problem I’m seeing on the new vm is that every page of the “Scheduled” menu times out, eventually leading to “The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.”
Whilst doing this, two CPUs are at 100% with postgres threads.
These pages still work on the original server but are slow; it's possible I've extended these timeouts somewhere. The reason seems fairly clear:
In both “Completed Actions” and “Archived Actions”, the ranges go from “1 - 500 of 10,000 (0 selected)” (Web urls /rhn/schedule/CompletedActions.do and ArchivedActions.do)
That’s a suspiciously round figure, so it’s probably more – at least 20,000 logs going back years. I don’t need these logs and would like to remove them – I think that should allow the container version to load those pages without timing out.
The Admin -> Scheduled Tasks options are all completing okay; I had expected to see something in there to manage these logs.
How can I remove all but the past week or so of these logs in postgres, please?
Additionally – how do I stop them piling up in future?
(I did send this to the uyuni-users mailing list but got no reply after a couple of days, so retrying here)
Steps to reproduce
As above
Uyuni version
Uyuni proxy version (if used)
No response
Useful logs
No response
Additional information
No response