uyuni-project / uyuni

Source code for Uyuni
https://www.uyuni-project.org/

Error loading Scheduled actions history after migrating to podman #9279

Open digdilem-work opened 1 month ago

digdilem-work commented 1 month ago

Problem description

I’ve now test-migrated Uyuni 2024-08 from RPM to Podman.

One problem I’m seeing on the new vm is that every page of the “Scheduled” menu times out, eventually leading to “The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.”

Whilst this is happening, two CPUs are at 100% with postgres threads.

These pages still work on the original server, though slowly (it’s possible I’ve extended the timeouts somewhere), and the reason seems fairly clear.

In both “Completed Actions” and “Archived Actions”, the listings show “1 - 500 of 10,000 (0 selected)” (web URLs /rhn/schedule/CompletedActions.do and ArchivedActions.do).

That’s a suspiciously round figure, so it’s probably more: at least 20,000 logs going back years. I don’t need these logs and would like to remove them; I think that should allow the container version to load those pages without timing out.

The Admin -> Scheduled Tasks options are all completing okay; I had expected to see something in there to manage these logs.

How can I remove all but the past week or so of these logs in postgres, please?

Additionally, how do I stop them piling up in future?

(I did send this to the uyuni-users mailing list but got no reply after a couple of days, so retrying here)

Steps to reproduce

As above

Uyuni version

2024-08-Podman

Uyuni proxy version (if used)

No response

Useful logs

No response

Additional information

No response

mbussolotto commented 1 month ago

Postgres logs can be found inside the container (so after running mgrctl term) in /var/lib/pgsql/data/log/
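
For example (a sketch; mgrctl term opens a shell inside the server container, and the log filename depends on your log_filename setting):

mgrctl term
ls -lt /var/lib/pgsql/data/log/                          # newest log file listed first
tail -n 100 /var/lib/pgsql/data/log/postgresql-<date>.log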

digdilem-work commented 1 month ago

@mbussolotto Thank you. Those logs do have some errors - notably

2024-10-01 07:11:29.980 BST [unknown] [unknown] [29465]LOG:  could not accept SSL connection: unknown protocol
2024-10-01 07:11:29.980 BST [unknown] [unknown] [29465]HINT:  This may indicate that the client does not support any SSL protocol version between TLSv1.2 and TLSv1.3.

However, these times don't match my issue, which occurs when I manually click on any of the Schedules menu items. The Uyuni webui goes unresponsive for many minutes (at least six) until it eventually fails.

At this time, the CPU is at 100% for postgres.

So it looks like postgres is simply taking too long to answer this query. The postgres database directory /var/lib/pgsql is 40 GB, which feels very large, and clearly these tables hold a LOT of historical information.

This works okay pre-migration to podman. Responses via the webui take a few seconds (less than 20), so something seems wrong.

I would like to prune that information - how do I do that, please?
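
(For anyone investigating the same thing, one way to see which tables dominate that 40 GB is a query like this sketch, assuming spacewalk-sql's --select-mode with - reads the query from stdin:)

mgrctl term
spacewalk-sql --select-mode - <<'EOF'
-- ten largest tables by total size, indexes included
SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) AS size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;
EOF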

digdilem-work commented 1 month ago

I restarted the vm since postgres was stuck at high load and tried again.

Webui -> Schedules (Expand) -> Failed Actions

After exactly 60 seconds, the webui fails with an error:

(screenshot of the error page)

Postgres continues using 100% CPU for one core for a long time.

mbussolotto commented 1 month ago

The postgres database directory /var/lib/pgsql is 40Gb which feels very large

This might be because of data-pg14: the uyuni container uses PostgreSQL 16, so during migration the db is upgraded as well (data-pg14 is the backup of the original data). Can you please provide /var/lib/pgsql/data/postgresql.conf from the container (so after mgrctl term)?

digdilem-work commented 1 month ago

Thanks. I'm now fully post-migration to the container, and this is our live system.

Here are those contents, and a du of the db files, which are actually 5 GB smaller than on the old rpm-based system.

uyuni-server:/ # cat /var/lib/pgsql/data/postgresql.conf
max_connections = 400
shared_buffers = 3840MB
dynamic_shared_memory_type = posix
max_wal_size = 4096MB
min_wal_size = 1024MB
log_destination = 'stderr'
logging_collector = on
log_line_prefix = '%m %d %u [%p]'
log_timezone = 'Europe/London'
datestyle = 'iso, mdy'
timezone = 'Europe/London'
lc_messages = 'en_US.UTF-8'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
default_text_search_config = 'pg_catalog.english'
effective_cache_size = 11264MB
work_mem = 10MB
maintenance_work_mem = 960MB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
constraint_exclusion = off
wal_level = archive
max_wal_senders = 5
wal_keep_size = 1024
archive_mode = on
archive_command = '/bin/true'
max_locks_per_transaction = 100
standard_conforming_strings = 'off'
bytea_output = 'escape'
password_encryption = scram-sha-256
ssl = on
ssl_cert_file = '/etc/pki/tls/certs/spacewalk.crt'
ssl_key_file = '/etc/pki/tls/private/pg-spacewalk.key'
listen_addresses = '*'
random_page_cost = 4
effective_io_concurrency = 2
uyuni-server:/var/lib/pgsql # du -shc *
4.0K    analyze_new_cluster.sh
20G     data
20G     data-pg13
1.5G    data-pg14
4.0K    delete_old_cluster.sh
4.0K    initlog
40G     total

mbussolotto commented 1 month ago

Thanks, maybe that explains the issue: we recently found out that postgresql.conf is not migrated during migration (I'm still looking into it; I'm not sure which mgradm version introduced the problem). If you still have your old system, you can manually migrate your previous postgresql.conf file and see if it solves the issue. Otherwise you can try changing these values:

max_connections = 200
shared_buffers = 7680MB
effective_cache_size = 22528MB
work_mem = 40MB
maintenance_work_mem = 1024MB
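
To apply changes like these, one approach (a sketch, assuming the uyuni-server.service systemd unit that mgradm creates on the host):

mgrctl term
vi /var/lib/pgsql/data/postgresql.conf   # adjust the values above
exit
systemctl restart uyuni-server           # back on the host, restarts the container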

EDIT:

If you have a data-pg14 folder in /var/lib/pgsql, you can copy the postgresql.conf from there (this is the original one from your source system).
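
A sketch of that copy (note that any settings removed between PostgreSQL 14 and 16 would need to be dropped from the copied file before postgres will start again):

mgrctl term
cp /var/lib/pgsql/data/postgresql.conf /var/lib/pgsql/data/postgresql.conf.bak   # keep the generated one
cp /var/lib/pgsql/data-pg14/postgresql.conf /var/lib/pgsql/data/postgresql.conf
exit
systemctl restart uyuni-server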

digdilem-work commented 1 month ago

Thank you for your time in trying to help me.

You were right here in that the conf hadn't transferred - the values were entirely different to the legacy server's.

Unfortunately, changing those values to both yours and the legacy ones didn't help. (I did reboot and check that they had persisted.)

The Pending/Failed/Completed schedules still cause things to timeout.

Strangely, the Archived table /does/ work, and contains many tens of thousands of entries. I'm currently looping through that in the webui and deleting 10k of these old schedules at a time. They go back several years.

Maybe once they're gone, postgres will have more luck displaying the others.

digdilem-work commented 1 month ago

OK, so that's worked for me and I now have access to the scheduled tasks again.

I deleted hundreds of thousands of Archived Actions, 10k at a time. The database was surprisingly quick considering the numbers, and once the first 100k had gone, it got noticeably more responsive. These dated back to early 2020, which I guess is when we moved to Uyuni from Spacewalk.

Once they were all gone, I could then open Failed Actions. There were another 7,810 in there, which I could "Move to archive" and then delete.

Finally, I started on "Completed Actions". Uyuni was now able to show this page without timing out, and I've begun the long process of archiving these, after which I'll delete them from Archived Actions. I've archived 200k Completeds so far.

Once everything is totally done, I'll try a vacuum to recover some db space.
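
For reference, a minimal sketch of that vacuum (plain vacuumdb reclaims dead rows in place; --full also returns space to the OS but takes exclusive table locks, so it belongs in a maintenance window):

mgrctl term
su - postgres -c 'vacuumdb --all --analyze'   # reclaim dead rows, refresh planner stats
su - postgres -c 'vacuumdb --all --full'      # optional: rewrites tables, locks them while running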

So - in short, I have this working again. Thanks for your time.

Learning points:

  1. It would almost certainly have been easier to delete these directly in the database.
  2. I need to find a way to automatically prune these going forwards; see the sketch below. The reason this got so out of hand is that I assumed Uyuni was doing this itself, and it clearly wasn't.
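
Until something official exists, one possible stopgap is a scheduled SQL delete. This is only a sketch against the standard schema: it assumes rhnAction.earliest_action, rhnServerAction.status (0 = Queued, 1 = Picked Up), ON DELETE CASCADE on the dependent tables, and that spacewalk-sql accepts DML from stdin via -. It bypasses the web UI's archive logic, so test it on a copy of the database first:

mgrctl term
spacewalk-sql - <<'EOF'
-- remove actions older than 30 days that no server still has queued or picked up
DELETE FROM rhnAction a
WHERE a.earliest_action < CURRENT_TIMESTAMP - INTERVAL '30 days'
  AND NOT EXISTS (
    SELECT 1 FROM rhnServerAction sa
    WHERE sa.action_id = a.id
      AND sa.status IN (0, 1)
  );
EOF
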
mbussolotto commented 1 month ago

@admd @mcalmer @rjmateus @cbosdo Besides the migration problem itself, I'm not aware of a mechanism to prune really old actions... maybe it's something interesting that we could implement?

rjmateus commented 1 month ago

@mbussolotto @digdilem-work I'm not aware of any mechanism to archive and delete those actions. But that is a good point, and we could consider developing a clean-up procedure for this kind of thing. We already have one in place to clean older user sessions from the database.

digdilem-work commented 1 month ago

Thank you.

I think it is worth doing. I estimate we had around a million scheduled events accrued over four years, so a scheduled job that prunes them from the database feels like a sensible housekeeping feature.

If it's useful, our ideal would be to keep events for 30 days, but I imagine others have different needs. A configurable expiry-days value would be the best solution.

digdilem-work commented 1 month ago

Additional: if a developer is poking around in scheduled events anyway, can I raise the question of whether Archived Events are even needed?

Having spent several hours tidying up this mess on our system, having to "Archive" each event (in batches of up to 10,000), then move across to the Archived interface and select the same events again to "Delete" them, doubled my work.

Each event is already sorted into Pending, Completed and Failed, which is an effective archive in itself. I'm not sure what the benefit is of moving them somewhere else before getting rid of them.

Ideally (from my perspective), the workflow might go:

  1. Scheduled event appears in "Pending".
  2. Once actioned, it is then moved into either "Completed" or "Failed".
  3. After $variable_days, the event is permanently deleted.

Then each event remains available, both in those lists and in the per-machine Event History, for $variable_days, which reduces the chance of them mounting up and affecting performance.