Taskomatic Errata update appears to be hogging CPU usage via the Postgresql DB worker processes

ppanon2022 commented 1 month ago

Problem description

Postgresql processes are constantly hogging the CPU. Those processes are linked to the Taskomatic process with the com.redhat.rhn.manager.errata.cache.UpdateErrataCacheCommand method. See https://github.com/uyuni-project/uyuni/discussions/8978 for the investigation process followed

Steps to reproduce

We've got channels syncing repos for OpenSUSE 15, Rocky 8 and 9, Ubuntu 20.04 and Ubuntu 22.04, RHEL versions, and some old CentOS 7. I haven't yet been able to identify which channel is generating the Errata/Patch info which is causing the problem.
It is less likely to be OpenSUSE 15, Rocky 8, and CentOS7 or the Ubuntu channels because we have another server which also has those channels/repos and is not showing the same behaviour.
I thought I remembered an error in a sync log about information for some patch that was too long, but I can't find it again
restarting Taskomatic does not eliminate the problem. Whatever is generating the queries is not being cleared and the task restarts once Taskomatic is restarted. ...

Uyuni version

# zypper info Uyuni-Server-Release
Loading repository data...
Reading installed packages...

Information for package Uyuni-Server-release:
---------------------------------------------
Repository     : Uyuni Server Stable
Name           : Uyuni-Server-release
Version        : 2024.07-230900.219.1.uyuni3
Arch           : x86_64
Vendor         : obs://build.opensuse.org/systemsmanagement:Uyuni
Support Level  : Level 3
Installed Size : 1.4 KiB
Installed      : Yes
Status         : out-of-date (version 2024.05-230900.217.1.uyuni3 installed)
Source package : Uyuni-Server-release-2024.07-230900.219.1.uyuni3.src
Upstream URL   : https://www.uyuni-project.org/
Summary        : Uyuni Server
Description    :
    Uyuni lets you efficiently manage physical, virtual,
    and cloud-based Linux systems. It provides automated and cost-effective
    configuration and software management, asset management, and system
    provisioning.

Uyuni proxy version (if used)

No response

Useful logs

When using SELECT pg_cancel_backend(X) to stop the PG workers which are saturating the CPU usage,
the taskomatic log contains the following error, indicating the issue is with a Patch/Errata update process:

2024-08-08 04:11:26,388 [Thread-66] ERROR com.redhat.rhn.manager.errata.cache.UpdateErrataCacheCommand - Problem updating cache for server
com.redhat.rhn.common.db.WrappedSQLException: ERROR: canceling statement due to user request
Where: SQL statement "insert into rhnServerNeededCache
(server_id, errata_id, package_id, channel_id)
(select distinct sp.server_id, x.errata_id, p.id, x.channel_id
FROM (SELECT sp_sp.server_id, sp_sp.name_id,
sp_sp.package_arch_id, max(sp_pe.evr) AS max_evr
FROM rhnServerPackage sp_sp
join rhnPackageEvr sp_pe ON sp_pe.id = sp_sp.evr_id
GROUP BY sp_sp.server_id, sp_sp.name_id, sp_sp.package_arch_id) sp
join susePackageExcludingPartOfPtf p ON p.name_id = sp.name_id
join rhnPackageEvr pe ON pe.id = p.evr_id AND (sp.max_evr).type = (pe.evr).type AND sp.max_evr < pe.evr
join rhnPackageUpgradeArchCompat puac
ON puac.package_arch_id = sp.package_arch_id
AND puac.package_upgrade_arch_id = p.package_arch_id
join rhnServerChannel sc ON sc.server_id = sp.server_id
join rhnChannelPackage cp ON cp.package_id = p.id
AND cp.channel_id = sc.channel_id
left join (SELECT ep.errata_id, ce.channel_id, ep.package_id
FROM rhnChannelErrata ce
join rhnErrataPackage ep
ON ep.errata_id = ce.errata_id
join rhnServerChannel sc_sc
ON sc_sc.channel_id = ce.channel_id
WHERE sc_sc.server_id = server_id_in) x
ON x.channel_id = sc.channel_id AND x.package_id = cp.package_id
left join rhnErrata e on x.errata_id = e.id
where sp.server_id = server_id_in
and (x.errata_id IS NULL or e.advisory_status != 'retracted') -- packages which are part of a retracted errata should not be installed
and NOT EXISTS (SELECT 1 FROM suseServerAppStreamHiddenPackagesView WHERE sid = server_id_in AND pid = p.id))"
PL/pgSQL function rhn_server.update_needed_cache(numeric) line 5 at SQL statement
at com.redhat.rhn.common.translation.SqlExceptionTranslator.sqlException(SqlExceptionTranslator.java:39) ~[rhn.jar:?]
at com.redhat.rhn.common.db.datasource.CachedStatement.lambda$executeCallable$5(CachedStatement.java:611) ~[rhn.jar:?]
at org.hibernate.jdbc.WorkExecutor.executeReturningWork(WorkExecutor.java:55) ~[hibernate-core.jar:5.3.25.Final]
at org.hibernate.internal.SessionImpl$2.accept(SessionImpl.java:2421) ~[hibernate-core.jar:5.3.25.Final]
at org.hibernate.engine.jdbc.internal.JdbcCoordinatorImpl.coordinateWork(JdbcCoordinatorImpl.java:306) ~[hibernate-core.jar:5.3.25.Final]
at org.hibernate.internal.SessionImpl.doWork(SessionImpl.java:2428) ~[hibernate-core.jar:5.3.25.Final]
at org.hibernate.internal.SessionImpl.doReturningWork(SessionImpl.java:2424) ~[hibernate-core.jar:5.3.25.Final]
at com.redhat.rhn.common.db.datasource.CachedStatement.doWithStolenConnection(CachedStatement.java:942) ~[rhn.jar:?]
at com.redhat.rhn.common.db.datasource.CachedStatement.executeCallable(CachedStatement.java:593) ~[rhn.jar:?]
at com.redhat.rhn.common.db.datasource.CallableMode.execute(CallableMode.java:44) ~[rhn.jar:?]
at com.redhat.rhn.domain.server.ServerFactory.updateServerNeededCache(ServerFactory.java:277) ~[rhn.jar:?]
at com.redhat.rhn.manager.errata.cache.UpdateErrataCacheCommand.processServer(UpdateErrataCacheCommand.java:195) ~[rhn.jar:?]
at com.redhat.rhn.manager.errata.cache.UpdateErrataCacheCommand.updateErrataCacheForServer(UpdateErrataCacheCommand.java:105) ~[rhn.jar:?]
at com.redhat.rhn.taskomatic.task.errata.ErrataCacheWorker.run(ErrataCacheWorker.java:63) ~[rhn.jar:?]
at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(PooledExecutor.java:732) ~[concurrent-1.3.4.jar:?]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: org.postgresql.util.PSQLException: ERROR: canceling statement due to user request
Where: SQL statement "insert into rhnServerNeededCache
(server_id, errata_id, package_id, channel_id)
(select distinct sp.server_id, x.errata_id, p.id, x.channel_id
FROM (SELECT sp_sp.server_id, sp_sp.name_id,
sp_sp.package_arch_id, max(sp_pe.evr) AS max_evr
FROM rhnServerPackage sp_sp
join rhnPackageEvr sp_pe ON sp_pe.id = sp_sp.evr_id
GROUP BY sp_sp.server_id, sp_sp.name_id, sp_sp.package_arch_id) sp
join susePackageExcludingPartOfPtf p ON p.name_id = sp.name_id
join rhnPackageEvr pe ON pe.id = p.evr_id AND (sp.max_evr).type = (pe.evr).type AND sp.max_evr < pe.evr
join rhnPackageUpgradeArchCompat puac
ON puac.package_arch_id = sp.package_arch_id
AND puac.package_upgrade_arch_id = p.package_arch_id
join rhnServerChannel sc ON sc.server_id = sp.server_id
join rhnChannelPackage cp ON cp.package_id = p.id
AND cp.channel_id = sc.channel_id
left join (SELECT ep.errata_id, ce.channel_id, ep.package_id
FROM rhnChannelErrata ce
join rhnErrataPackage ep
ON ep.errata_id = ce.errata_id
join rhnServerChannel sc_sc
ON sc_sc.channel_id = ce.channel_id
WHERE sc_sc.server_id = server_id_in) x
ON x.channel_id = sc.channel_id AND x.package_id = cp.package_id
left join rhnErrata e on x.errata_id = e.id
where sp.server_id = server_id_in
and (x.errata_id IS NULL or e.advisory_status != 'retracted') -- packages which are part of a retracted errata should not be installed
and NOT EXISTS (SELECT 1 FROM suseServerAppStreamHiddenPackagesView WHERE sid = server_id_in AND pid = p.id))"
PL/pgSQL function rhn_server.update_needed_cache(numeric) line 5 at SQL statement
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2565) ~[postgresql.jar:42.2.25]
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2297) ~[postgresql.jar:42.2.25]
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:322) ~[postgresql.jar:42.2.25]
at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:481) ~[postgresql.jar:42.2.25]
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:401) ~[postgresql.jar:42.2.25]
at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:164) ~[postgresql.jar:42.2.25]
at org.postgresql.jdbc.PgCallableStatement.executeWithFlags(PgCallableStatement.java:83) ~[postgresql.jar:42.2.25]
at org.postgresql.jdbc.PgPreparedStatement.execute(PgPreparedStatement.java:153) ~[postgresql.jar:42.2.25]
at com.mchange.v2.c3p0.impl.NewProxyCallableStatement.execute(NewProxyCallableStatement.java:4519) ~[c3p0-0.9.5.5.jar:0.9.5.5]
at com.redhat.rhn.common.db.NamedPreparedStatement.execute(NamedPreparedStatement.java:121) ~[rhn.jar:?]
at com.redhat.rhn.common.db.datasource.CachedStatement.lambda$executeCallable$5(CachedStatement.java:599) ~[rhn.jar:?]
... 14 more

Additional information

The excessive CPU load (and perhaps record locking had caused an LCM project promotion to fail. Attempting to restart the project promotion also seems to fail.

ppanon2022 commented 1 month ago

Definitely a problem with the errata cache bunch task. The history for Bunch errata-cache-bunch shows most tasks getting skipped, the status is INTERRUPTED for the ones that were running when the server needed to be rebooted for updates. The currently running errata-cache task has been running for 41992 seconds.

I temporarily disabled the schedule for that task (actually changed it to once a month), restarted taskomatic, did a pg_cancel_backend on the backlogged/queued errata job's connection, and the CPU usage dropped as expected with no restart. However I'm guessing that we won't get errata/patch information as a result and we do want that, so some help in identifying what's causing the problem would be appreciated.

ppanon2022 commented 1 month ago

Note the information in the discussion 8978. aaannz suggested running a vacuum analyze on a number of tables - ANALYZE VERBOSE rhnServer, rhnServerPackage, rhnPackageEvr, rhnPackageUpgradeArchCompat, rhnServerChannel, rhnChannelPackage, rhnChannelErrata, rhnErrataPackage, rhnErrata, rhnServerNeededCache;

and there were thousands of NOTICE warnings during the analysis of the rhnpackageevr table. If those mean that the index is effectively unusable, requiring a table scan, then that could totally trash performance.

INFO: analyzing "public.rhnserver" INFO: "rhnserver": scanned 649 of 649 pages, containing 941 live rows and 51 dead rows; 941 rows in sample, 941 estimated total rows INFO: analyzing "public.rhnserverpackage" INFO: "rhnserverpackage": scanned 20745 of 20745 pages, containing 1912211 live rows and 19432 dead rows; 30000 rows in sample, 1912211 estimated total rows INFO: analyzing "public.rhnpackageevr" INFO: "rhnpackageevr": scanned 1403 of 1403 pages, containing 104016 live rows and 27 dead rows; 30000 rows in sample, 104016 estimated total rows NOTICE: comparing incompatible evr types. Using rpm NOTICE: comparing incompatible evr types. Using rpm NOTICE: comparing incompatible evr types. Using deb NOTICE: comparing incompatible evr types. Using deb ... Skipping thousands of those NOTICE: comparing incompatible evr types lines ... INFO: analyzing "public.rhnpackageupgradearchcompat" INFO: "rhnpackageupgradearchcompat": scanned 1 of 1 pages, containing 127 live rows and 0 dead rows; 127 rows in sample, 127 estimated total rows INFO: analyzing "public.rhnserverchannel" INFO: "rhnserverchannel": scanned 51 of 51 pages, containing 5065 live rows and 171 dead rows; 5065 rows in sample, 5065 estimated total rows

ppanon2022 commented 1 month ago

This appears to have been fixed in 2024.07

uyuni-project / uyuni