osbuild / osbuild-composer

An HTTP service for building bootable OS images.
https://www.osbuild.org
Apache License 2.0
162 stars 107 forks source link

a new job request sometimes returns error 500 #1100

Open ondrejbudai opened 3 years ago

ondrejbudai commented 3 years ago

See the attached log. I think this fortunately fixes itself after some retries.

systemd[1]: Started OSBuild Composer Remote Worker (composer.osbuild.org:8700).
osbuild-worker[461968]: 2020/11/24 04:19:50 Composer configuration:
osbuild-worker[461968]: [koji]
osbuild-worker[461968]:   [koji."brewhub.stage.engineering.redhat.com"]
osbuild-worker[461968]:     [koji."brewhub.stage.engineering.redhat.com".kerberos]
osbuild-worker[461968]:       principal = "osbuild-automation@IPA.REDHAT.COM"
osbuild-worker[461968]:       keytab = "/etc/osbuild-worker/client.keytab"
osbuild-worker[461968]: Waiting for a new job...
osbuild-worker[461968]: Running 'osbuild-koji' job add0de98-e9f0-4eff-9c47-f7133b2b76bd
kernel: XFS (loop1): Mounting V5 Filesystem
kernel: XFS (loop1): Ending clean mount
kernel: XFS (loop1): Unmounting Filesystem
systemd[1]: Starting SSSD Kerberos Cache Manager...
systemd[1]: Started SSSD Kerberos Cache Manager.
sssd[kcm][483401]: Starting up
osbuild-worker[461968]: 2020/11/24 04:28:36 Job add0de98-e9f0-4eff-9c47-f7133b2b76bd finished
osbuild-worker[461968]: Waiting for a new job...
osbuild-worker[461968]: 2020/11/24 04:28:42 error requesting job: 500 — Internal Server Error
systemd[1]: osbuild-remote-worker@composer.osbuild.org:8700.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: osbuild-remote-worker@composer.osbuild.org:8700.service: Failed with result 'exit-code'.
systemd[1]: osbuild-remote-worker@composer.osbuild.org:8700.service: Service RestartSec=10s expired, scheduling restart.
systemd[1]: osbuild-remote-worker@composer.osbuild.org:8700.service: Scheduled restart job, restart counter is at 15.
systemd[1]: Stopped OSBuild Composer Remote Worker (composer.osbuild.org:8700).
systemd[1]: Started OSBuild Composer Remote Worker (composer.osbuild.org:8700).
osbuild-worker[483408]: 2020/11/24 04:28:52 Composer configuration:
osbuild-worker[483408]: [koji]
osbuild-worker[483408]:   [koji."brewhub.stage.engineering.redhat.com"]
osbuild-worker[483408]:     [koji."brewhub.stage.engineering.redhat.com".kerberos]
osbuild-worker[483408]:       principal = "osbuild-automation@IPA.REDHAT.COM"
osbuild-worker[483408]:       keytab = "/etc/osbuild-worker/client.keytab"
osbuild-worker[483408]: Waiting for a new job...
osbuild-worker[483408]: 2020/11/24 04:32:41 error requesting job: 500 — Internal Server Error
systemd[1]: osbuild-remote-worker@composer.osbuild.org:8700.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: osbuild-remote-worker@composer.osbuild.org:8700.service: Failed with result 'exit-code'.
systemd[1]: osbuild-remote-worker@composer.osbuild.org:8700.service: Service RestartSec=10s expired, scheduling restart.
systemd[1]: osbuild-remote-worker@composer.osbuild.org:8700.service: Scheduled restart job, restart counter is at 16.
systemd[1]: Stopped OSBuild Composer Remote Worker (composer.osbuild.org:8700).
systemd[1]: Started OSBuild Composer Remote Worker (composer.osbuild.org:8700).
osbuild-worker[483420]: 2020/11/24 04:32:51 Composer configuration:
osbuild-worker[483420]: [koji]
osbuild-worker[483420]:   [koji."brewhub.stage.engineering.redhat.com"]
osbuild-worker[483420]:     [koji."brewhub.stage.engineering.redhat.com".kerberos]
osbuild-worker[483420]:       principal = "osbuild-automation@IPA.REDHAT.COM"
osbuild-worker[483420]:       keytab = "/etc/osbuild-worker/client.keytab"
osbuild-worker[483420]: Waiting for a new job...
osbuild-worker[483420]: Running 'osbuild-koji' job ed0b7709-908b-4024-ab86-cb23348ab824
ondrejbudai commented 3 years ago

I managed to retrieve the error:

/api/worker/v1/jobs  channel was closed unexpectedly
ondrejbudai commented 3 years ago

https://github.com/osbuild/osbuild-composer/blob/01c21b5a47dc6d96e7a81b204902d904790e5e91/internal/jobqueue/fsjobqueue/fsjobqueue.go#L174

This is the cause, we're deleting channels that still have listeners. :-/