opensafely-core / job-runner

A client for running jobs in an OpenSAFELY secure environment, requested via job-server (q.v.)

Failure to trace state when preparing for reboot #723

Open inglesp opened 7 months ago

inglesp commented 7 months ago

I ran just jobrunner/stop and then just jobrunner/prepare-for-reboot at the start of the maintenance window for https://github.com/opensafely-core/sysadmin/issues/168.

Several tracebacks were logged to the screen. Unfortunately I didn't capture them before the server was rebooted, and so I do not have a complete record.

As far as I could tell, there was one traceback per job. The tracebacks were caught and logged from finish_current_job: https://github.com/opensafely-core/job-runner/blob/22f9fd5eb25280061d386178304d8de9e0174f83/jobrunner/tracing.py#L117-L130

And the exception message was: AttributeError: 'NonRecordingSpan' object has no attribute 'name'.

However I don't have a record of where the exception was raised from.

As far as I can tell, the logs do not indicate a problem with stopping the job or changing its state, only that the change of state could not be traced.

inglesp commented 7 months ago

The only place we look up .name on a span in our own code is here:

https://github.com/opensafely-core/job-runner/blob/22f9fd5eb25280061d386178304d8de9e0174f83/jobrunner/tracing.py#L261-L263

inglesp commented 7 months ago

I think this is probably caused by the prepare_for_reboot script not setting up tracing, meaning that we don't have a real tracer object. The jobrunner service does this by calling jobrunner.tracing.setup_default_tracing: https://github.com/opensafely-core/job-runner/blob/22f9fd5eb25280061d386178304d8de9e0174f83/jobrunner/service.py#L33

We should ensure that tracing is set up by this script (and any others), and we should consider being defensive against it not being set up, perhaps by writing a wrapper for get_tracer.
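
As a rough illustration of the defensive option, here is a minimal sketch of tolerating a span that has no name. It deliberately avoids importing OpenTelemetry or job-runner: the two span classes are hypothetical stand-ins for a real recording span and for the no-op span we get when tracing has not been set up, and span_name and describe_state_change are made-up helper names, not existing job-runner functions.

```python
class RecordingSpanStandIn:
    """Hypothetical stand-in for a real span, which carries a name."""

    def __init__(self, name):
        self.name = name

    def is_recording(self):
        return True


class NonRecordingSpanStandIn:
    """Hypothetical stand-in for a no-op span: it has no name attribute."""

    def is_recording(self):
        return False


def span_name(span, default="<untraced>"):
    """Return the span's name, or a default if the span does not have one.

    Using getattr with a default avoids the AttributeError seen when
    tracing has not been configured.
    """
    return getattr(span, "name", default)


def describe_state_change(span, new_state):
    """Build a log-friendly description without assuming a real span."""
    return f"{span_name(span)} -> {new_state}"


if __name__ == "__main__":
    print(describe_state_change(RecordingSpanStandIn("EXECUTING"), "finished"))
    print(describe_state_change(NonRecordingSpanStandIn(), "finished"))
```

This only covers the .name lookup. It does not replace setting up tracing in the script, since without that the state change would still go untraced, just without a traceback.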