opensafely-core / job-runner

A client for running jobs in an OpenSAFELY secure environment, requested via job-server (q.v.)

Improve out of memory error given to users #725

Open lucyb opened 4 months ago

lucyb commented 4 months ago

I was doing some Codespaces user testing this week and we came across this error:

Job exited with an error: Job ran out of memory (limit was 4.00GB)

The researcher in question said that this happens intermittently locally and that, when it does, they close other running applications and try again. Given that the researcher often had memory-related issues on their machine, this was quite understandable, although it was not the right thing to do in this instance.

Would it be possible to expand the error message text to make it clearer where the limit is coming from and what to do when it's hit?

bloodearnest commented 4 months ago

Sadly, it's not easy to do this, or at least it wasn't when we first implemented it, which is why we have this ambiguity.

Firstly, we don't know whether we were OOM killed because we hit the per-job limit (default of 4GB in opensafely-cli), or because we actually exhausted the entire memory available to the system. Both can and do occur, both locally and in production. For example, locally in a 4GB Codespace with the default concurrency of 2, if the two running jobs together consume more than the entire 4GB available (not unlikely), then it is system-level exhaustion, not the job-level limit. In fact, since the default job limit is 4GB and the Codespace has 4GB, it will almost always be global system resources triggering the OOM kill.

Secondly, "where the limit is coming from" is different depending on the above. If it is the per-job limit, what to do is different depending on whether its running with opensafely-cli (where a user can change it) or production (where they cannot). If its the system limit, they need to decrease parallelisation or increase memory resources.

The text for this message comes from job-runner, which has to serve both use cases. It may be possible to detect that we're running in local_run mode and add additional text. We could, for example, add text linking to https://docs.opensafely.org/opensafely-cli/#managing-resources.
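
A minimal sketch of that idea, purely illustrative: the local_run flag and the way the message is assembled here are assumptions, not job-runner's actual code.

```python
OOM_MESSAGE = "Job ran out of memory (limit was {limit:.2f}GB)"
DOCS_LINK = "https://docs.opensafely.org/opensafely-cli/#managing-resources"


def oom_message(limit_gb, local_run=False):
    msg = OOM_MESSAGE.format(limit=limit_gb)
    if local_run:
        # Only the local opensafely-cli user can change the per-job limit or
        # reduce parallelism, so only show the advice there.
        msg += f" See {DOCS_LINK} for how to manage resources locally."
    return msg
```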

However, the situation may have changed slightly with the introduction of per-job metrics tracking in job-runner. In theory, we could include the last recorded memory usage of the job in the text, e.g. the message could be something like

Job exited with an error: Job ran out of memory. It was using N MB, the per-job limit was Y MB, and the system free memory was Z MB.

That would hopefully give enough information for the user to figure out where the limit is? The job-runner metrics system was not designed to be used in this way, so it's a little awkward, but it should be possible.
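
For illustration, a hedged sketch of how such a message might be assembled from the recorded metrics (the function and parameter names are hypothetical, not job-runner's real API):

```python
def oom_message_with_metrics(used_bytes, limit_bytes, system_free_bytes):
    mb = 1024 * 1024
    return (
        "Job exited with an error: Job ran out of memory. "
        f"It was using {used_bytes / mb:.0f}MB, "
        f"the per-job limit was {limit_bytes / mb:.0f}MB, "
        f"and the system had {system_free_bytes / mb:.0f}MB free."
    )
```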

lucyb commented 4 months ago

That would hopefully give enough information for the user to figure out where the limit is?

Yes, if it's not too difficult to do, that would be ideal, I think. Thank you. You're giving the user enough information to understand why the problem occurred.

sebbacon commented 4 months ago

Even something as simple as "Job exited with an error: Job ran out of memory. Either increase the limit with --limit, or write your code to use less memory" would be a strict improvement, IMO.

bloodearnest commented 4 months ago

Even something as simple as "Job exited with an error: Job ran out of memory. Either increase the limit with --limit, or write your code to use less memory" would be a strict improvement, IMO.

I think it's better to link to the docs as I suggested, since --limit is often not the correct solution.