qiime2 / galaxy-tools

Official QIIME 2 tools for Galaxy
BSD 3-Clause "New" or "Revised" License

More specific stdio parsing #63

Closed bernt-matthias closed 3 months ago

bernt-matthias commented 5 months ago

We got a report that dada2 crashed on some instance with the following in the stderr:

raise Exception("An error was encountered while running DADA2"
Exception: An error was encountered while running DADA2 in R (return code -9), please inspect stdout and stderr to learn more.

The forums seem to suggest that this may indicate an out-of-memory (OOM) condition, but I did not check.
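
For reference, a minimal sketch of how a negative return code can be read, assuming the number in the message follows the convention of Python's `subprocess` module on POSIX, where a return code of `-N` means the child was terminated by signal `N`. The helper name `describe_returncode` is hypothetical and not part of QIIME 2 or Galaxy.

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Describe a return code using the subprocess convention (POSIX)."""
    if returncode < 0:
        # A negative value -N means "terminated by signal N".
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"


print(describe_returncode(-9))
print(describe_returncode(1))
```

Under that assumption, -9 corresponds to SIGKILL, which is consistent with (but does not prove) a kill by an out-of-memory handler.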

Depending on the job runner that is used, this may be detected automatically by Galaxy, e.g. if SLURM is used.

But the Galaxy tool could also be annotated to help with detecting such cases: https://docs.galaxyproject.org/en/master/dev/schema.html#tool-stdio

I was wondering if we can accommodate this by maintaining (manually curated) macro(s) that we can include in the autogenerated tools.
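
To illustrate the kind of rule such a macro would need to express, here is a rough Python sketch of the matching logic only. It is not Galaxy's implementation: the pattern is derived from the stderr quoted above, and the names `OOM_PATTERN`, `classify_stderr`, and the `"possible_oom"` label are hypothetical.

```python
import re

# Hypothetical pattern, built from the wording of the stderr quoted above.
OOM_PATTERN = re.compile(r"running DADA2 in R \(return code -9\)")


def classify_stderr(stderr: str) -> str:
    """Return 'possible_oom' if stderr matches the DADA2 kill message."""
    if OOM_PATTERN.search(stderr):
        return "possible_oom"
    return "unknown"


example = (
    "Exception: An error was encountered while running DADA2 in R "
    "(return code -9), please inspect stdout and stderr to learn more."
)
print(classify_stderr(example))
```

Matching on the message text rather than on the tool's own exit code is an assumption here: the -9 is reported for the R subprocess, so the wrapping process may well exit with a different code.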

ebolyen commented 4 months ago

Would the goal be to report the OOM to the job executor for re-running? Or is there a better way to report OOM to the user based on the job executor?

bernt-matthias commented 4 months ago

> Would the goal be to report the OOM to the job executor for re-running?

Yes. Rerunning can even be triggered automatically if the Galaxy admin has configured a job resubmission schema.

> Or is there a better way to report OOM to the user based on the job executor?

I don't think so. The user will see the message if no resubmission is configured. Then the user has to ask the admin for more memory for the corresponding tool.

ebolyen commented 3 months ago

That's pretty cool! I don't think we have a good way to represent this at the moment.

Since QIIME 2 actions are generally run in-process, there's also no good way to even handle SIGKILL, which means that a mapping of exit codes wouldn't have any immediate use to us outside of Galaxy. For trappable signals and normal exit codes, it's entirely in the purview of the plugin to handle and respond.
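
A small sketch of the trappable versus untrappable distinction, assuming a POSIX system and a call from the main thread. The helper `can_install_handler` is hypothetical; it only probes whether Python is allowed to register a handler for a given signal.

```python
import signal


def can_install_handler(signum: int) -> bool:
    """Try to install a no-op handler, then restore the previous one."""
    try:
        previous = signal.signal(signum, lambda signo, frame: None)
    except (OSError, ValueError, RuntimeError):
        # The OS (or Python) refuses handlers for this signal.
        return False
    if previous is not None:
        # Put the original handler back so the probe has no side effects.
        signal.signal(signum, previous)
    return True


print("SIGTERM trappable:", can_install_handler(signal.SIGTERM))
print("SIGKILL trappable:", can_install_handler(signal.SIGKILL))
```

Because a handler for SIGKILL cannot be registered, an in-process action gets no chance to clean up or report anything before the process disappears.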

@Oddant1 do you know if Parsl has any mechanism to care about these for tasks? I'm not sure what we would do in the event we saw this anyhow.

It's also important to us architecturally that plugins not know of the interface running them, so we'd need some unified way to represent this exit code mapping (i.e. there won't be anything like a "Galaxy metadata" section where we could stick this information).

I am going to tentatively close this as out of scope for us at the moment.

Oddant1 commented 3 months ago

@ebolyen, Parsl has some mechanism for keeping track of the status of its tasks, and it also has a built-in retry system, but I think it's a bit more naive than Galaxy's (I believe it just tries the exact same thing again and hopes whatever went wrong last time doesn't this time).
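
To make the contrast concrete, here is a toy sketch in plain Python that uses neither Parsl nor Galaxy. It assumes that resubmission means rerunning with a larger memory allocation; `run_with_escalation` and `fake_task` are hypothetical stand-ins.

```python
from typing import Callable, Sequence


def run_with_escalation(task: Callable[[int], str],
                        memory_gb_steps: Sequence[int]) -> str:
    """Run task with each memory limit in turn until one succeeds."""
    last_error = None
    for memory_gb in memory_gb_steps:
        try:
            return task(memory_gb)
        except MemoryError as error:
            last_error = error
    raise RuntimeError("task failed at every memory step") from last_error


def fake_task(memory_gb: int) -> str:
    # Hypothetical job that needs at least 16 GB to finish.
    if memory_gb < 16:
        raise MemoryError(f"{memory_gb} GB is not enough")
    return f"finished with {memory_gb} GB"


# Escalating resubmission: the second attempt gets more memory.
print(run_with_escalation(fake_task, [8, 16]))

# Naive retry: the same allocation every time, so every attempt fails.
try:
    run_with_escalation(fake_task, [8, 8, 8])
except RuntimeError as error:
    print("naive retry:", error)
```

Retrying with identical resources only helps with transient failures; a deterministic OOM needs the allocation to change between attempts.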