Job status still "running" for a while after error occurred

arildm commented 1 year ago

I tried running an annotation job for a corpus with a PDF file with unreadable text (16483.pdf). After 5 s, check-status returned this response, where errors and sparv_output reflect the error that occurred:

Response

```json
{
  "status": "success",
  "message": "Job is running",
  "errors": "ERROR No text was found in the file '16483.pdf'! This file cannot be processed with Sparv. Please make sure that every PDF input file contains machine readable text.\n\n(file: 16483)",
  "sparv_output": "Job execution failed. See log messages above or logs/2023-07-11_14.08.13.372603.log for details.",
  "job_status": {
    "sync2sparv": "none",
    "sync2storage": "none",
    "sparv": "running",
    "korp": "none"
  },
  "sparv_exports": [
    "xml_export:pretty",
    "csv_export:csv",
    "stats_export:sbx_freq_list"
  ],
  "available_files": [
    {
      "name": "16483.pdf",
      "type": "application/pdf",
      "last_modified": "2023-07-11T14:08:01+02:00",
      "size": 115417,
      "path": "16483.pdf"
    }
  ],
  "installed_korp": false,
  "current_process": "sparv",
  "seconds_taken": 5.567173,
  "last_run_started": "2023-07-11T14:08:12+02:00",
  "progress": "3%"
}
```

Since job_status.sparv is still "running", the frontend will not show any errors, and will keep polling check-status. After 25 s, it returned this response, where job_status is finally updated:

Response

```json
{
  "status": "success",
  "message": "An error occurred during processing",
  "errors": "ERROR No text was found in the file '16483.pdf'! This file cannot be processed with Sparv. Please make sure that every PDF input file contains machine readable text.\n\n(file: 16483)",
  "sparv_output": "Job execution failed. See log messages above or logs/2023-07-11_14.08.13.372603.log for details.",
  "job_status": {
    "sync2sparv": "none",
    "sync2storage": "none",
    "sparv": "error",
    "korp": "none"
  },
  "sparv_exports": [
    "xml_export:pretty",
    "csv_export:csv",
    "stats_export:sbx_freq_list"
  ],
  "available_files": [
    {
      "name": "16483.pdf",
      "type": "application/pdf",
      "last_modified": "2023-07-11T14:08:01+02:00",
      "size": 115417,
      "path": "16483.pdf"
    }
  ],
  "installed_korp": false,
  "current_process": "sparv",
  "last_run_started": "2023-07-11T14:08:12+02:00",
  "progress": "3%"
}
```

Only now does the error show in the frontend.

Could we have the job_status update sooner? Or do you think the frontend should use errors (or some other part of the response) to determine whether an error has happened?
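To make the second option concrete, here is a minimal sketch of a client-side check that looks at `errors` as well as `job_status.sparv`. The function name `looks_failed` is hypothetical, and treating a non-empty `errors` field on a running job as a failure is an assumption that would need to be confirmed against the backend's behaviour. The field names are taken from the responses above.

```python
def looks_failed(response: dict) -> bool:
    """Guess whether a check-status response describes a failed Sparv job.

    Hypothetical helper, not part of Mink.
    """
    sparv_status = response.get("job_status", {}).get("sparv")
    if sparv_status == "error":
        # The backend has already marked the job as failed.
        return True
    # Assumption: a non-empty "errors" field while the job is still
    # "running" means the job is about to be marked as failed.
    return sparv_status == "running" and bool(response.get("errors"))
```

With this check, the first response above would already count as failed, so the client could show the error without waiting for `job_status.sparv` to change.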

(Bonus question: shouldn't we have seconds_taken and last_run_ended there as well?)

anne17 commented 1 year ago

Unfortunately I cannot reproduce this behaviour. Do you still have the corpus where this error occurred? Then I could try to run it with the exact same configuration and files. I tried uploading and running just the file you posted and for me the process quit quickly (after a few seconds).

I need some more time to think about the bonus questions :)

anne17 commented 1 year ago

The answer to the bonus question is: yes! I have now changed the code so that seconds_taken and last_run_ended are included in the response when a process is running, has finished successfully, or has finished with an error.
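As an illustration only, and not the actual Mink code, the two timing fields could be derived from a start and end timestamp roughly as follows. The function name `timing_fields` and the choice of `None` for the end time of a job that is still running are assumptions.

```python
from datetime import datetime
from typing import Optional


def timing_fields(started: datetime, ended: Optional[datetime], now: datetime) -> dict:
    """Build the timing part of a status response (hypothetical helper)."""
    # For a job that is still running, measure up to the current time.
    reference = ended if ended is not None else now
    return {
        "last_run_started": started.isoformat(),
        # Assumption: no end time is reported while the job is running.
        "last_run_ended": ended.isoformat() if ended is not None else None,
        "seconds_taken": (reference - started).total_seconds(),
    }
```

The timestamps in the responses above are ISO 8601 strings with a UTC offset, which `datetime.fromisoformat` can parse directly.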

arildm commented 1 year ago

I've reproduced it now (a couple of times); the corpus is mink-fnjxq5rb5l. Now the error message does show, and I'm not sure why it didn't when I created this issue. The job_status.sparv is still "running", so the frontend keeps polling for a few seconds more (~9 s, not 25 s), but that's not really a problem, so I'm closing this issue. I guess the Sparv/Mink backend needs to do some things after the error happens, before the job is done. Screenshots below.

I first get this: [screenshot: mink pdf running]

And then soon this: [screenshot: mink pdf error]

anne17 commented 1 year ago

Ah okay, thanks for the feedback! Yes, I think what happens is that the queue manager (which runs at regular intervals) needs to unqueue the job before its status changes. Maybe this could be improved... It would of course be better if the status changed immediately, but there might be cases where the Sparv process should keep running (e.g. in order to finish other things) despite some error occurring. For now I think we can live with a delay of ~9 seconds, but 25 seconds seems too much. Not sure why it took so long that time... Let's keep an eye on it!
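A rough sketch of why a periodic queue manager produces a delay of this kind: if the status only changes at the next run, the wait depends on where in the interval the failure happens. The function name and the interval values are hypothetical, since the thread does not state how often the queue manager actually runs.

```python
import math


def seconds_until_status_change(failed_at: float, interval: float) -> float:
    """Time from a job failure to the next queue-manager run.

    Assumes runs happen at 0, interval, 2 * interval, ... seconds
    (a simplification, not the real scheduling in Mink).
    """
    next_run = math.ceil(failed_at / interval) * interval
    return next_run - failed_at
```

Under this model the delay is always shorter than one interval, so a much longer wait would point to something other than the scheduling alone.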

spraakbanken / mink-backend, issue #74