Failure to run ClearML experiment (User aborted: stopping task (3))

mmartin9684-sil commented 1 year ago

A recurring error occurs when running experiments via ClearML on the AQuA server, in which the setup of the Docker container for the experiment aborts immediately after Poetry is installed and before the Python packages are installed.

Sample error log (experiment):

2023-06-27 02:34:44
installing poetry from spec 'poetry==1.2.2'...
2023-06-27 02:34:49
  installed package poetry 1.2.2, installed using Python 3.8.10
  These apps are now globally available
    - poetry
done! ✨ 🌟 ✨
?25hUsing environment access token CLEARML_AUTH_TOKEN=****
2023-06-27 02:34:49
User aborted: stopping task (3)

This specific experiment was queued to run immediately after another experiment on the langtech_40gb queue. The preceding experiment saved checkpoints (2 x 5.5GB).

mmartin9684-sil commented 1 year ago

Submitting the same experiment a second time on the same queue (langtech_40gb) resulted in the same error. No other experiments were running at the time, and no experiments had been run for ~6-8 hours.

mmartin9684-sil commented 1 year ago

Submitting the same experiment a third time on a different queue (idx_40gb) resulted in the same error.

mmartin9684-sil commented 1 year ago

Similar experiments on the langtech_40gb and idx_40gb queues are encountering the same error.

Nearly identical experiments ran successfully yesterday.

Note that ClearML doesn't recognize this failure condition immediately. Both these experiments failed 20 minutes ago, and ClearML continues to report them as 'Running'.

mshannon-sil commented 1 year ago

Seems like the issue might be related to the clearml configuration. The final line before the error is `?25hUsing environment access token CLEARML_AUTH_TOKEN=********`. In successful runs, the line that should follow is `Current configuration (clearml_agent v1.2.4rc2, location: /tmp/clearml.conf):`, after which it should output the configuration to the log. I'm investigating why that's not happening and will provide an update when I find more information.
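The ordering described above can be checked mechanically on a saved log. This is only a minimal sketch: the function name `classify_setup_log` and the marker constants are illustrative (not part of ClearML), and it assumes the log is plain text with timestamps on their own lines, as in the excerpts in this thread.

```python
import re

# Marker strings copied from the log excerpts in this thread.
ABORT_MARKER = "User aborted: stopping task"
TOKEN_MARKER = "Using environment access token"
CONFIG_MARKER = "Current configuration (clearml_agent"

# Lines that contain nothing but a timestamp, e.g. "2023-07-07 06:19:36".
TIMESTAMP_ONLY = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$")


def classify_setup_log(log_text: str) -> str:
    """Return 'aborted', 'ok', or 'unknown' for an agent setup log.

    Looks at the first non-timestamp line after the access token line:
    the abort message means the failure discussed here, the configuration
    header means the setup continued normally.
    """
    lines = [line.strip() for line in log_text.splitlines()]
    lines = [line for line in lines if line and not TIMESTAMP_ONLY.match(line)]
    for index, line in enumerate(lines):
        if TOKEN_MARKER in line:
            following = lines[index + 1] if index + 1 < len(lines) else ""
            if ABORT_MARKER in following:
                return "aborted"
            if CONFIG_MARKER in following:
                return "ok"
    return "unknown"
```

Running this over downloaded logs would at least flag tasks that ClearML still shows as 'Running' after the abort line was written.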

mmartin9684-sil commented 1 year ago

Another experiment ran into this error this morning (although most experiments seem to be running successfully).

This experiment was running the 'translate' script (rather than the 'experiment' script). Two other experiments were running concurrently. This was the second of 4 runs of the 'translate' script that were in the same queue (serval_production); the first run succeeded, and the third run started successfully immediately after this failure.

mshannon-sil commented 1 year ago

Update: the issue is occurring right after clearml runs the command `$LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring --id [task_id]`. I'm pretty sure that when it fails, it fails before it reaches any of our code, since it aborts right at the beginning, often before even outputting the clearml configuration. And it doesn't seem likely that the problem is the configuration, because otherwise the runs wouldn't be succeeding at all.

It's possible there's an issue with the server that's causing the tasks to abort, maybe related to the shared mounted cache. Bryan Vold is working on getting me access to the server so that I can investigate this.

mmartin9684-sil commented 1 year ago

Is there any value in looking at updating the release level of our clearml-agent software? I notice that the release we're using for that is quite old (1.2.4rc2, May '22); the latest GA release is 1.5.2 (Mar '23). And the ClearML Server has been updated several times over the past year, most recently on May 28. I think it's on release 3.17 now.
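For comparing release levels like these in a script, a rough sketch follows. The helper names `release_tuple` and `is_older` are made up for illustration, and the comparison deliberately ignores pre-release suffixes such as `rc2`, so it is cruder than a real version parser.

```python
import re
from typing import Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric release, e.g. '1.2.4rc2' -> (1, 2, 4)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric release found in {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def is_older(installed: str, minimum: str) -> bool:
    """True if the installed release is numerically below the minimum."""
    return release_tuple(installed) < release_tuple(minimum)


# Versions quoted in this thread.
print(is_older("1.2.4rc2", "1.5.2"))
```

Because suffixes are dropped, `1.2.4rc2` and `1.2.4` compare as equal here; that is acceptable for spotting an agent that is several releases behind, but not for anything finer.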

mshannon-sil commented 1 year ago

Yeah I'll take a look at that as well. Even if it's not the cause of the problem, it's likely a good idea to upgrade that now.

bhartmoore commented 1 year ago

Just FYI - I received this error twice this morning running the experiment command (NLLB.1.3B.ne_BNBT-bap_BART.NT_test_OT and NLLB.1.3B.ne_BNBT-bap_BART.NT, the second of which I reset) when the queues were empty. One was on langtech_40gb and one was on idx_40gb. Both continued to think they were running after throwing the error, and appeared to be hung; it was only when I scrolled back up that I found the error:

?25lcreating virtual environment...
creating shared libraries...
upgrading shared libraries...
2023-07-07 06:19:31
installing poetry from spec 'poetry==1.2.2'...
2023-07-07 06:19:36
  installed package poetry 1.2.2, installed using Python 3.8.10
  These apps are now globally available
    - poetry
done! ✨ 🌟 ✨
?25hUsing environment access token CLEARML_AUTH_TOKEN=********
2023-07-07 06:19:36
User aborted: stopping task (3)
2023-07-07 06:19:37
Current configuration (clearml_agent v1.2.4rc2, location: /tmp/clearml.conf):
----------------------
api.version = 1.5
api.verify_certificate = true
api.default_version = 1.5

...etc - log continues and finally hangs at this point:

Executing task id [e9cc2751190747b297483ba38204fa5e]:
repository = https://github.com/sillsdev/silnlp.git
branch = tokenizer_updates_character_handling
version_num = 2632429c5e7c22a8fcfb0fe34f02892fda13365f
tag = 
docker_cmd = nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu20.04 -v /home/clearml/.clearml/hf-cache:/root/.cache/huggingface
entry_point = -m silnlp.nmt.experiment  FT-Nepal\NLLB.1.3B.ne_BNBT-bap_BART.NT_test_OT --mixed-precision --memory-growth --clearml-queue langtech_40gb --save-checkpoints
working_dir = .

Both tasks completed successfully when I started them again.

mshannon-sil commented 1 year ago

In my last push to the tokenizer_updates_character_handling branch, I finished updating the clearml version. Could you pull the most recent changes and see if that fixes it?

bhartmoore commented 1 year ago

After pulling the changes, I did see this error again this morning while running a translate command on an empty queue. The next attempt succeeded. It seemed to happen earlier than usual.

The command that resulted in the error was `poetry run python -m silnlp.nmt.translate --checkpoint best --src-project NNRV --books GEN RUT 1SA EST JON --trg-iso scp_Deva --src-iso npi_Deva --clearml-queue langtech_40gb FT-Nepal\NLLB.1.3B.npi_HYNBT-scp_Hyolmo.NT`

I've left the log on the server. The end is here -

Installing collected packages: urllib3, platformdirs, distlib, filelock, virtualenv, six, pathlib2, future, pyjwt, pyparsing, pyhocon, attrs, python-dateutil, PyYAML, psutil, pyrsistent, jsonschema, idna, chardet, certifi, requests, orderedmultidict, furl, clearml-agent
Successfully installed PyYAML-5.4.1 attrs-20.3.0 certifi-2023.5.7 chardet-4.0.0 clearml-agent-1.2.4rc2 distlib-0.3.6 filelock-3.12.2 furl-2.1.3 future-0.18.3 idna-2.10 jsonschema-3.2.0 orderedmultidict-1.0.1 pathlib2-2.3.7.post1 platformdirs-3.8.1 psutil-5.8.0 pyhocon-0.3.60 pyjwt-2.0.1 pyparsing-2.4.7 pyrsistent-0.19.3 python-dateutil-2.8.2 requests-2.25.1 six-1.15.0 urllib3-1.26.16 virtualenv-20.23.1
Using environment access token CLEARML_AUTH_TOKEN=********
2023-07-11 11:11:17
User aborted: stopping task (3)

mshannon-sil commented 1 year ago

Okay thanks for the update. I just got access to the server at the end of last week, so I think that the next step will be upgrading the clearml-agent on the server. I'll get to that once I finish with the upgrades to the tokenizer for the upcoming workshop.

bhartmoore commented 1 year ago

Still getting this on occasion even after the updates. This was a translate command; here's the link to the task on ClearML. https://app.sil.hosted.allegro.ai/projects/07c343e876934726ad48cc6583d90c2d/experiments/a06e86901d4c4f8586ee136d39fe3bbf/output/log

mmartin9684-sil commented 1 year ago

Several more recent (15 July) experiments failed with this error:

https://app.sil.hosted.allegro.ai/projects/*/experiments/e63341dee6324e11aa2889bc87688ba9/info-output/log?columns=selected&columns=type&columns=name&columns=tags&columns=status&columns=project.name&columns=users&columns=started&columns=last_update&columns=last_iteration&columns=parent.name&columns=id&order=-last_update&filter=users:1c21fba2842945bd9fc37f9bd87686ee&deep=true
https://app.sil.hosted.allegro.ai/projects/*/experiments/ee1a1f5c689d4f2fad9c923f239fb910/info-output/log?columns=selected&columns=type&columns=name&columns=tags&columns=status&columns=project.name&columns=users&columns=started&columns=last_update&columns=last_iteration&columns=parent.name&columns=id&order=-last_update&filter=users:1c21fba2842945bd9fc37f9bd87686ee&deep=true

These were run with the latest update to the branch. The latest commit definitely has reduced the rate of this error, although it hasn't been eliminated.

mmartin9684-sil commented 1 year ago

Another example of the experiment setup failing with the "User aborted: stopping task (3)" error message and with a dump of the ClearML configuration: https://app.sil.hosted.allegro.ai/projects/*/experiments/fc5ddd77178d4fbd938f8f991da5a228/info-output/log?columns=selected&columns=type&columns=name&columns=tags&columns=status&columns=project.name&columns=users&columns=started&columns=last_update&columns=last_iteration&columns=parent.name&columns=id&order=-last_update&filter=users:1c21fba2842945bd9fc37f9bd87686ee&deep=true

This failure happened today (July 18).

johnml1135 commented 1 year ago

This was fixed with the most recent released version of ClearML Agent. We haven't seen the error again, thanks!

mshannon-sil commented 1 year ago

We're running into this issue again when the AQuA server reboots. There are two ways to start the clearml agents: a script that the user can call, and a script that the server calls on reboot. The user-run script creates clearml agents with the latest version of clearml-agent (1.6.1), so that's not an issue. However, the reboot script creates clearml agents with an old version of clearml-agent (1.2.4rc2). We're not sure yet why this is happening, but we think the reboot script might be using a different Python installation.
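One way to test the "different Python installation" theory would be to run the same small check from both start-up paths and compare the output. A sketch using only the standard library (`importlib.metadata`, Python 3.8+); the function name `describe_package` is hypothetical, and how the check gets invoked from each script is left open.

```python
import sys
from importlib import metadata


def describe_package(name: str = "clearml-agent") -> dict:
    """Report which interpreter is running and which copy of `name` it resolves."""
    report = {"python": sys.executable, "package": name,
              "version": None, "location": None}
    try:
        dist = metadata.distribution(name)
    except metadata.PackageNotFoundError:
        # The package is not visible to this interpreter at all.
        return report
    report["version"] = dist.version
    # Directory that contains the package's metadata (usually site-packages).
    report["location"] = str(dist.locate_file(""))
    return report


if __name__ == "__main__":
    print(describe_package())
```

If the two scripts print different `python` or `location` values, that would point to the interpreter or search path rather than to the scripts themselves.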

ddaspit commented 1 year ago

Has this been fixed?

mshannon-sil commented 1 year ago

Not yet, I'll take a look at fixing the reboot issue today.

mshannon-sil commented 1 year ago

The program was using a few isolated pip packages that were installed under the /home/clearml/.local/lib folder and included the older version of clearml-agent, which was overriding the newer clearml-agent package that was installed elsewhere. I backed the packages up into a .tar file, deleted them, and rebooted the server. It's now using the correct version of clearml on reboot, and experiments are running without issue.
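For anyone hitting something similar, shadowing like this can be spotted by listing every copy of a package on the search path. A hedged sketch with the standard library only; `find_copies` is an illustrative name, and it assumes each copy ships normal `.dist-info` or `.egg-info` metadata.

```python
import re
import sys
from importlib import metadata
from typing import List, Optional, Tuple


def _normalize(name: str) -> str:
    """Normalize a distribution name so 'clearml_agent' matches 'clearml-agent'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def find_copies(name: str,
                search_path: Optional[List[str]] = None) -> List[Tuple[str, str]]:
    """Return (version, directory) for every installed copy of `name`.

    Directories are scanned in the order given (default: sys.path), so the
    first entry is the copy that earlier path entries would expose first.
    """
    paths = list(sys.path if search_path is None else search_path)
    wanted = _normalize(name)
    copies = []
    for dist in metadata.distributions(path=paths):
        meta = dist.metadata
        dist_name = meta.get("Name") if meta is not None else None
        if dist_name and _normalize(dist_name) == wanted:
            copies.append((dist.version, str(dist.locate_file(""))))
    return copies


if __name__ == "__main__":
    for version, directory in find_copies("clearml-agent"):
        print(version, directory)
```

More than one line of output means multiple copies are installed; a copy under a user directory such as `~/.local/lib` listed ahead of the intended one would match the situation described above.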

sillsdev / silnlp

Failure to run ClearML experiment (User aborted: stopping task (3)) #173