Closed chrisbossard closed 3 years ago
Thanks for the repro steps! I'll try it out and see if I can reproduce the bug on our end.
I can't seem to reproduce the bug. The banana model trains just fine on my end.
Did you start from a fresh GPU compute node or have you used it for other things before?
This GPU compute node was only ever used for this; however, I did delete it and create a new one for this trial, and I still received the error on my end. This time I will attach the driver logs from the compute node:
2021/07/17 03:42:03 Starting App Insight Logger for task: runTaskLet
2021/07/17 03:42:03 Version: 3.0.01650.0004 Branch: .SourceBranch Commit: 37e4354
2021/07/17 03:42:03 Attempt 1 of http call to http://10.0.0.4:16384/sendlogstoartifacts/info
2021/07/17 03:42:03 Attempt 1 of http call to http://10.0.0.4:16384/sendlogstoartifacts/status
[2021-07-17T03:42:03.540749] Entering context manager injector.
[context_manager_injector.py] Command line Options: Namespace(inject=['ProjectPythonPath:context_managers.ProjectPythonPath', 'Dataset:context_managers.Datasets', 'RunHistory:context_managers.RunHistory', 'TrackUserError:context_managers.TrackUserError'], invocation=['train.py', '--data-path', 'DatasetConsumptionConfig:input', '--output-path', './outputs', '--epochs', '3', '--batch-size', '2', '--learning-rate', '0.001', '--scale', '0.5', '--to-bgr'])
Script type = None
[2021-07-17T03:42:04.053160] Entering Run History Context Manager.
[2021-07-17T03:42:04.168754] Writing error with error_code ServiceError and error_hierarchy ServiceError/ImportError to hosttool error file located at /mnt/batch/tasks/workitems/8c9a8fec-f1de-4f51-b5e4-e7160b2e8b12/job-1/bananas-experiment_1_e21f5ce0-a830-471b-ba3f-bb22ff217ae2/wd/runTaskLetTask_error.json
Starting the daemon thread to refresh tokens in background for process with pid = 82
Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/jobs/percept_poc_chris/azureml/bananas-experiment_1626492943_72d1e2e9/wd/azureml/bananas-experiment_1626492943_72d1e2e9/azureml-setup/context_manager_injector.py", line 454, in <module>
    execute_with_context(cm_objects, options.invocation)
  File "/mnt/batch/tasks/shared/LS_root/jobs/percept_poc_chris/azureml/bananas-experiment_1626492943_72d1e2e9/wd/azureml/bananas-experiment_1626492943_72d1e2e9/azureml-setup/context_manager_injector.py", line 132, in execute_with_context
    stack.enter_context(wrapper)
  File "/mnt/batch/tasks/shared/LS_root/jobs/percept_poc_chris/azureml/bananas-experiment_1626492943_72d1e2e9/wd/azureml/bananas-experiment_1626492943_72d1e2e9/azureml-setup/_vendor_contextlib2.py", line 356, in enter_context
    result = _cm_type.__enter__(cm)
  File "/mnt/batch/tasks/shared/LS_root/jobs/percept_poc_chris/azureml/bananas-experiment_1626492943_72d1e2e9/wd/azureml/bananas-experiment_1626492943_72d1e2e9/azureml-setup/context_manager_injector.py", line 80, in __enter__
    self.context_manager.__enter__()
  File "/mnt/batch/tasks/shared/LS_root/jobs/percept_poc_chris/azureml/bananas-experiment_1626492943_72d1e2e9/wd/azureml/bananas-experiment_1626492943_72d1e2e9/azureml-setup/context_managers.py", line 380, in __enter__
    self.history_context = get_history_context_manager(*self.history_config)
  File "/azureml-envs/azureml_2d7f1a40649837121a676c6b52ed54d2/lib/python3.6/site-packages/azureml/history/_tracking.py", line 179, in get_history_context_manager
    deny_list=deny_list + [USER_LOG_PATH])
  File "/azureml-envs/azureml_2d7f1a40649837121a676c6b52ed54d2/lib/python3.6/site-packages/azureml/history/_tracking.py", line 367, in _get_run_for_context_managers
    from azureml.core.run import Run
  File "/azureml-envs/azureml_2d7f1a40649837121a676c6b52ed54d2/lib/python3.6/site-packages/azureml/core/__init__.py", line 13, in <module>
    from .workspace import Workspace
  File "/azureml-envs/azureml_2d7f1a40649837121a676c6b52ed54d2/lib/python3.6/site-packages/azureml/core/workspace.py", line 22, in <module>
    from azureml._project import _commands
  File "/azureml-envs/azureml_2d7f1a40649837121a676c6b52ed54d2/lib/python3.6/site-packages/azureml/_project/_commands.py", line 31, in <module>
    from azureml.core.private_endpoint import PrivateEndPoint
  File "/azureml-envs/azureml_2d7f1a40649837121a676c6b52ed54d2/lib/python3.6/site-packages/azureml/core/private_endpoint.py", line 10, in <module>
    from azureml.core.authentication import InteractiveLoginAuthentication
  File "/azureml-envs/azureml_2d7f1a40649837121a676c6b52ed54d2/lib/python3.6/site-packages/azureml/core/authentication.py", line 30, in <module>
    from cryptography.fernet import Fernet
  File "/azureml-envs/azureml_2d7f1a40649837121a676c6b52ed54d2/lib/python3.6/site-packages/cryptography/fernet.py", line 16, in <module>
    from cryptography.hazmat.primitives import hashes, padding
  File "/azureml-envs/azureml_2d7f1a40649837121a676c6b52ed54d2/lib/python3.6/site-packages/cryptography/hazmat/primitives/padding.py", line 11, in <module>
    from cryptography.hazmat.bindings._padding import lib
ImportError: libffi.so.7: cannot open shared object file: No such file or directory
[2021-07-17T03:42:04.259850] Finished context manager injector with Exception.
2021/07/17 03:42:05 Succeeded to parse control script error: /mnt/batch/tasks/workitems/8c9a8fec-f1de-4f51-b5e4-e7160b2e8b12/job-1/bananas-experiment_1_e21f5ce0-a830-471b-ba3f-bb22ff217ae2/wd/runTaskLetTask_error.json to json
2021/07/17 03:42:05 Wrapper cmd failed with err: exit status 1
2021/07/17 03:42:05 Attempt 1 of http call to http://10.0.0.4:16384/sendlogstoartifacts/status
2021/07/17 03:42:05 mpirun version string: { Intel(R) MPI Library for Linux OS, Version 2018 Update 3 Build 20180411 (id: 18329) Copyright 2003-2018 Intel Corporation. }
2021/07/17 03:42:05 MPI publisher: intel ; version: 2018
2021/07/17 03:42:05 Not exporting to RunHistory as the exporter is either stopped or there is no data. Stopped: false OriginalData: 3 FilteredData: 0.
2021/07/17 03:42:05 Process Exiting with Code: 1
2021/07/17 03:42:05 All App Insights Logs was sent successfully or the close timeout of 20 was reached
Do you think this is a configuration issue in my Azure ML environment?
And apologies, I accidentally closed this issue.
Sorry this has taken me a while; I'm busy with some other high-priority work and we are short-staffed this month. I'll try to get to this tomorrow.
The task crashes when a pip dependency on azureml-sdk is included in the conda YAML supplied to the cluster in Azure ML:
name: train-env
channels:
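For reference, a minimal conda environment file of the shape being described might look like the following (package names and versions here are illustrative, not taken from the original report, except for azureml-sdk itself):

```yaml
# Illustrative only -- the YAML in the report above is truncated.
name: train-env
channels:
  - conda-forge
dependencies:
  - python=3.6
  - pip
  - pip:
      - azureml-sdk   # including this pip dependency triggered the crash
```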
Okay, the problem is that AML (or possibly Conda) is looking for the wrong version of libffi on the system. Specifically, the AML library tries to open libffi.so.7, but only version 6 is installed system-wide. I added a line to the Dockerfile that symlinks Conda's libffi.so.7 into a location where the loader can find it.
You will likely have to remove the cached version of the Docker image from the ACR linked to your AML workspace and then use my branch (until it gets merged in, which should be either today or tomorrow). Alternatively, modify the Dockerfile in your workspace to include this line:
RUN ln -s /opt/miniconda/lib/libffi.so.7 /usr/lib/x86_64-linux-gnu/libffi.so.7
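As a quick sanity check (my own sketch, not part of the fix in the branch), you can ask the dynamic loader whether a given soname resolves before and after adding the symlink:

```python
import ctypes

def can_load(soname: str) -> bool:
    """Return True if the dynamic loader can resolve the shared library."""
    try:
        ctypes.CDLL(soname)
        return True
    except OSError:
        return False

# On an affected image this returns False for "libffi.so.7"
# until the symlink line above is added to the Dockerfile.
print(can_load("libffi.so.7"))
```

Running this inside the container before rebuilding should show the failure; after the rebuild it should return True.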
The branch that includes this fix is here.
The fix should be merged in now. Let me know if this fixes it on your end, and then I will close this issue.
Closing this due to inactivity. Please feel free to reopen this issue if you still need it.
Describe the bug
While following the steps in the Banana Tutorial under the section "Running the notebook", I am seeing this error appear:
The notebook I am running is SemanticsSegmentationUNet.ipynb.
Screenshots