[BUG] RunHistory initialization failed: libffi.so.7: cannot open shared object file: No such file or directory

chrisbossard commented 3 years ago

Describe the bug

While following the steps in the Banana Tutorial, under the section "Running the notebook", I see this error:

{
  "error": {
    "code": "ServiceError",
    "severity": null,
    "message": "AzureMLCompute job failed.\nServiceError: runTaskLetTask failed because: libffi.so.7: cannot open shared object file: No such file or directory\n\tReason: Job failed with non-zero exit Code",
    "messageFormat": null,
    "messageParameters": null,
    "referenceCode": null,
    "detailsUri": null,
    "target": null,
    "details": [],
    "innerError": null,
    "debugInfo": null,
    "additionalInfo": null
  },
  "correlation": {
    "operation": "cf742550df05044dbf2b80b3397f31d7",
    "request": "8ae5bc2bda1f4e59"
  },
  "environment": "australiaeast",
  "location": "australiaeast",
  "time": "2021-07-16T05:26:14.7993082+00:00",
  "componentName": "execution-worker"
}

The notebook I am running is SemanticsSegmentationUNet.ipynb


MaxStrange commented 3 years ago

Thanks for the repro steps! I'll try it out and see if I can reproduce the bug on our end.

MaxStrange commented 3 years ago

I can't seem to reproduce the bug. The banana model trains just fine on my end.

Did you start from a fresh GPU compute node or have you used it for other things before?

chrisbossard commented 3 years ago

This GPU compute node was only ever used for this; however, I did delete it and create a new one for this trial. I still received the error on my end. This time I will attach the driver logs from the compute node:

2021/07/17 03:42:03 Starting App Insight Logger for task: runTaskLet
2021/07/17 03:42:03 Version: 3.0.01650.0004 Branch: .SourceBranch Commit: 37e4354
2021/07/17 03:42:03 Attempt 1 of http call to http://10.0.0.4:16384/sendlogstoartifacts/info
2021/07/17 03:42:03 Attempt 1 of http call to http://10.0.0.4:16384/sendlogstoartifacts/status
[2021-07-17T03:42:03.540749] Entering context manager injector.
[context_manager_injector.py] Command line Options: Namespace(inject=['ProjectPythonPath:context_managers.ProjectPythonPath', 'Dataset:context_managers.Datasets', 'RunHistory:context_managers.RunHistory', 'TrackUserError:context_managers.TrackUserError'], invocation=['train.py', '--data-path', 'DatasetConsumptionConfig:input', '--output-path', './outputs', '--epochs', '3', '--batch-size', '2', '--learning-rate', '0.001', '--scale', '0.5', '--to-bgr'])
Script type = None
[2021-07-17T03:42:04.053160] Entering Run History Context Manager.
[2021-07-17T03:42:04.168754] Writing error with error_code ServiceError and error_hierarchy ServiceError/ImportError to hosttool error file located at /mnt/batch/tasks/workitems/8c9a8fec-f1de-4f51-b5e4-e7160b2e8b12/job-1/bananas-experiment_1_e21f5ce0-a830-471b-ba3f-bb22ff217ae2/wd/runTaskLetTask_error.json
Starting the daemon thread to refresh tokens in background for process with pid = 82
Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/jobs/percept_poc_chris/azureml/bananas-experiment_1626492943_72d1e2e9/wd/azureml/bananas-experiment_1626492943_72d1e2e9/azureml-setup/context_manager_injector.py", line 454, in <module>
    execute_with_context(cm_objects, options.invocation)
  File "/mnt/batch/tasks/shared/LS_root/jobs/percept_poc_chris/azureml/bananas-experiment_1626492943_72d1e2e9/wd/azureml/bananas-experiment_1626492943_72d1e2e9/azureml-setup/context_manager_injector.py", line 132, in execute_with_context
    stack.enter_context(wrapper)
  File "/mnt/batch/tasks/shared/LS_root/jobs/percept_poc_chris/azureml/bananas-experiment_1626492943_72d1e2e9/wd/azureml/bananas-experiment_1626492943_72d1e2e9/azureml-setup/_vendor_contextlib2.py", line 356, in enter_context
    result = _cm_type.__enter__(cm)
  File "/mnt/batch/tasks/shared/LS_root/jobs/percept_poc_chris/azureml/bananas-experiment_1626492943_72d1e2e9/wd/azureml/bananas-experiment_1626492943_72d1e2e9/azureml-setup/context_manager_injector.py", line 80, in __enter__
    self.context_manager.__enter__()
  File "/mnt/batch/tasks/shared/LS_root/jobs/percept_poc_chris/azureml/bananas-experiment_1626492943_72d1e2e9/wd/azureml/bananas-experiment_1626492943_72d1e2e9/azureml-setup/context_managers.py", line 380, in __enter__
    self.history_context = get_history_context_manager(*self.history_config)
  File "/azureml-envs/azureml_2d7f1a40649837121a676c6b52ed54d2/lib/python3.6/site-packages/azureml/history/_tracking.py", line 179, in get_history_context_manager
    deny_list=deny_list + [USER_LOG_PATH])
  File "/azureml-envs/azureml_2d7f1a40649837121a676c6b52ed54d2/lib/python3.6/site-packages/azureml/history/_tracking.py", line 367, in _get_run_for_context_managers
    from azureml.core.run import Run
  File "/azureml-envs/azureml_2d7f1a40649837121a676c6b52ed54d2/lib/python3.6/site-packages/azureml/core/__init__.py", line 13, in <module>
    from .workspace import Workspace
  File "/azureml-envs/azureml_2d7f1a40649837121a676c6b52ed54d2/lib/python3.6/site-packages/azureml/core/workspace.py", line 22, in <module>
    from azureml._project import _commands
  File "/azureml-envs/azureml_2d7f1a40649837121a676c6b52ed54d2/lib/python3.6/site-packages/azureml/_project/_commands.py", line 31, in <module>
    from azureml.core.private_endpoint import PrivateEndPoint
  File "/azureml-envs/azureml_2d7f1a40649837121a676c6b52ed54d2/lib/python3.6/site-packages/azureml/core/private_endpoint.py", line 10, in <module>
    from azureml.core.authentication import InteractiveLoginAuthentication
  File "/azureml-envs/azureml_2d7f1a40649837121a676c6b52ed54d2/lib/python3.6/site-packages/azureml/core/authentication.py", line 30, in <module>
    from cryptography.fernet import Fernet
  File "/azureml-envs/azureml_2d7f1a40649837121a676c6b52ed54d2/lib/python3.6/site-packages/cryptography/fernet.py", line 16, in <module>
    from cryptography.hazmat.primitives import hashes, padding
  File "/azureml-envs/azureml_2d7f1a40649837121a676c6b52ed54d2/lib/python3.6/site-packages/cryptography/hazmat/primitives/padding.py", line 11, in <module>
    from cryptography.hazmat.bindings._padding import lib
ImportError: libffi.so.7: cannot open shared object file: No such file or directory
[2021-07-17T03:42:04.259850] Finished context manager injector with Exception.
2021/07/17 03:42:05 Succeeded to parse control script error: /mnt/batch/tasks/workitems/8c9a8fec-f1de-4f51-b5e4-e7160b2e8b12/job-1/bananas-experiment_1_e21f5ce0-a830-471b-ba3f-bb22ff217ae2/wd/runTaskLetTask_error.json to json
2021/07/17 03:42:05 Wrapper cmd failed with err: exit status 1
2021/07/17 03:42:05 Attempt 1 of http call to http://10.0.0.4:16384/sendlogstoartifacts/status
2021/07/17 03:42:05 mpirun version string: { Intel(R) MPI Library for Linux OS, Version 2018 Update 3 Build 20180411 (id: 18329) Copyright 2003-2018 Intel Corporation. }
2021/07/17 03:42:05 MPI publisher: intel ; version: 2018
2021/07/17 03:42:05 Not exporting to RunHistory as the exporter is either stopped or there is no data. Stopped: false OriginalData: 3 FilteredData: 0.
2021/07/17 03:42:05 Process Exiting with Code: 1
2021/07/17 03:42:05 All App Insights Logs was sent successfully or the close timeout of 20 was reached

Do you think this is a configuration issue of my Azure ML environment?

And apologies, I accidentally closed this issue.
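For anyone triaging a similar driver log, here is a minimal sketch of how the missing library name could be pulled out of the log text. The function name `missing_shared_objects` and the regular expression are illustrative assumptions, not part of any Azure ML tooling; the pattern is based only on the loader message that appears in the logs above.

```python
import re
from typing import List

# Matches loader errors such as
# "libffi.so.7: cannot open shared object file" (pattern is an assumption
# derived from the messages quoted in this thread).
MISSING_LIB = re.compile(r"(\S+\.so(?:\.\d+)*): cannot open shared object file")


def missing_shared_objects(log_text: str) -> List[str]:
    """Return distinct shared-object names the loader failed to open,
    in order of first appearance in the log."""
    found: List[str] = []
    for name in MISSING_LIB.findall(log_text):
        if name not in found:
            found.append(name)
    return found
```

Run against the driver log in this comment, this should report a single entry for libffi.so.7, which is the library to look for in the image.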

MaxStrange commented 3 years ago

Sorry this has taken me a while; I'm pretty busy with some other high-priority work, and we are short-staffed this month. I'll try to get to this tomorrow.

SeryioGonzalez commented 3 years ago

The task crashes when a pip dependency on azureml-sdk is included in the conda YAML supplied to the cluster in Azure ML:

name: train-env
channels:
  - defaults
  - pytorch
dependencies:
  - python=3.6.2
  - pytorch
  - pillow
  - pip:
    - opencv-python
    - tensorboard
    - azureml-sdk

MaxStrange commented 3 years ago

Okay, the problem is that AML (or maybe Conda) tries to find the wrong version of libffi installed on the system. Specifically, the AML library is trying to open libffi.so.7, but only version 6 is installed system-wide. I added a line to the Dockerfile which will symlink Conda's libffi.so.7 to a location where the library can find it.
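One way to check this diagnosis from inside the container is to ask the dynamic loader directly whether each libffi version can be opened. The following is a rough sketch using the standard-library `ctypes` module; the helper name `can_load` is hypothetical. It only approximates what the failing extension module sees, because `ctypes.CDLL` uses the default loader search path and ignores any RPATH baked into that extension. Note also that `ctypes` itself typically depends on libffi, so the import may fail in a badly broken environment.

```python
import ctypes


def can_load(library_name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Raised when the loader cannot find or open the library.
        return False
    return True


if __name__ == "__main__":
    # The two versions discussed in this thread.
    for name in ("libffi.so.6", "libffi.so.7"):
        print(name, "loadable" if can_load(name) else "NOT loadable")
```

If version 6 loads and version 7 does not, that matches the situation described here.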

You will likely have to remove your cached version of the Docker image from the ACR that is linked to your AML workspace and then use my branch (until it gets merged in, which should be either today or tomorrow). Alternatively, modify the Dockerfile in your workspace to include this line:

RUN ln -s /opt/miniconda/lib/libffi.so.7 /usr/lib/x86_64-linux-gnu/libffi.so.7
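A plain `ln -s` fails if something already exists at the link path, which can matter if the base image later ships libffi.so.7 itself. As a sketch only, the same operation can be written so that it is safe to repeat. The helper `ensure_symlink` is a hypothetical name, and the two paths in the usage comment are the ones from the line quoted in this comment.

```python
import os


def ensure_symlink(target: str, link_path: str) -> bool:
    """Create link_path -> target unless something already exists at
    link_path. Returns True if a new link was created."""
    if os.path.lexists(link_path):
        # Leave any existing file or link untouched.
        return False
    os.symlink(target, link_path)
    return True


# Example call with the paths from this thread (requires root in the image):
# ensure_symlink("/opt/miniconda/lib/libffi.so.7",
#                "/usr/lib/x86_64-linux-gnu/libffi.so.7")
```

The check uses `os.path.lexists` rather than `os.path.exists` so that a dangling symlink at the destination is also treated as already present.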

The branch that includes this fix is here.

MaxStrange commented 3 years ago

The fix should be merged in now. Let me know if this fixes it on your end, and then I will close this issue.

MaxStrange commented 3 years ago

Closing this due to inactivity. Please feel free to reopen this issue if you still need it.

microsoft / azure-percept-advanced-development

[BUG] RunHistory initialization failed: libffi.so.7: cannot open shared object file: No such file or directory #48