microsoft / vscode-tools-for-ai

Azure Machine Learning for Visual Studio Code, previously called Visual Studio Code Tools for AI, is an extension to easily build, train, and deploy machine learning models to the cloud or the edge with Azure Machine Learning service.

Debug Jobs Timeout Error #2213

Open PickHub opened 8 months ago

PickHub commented 8 months ago

Does this occur consistently? Yes

Repro steps:

  1. Submit a job on Azure Machine Learning
  2. Head to the job on Azure ML portal and click "Debug and monitor" and run the "vscode" application.

Expected behavior: vscode connects to the running job

Actual behavior: Failed to connect to the remote extension host server (Error: request to https://<redacted>.eastus.nodes.azureml.ms:8889/api/terminals?1698344869411 failed, reason: connect ETIMEDOUT <some_ip>:8889)

I'm following the "Debug jobs and monitor training progress" guide to attach the vscode debugger to an AML job. I'd appreciate any help fixing this timeout error and getting this to run.

The error appears after opening a new vscode window under "Debug and monitor" in the AML portal. It displays "Installing VS Code server on <JOB_NAME>" for a while before the error.
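An `ETIMEDOUT` on connect means the TCP handshake to port 8889 never completed, which is different from the port being actively refused. The sketch below is one way to check that from the client machine using only the Python standard library. The helper name `probe_tcp` is hypothetical and not part of any Azure tooling, and the result only describes the network path from the machine where it runs.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    """Attempt a plain TCP connect and report how it ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"  # handshake completed; something is listening
    except socket.timeout:
        return "timeout"  # no answer at all, e.g. packets dropped on the way
    except ConnectionRefusedError:
        return "refused"  # host reachable, but nothing accepts on that port
    except OSError as exc:
        return f"error: {exc}"  # DNS failure, unreachable network, etc.
```

Calling `probe_tcp(host, 8889)` with the node hostname from the error message, once from the laptop and once from a machine inside the VNET, would show whether both locations see the same result.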

Error Message

Action: Resolver.resolve
Error type: 70
Error Message: request to redacted:url failed, reason: connect ETIMEDOUT redacted:id

Version: 0.36.0
OS: darwin
OS Release: 23.0.0
Product: Visual Studio Code
Product Version: 1.83.1
Language: en

Call Stack
```
s extension.js:2:1985921
extension.js:2:2012910
extension.js:2:2012910
```

Code

#!/usr/bin/env python
from azure.ai.ml import command
from azure.ai.ml.entities import VsCodeJobService

def main() -> None:
    ml_client = new_ml_client()
    env = ml_client.environments.get("<ENV_NAME>", label="latest")

    command_job = command(
        code="./src",
        command="python -m debugpy --listen localhost:5678 --wait-for-client pipeline.py && sleep 10m",
        environment=env,
        compute="<COMPUTE>",
        services={
            "vscode": VsCodeJobService(nodes="all"),
        },
    )

    job = ml_client.jobs.create_or_update(
        job=command_job,
        experiment_name=experiment_name,
    )
    print(f"Submitted job at url: \n{job.studio_url}")

This is the Dockerfile for creating the Environment:

FROM ubuntu:20.04

# Set timezone
ENV TZ=America/New_York
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone

RUN apt-get update && \
    apt-get install -y openssh-server python3-pip 

RUN ln -s /usr/bin/python3 /usr/bin/python

COPY requirements.txt /tmp/requirements.txt

RUN pip install -r /tmp/requirements.txt

RUN pip install debugpy ipykernel

sevillal commented 8 months ago

@PickHub thanks for filing this issue.

  1. Could you please confirm if you are using a workspace secured with Private Link?
  2. Could you please send us the logs by following these instructions?
  3. Are you able to use Jupyter from the Azure ML Studio?
  4. Are you able to use terminals from Jupyter in the Azure ML Studio?

PickHub commented 8 months ago

Thanks for getting back @sevillal!

  1. Our workspace has public network access enabled from all networks. We do have a private endpoint connection set up. Is that going to be a problem?
  2. traces.txt
  3. No, the request times out: ERR_CONNECTION_TIMED_OUT
  4. See above
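When a workspace has both public access and a private endpoint, the same hostname will typically resolve to a private address inside the virtual network and to a public address outside it. The sketch below is one way to see which kind of address a given machine gets, using only the Python standard library. The helper names `resolve_addresses` and `is_private` are hypothetical, and the result is only a hint about DNS configuration, not a diagnosis of this issue.

```python
import ipaddress
import socket


def resolve_addresses(hostname: str) -> list[str]:
    """Return the distinct IP addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None)
    # getaddrinfo returns one entry per socket type; keep each address once.
    return sorted({info[4][0] for info in infos})


def is_private(ip: str) -> bool:
    """True if Python's ipaddress module classifies ip as a private address."""
    return ipaddress.ip_address(ip).is_private
```

Running `[(ip, is_private(ip)) for ip in resolve_addresses(host)]` on the laptop and on the VM in the VNET, with the node hostname from the error message, would show whether the two machines resolve it differently.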

Additionally we spun up a VM in our VNET. Connecting to the job via SSH from that VM fails with:

Traceback (most recent call last):
  File "/home/azureuser/.azure/cliextensions/ml/azext_mlv2/manual/custom/_ssh_connector.py", line 118, in <module>
    SshConnector().connect_ssh()
  File "/home/azureuser/.azure/cliextensions/ml/azext_mlv2/manual/custom/_ssh_connector.py", line 49, in connect_ssh
    loop.run_until_complete(self._connect_ssh())
  File "/opt/az/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/azureuser/.azure/cliextensions/ml/azext_mlv2/manual/custom/_ssh_connector.py", line 63, in _connect_ssh
    async with websockets.client.connect(
  File "/home/azureuser/.local/lib/python3.10/site-packages/websockets/legacy/client.py", line 629, in __aenter__
    return await self
  File "/home/azureuser/.local/lib/python3.10/site-packages/websockets/legacy/client.py", line 647, in __await_impl_timeout__
    return await self.__await_impl__()
  File "/home/azureuser/.local/lib/python3.10/site-packages/websockets/legacy/client.py", line 654, in __await_impl__
    await protocol.handshake(
  File "/home/azureuser/.local/lib/python3.10/site-packages/websockets/legacy/client.py", line 325, in handshake
    raise InvalidStatusCode(status_code, response_headers)
websockets.exceptions.InvalidStatusCode: server rejected WebSocket connection: HTTP 403
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535

sevillal commented 8 months ago

Thanks @PickHub for your response.

> Our workspace has public network access enabled from all networks. We do have a private endpoint connection set up. Is that going to be a problem?

It should not be a problem; public network access should allow Jupyter and other services from anywhere.

I have some follow-up questions:

  1. From the VM inside the VNET, are you able to connect using VS Code?
  2. From the VM inside the VNET, are you able to use Jupyter from the Azure ML Studio?

PickHub commented 8 months ago

@sevillal Thanks again for looking into this!

  1. I've only connected from the VM via ssh, but not with VS Code. I did get asked to validate the fingerprint (which didn't happen without the VM), but then got this error:
    Traceback (most recent call last):
      File "/home/azureuser/.azure/cliextensions/ml/azext_mlv2/manual/custom/_ssh_connector.py", line 118, in <module>
        SshConnector().connect_ssh()
      File "/home/azureuser/.azure/cliextensions/ml/azext_mlv2/manual/custom/_ssh_connector.py", line 49, in connect_ssh
        loop.run_until_complete(self._connect_ssh())
      File "/opt/az/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
        return future.result()
      File "/home/azureuser/.azure/cliextensions/ml/azext_mlv2/manual/custom/_ssh_connector.py", line 63, in _connect_ssh
        async with websockets.client.connect(
      File "/home/azureuser/.local/lib/python3.10/site-packages/websockets/legacy/client.py", line 629, in __aenter__
        return await self
      File "/home/azureuser/.local/lib/python3.10/site-packages/websockets/legacy/client.py", line 647, in __await_impl_timeout__
        return await self.__await_impl__()
      File "/home/azureuser/.local/lib/python3.10/site-packages/websockets/legacy/client.py", line 654, in __await_impl__
        await protocol.handshake(
      File "/home/azureuser/.local/lib/python3.10/site-packages/websockets/legacy/client.py", line 325, in handshake
        raise InvalidStatusCode(status_code, response_headers)
    websockets.exceptions.InvalidStatusCode: server rejected WebSocket connection: HTTP 403
    kex_exchange_identification: Connection closed by remote host
    Connection closed by UNKNOWN port 65535

    Would I be able to run VS Code from the shell that I'm using to ssh into our VM?

  2. No, that does not work either and just times out.

sevillal commented 8 months ago

@PickHub thanks for those details.

  1. No, you will not be able to run VS Code from an ssh shell. Are you able to use RDP or some other desktop sharing to reach the VM in your virtual network?

Are you open to having a triage call next? Please let me know your availability and I can schedule some time.

PickHub commented 7 months ago

Hey @sevillal, sorry about the late response, I've been on vacation. Would Monday or Wednesday 8 or 9am PST work for you?

sevillal commented 7 months ago

Hey @PickHub, no worries, I hope you've had a great time. I've scheduled some time for next week.

sevillal commented 4 months ago

Closing due to inactivity, please reopen if needed.

PickHub commented 3 months ago

I'm now trying this with a Windows VM, with the following setting disabled in vscode: Image

But now the debug app is "not started":

Image

PickHub commented 3 months ago

@sevillal Could we re-open this, please?

sevillal commented 2 months ago

@PickHub reopening this issue. I have a question: is your job a multi-node job? Are you able to connect to a job running on a single node?

PickHub commented 1 month ago

Thanks for reopening @sevillal 🙏 This is happening when running on a single Standard_D13_v2 (8 cores, 56 GB RAM, 400 GB disk) compute. Which would mean a single node, correct?

sevillal commented 1 month ago

@PickHub I think that's just the compute. Do you mind sharing the YAML file you are using for starting the job?
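For context on the single-node question: in the Azure ML CLI v2 command job YAML, the number of nodes a job requests is set by `resources.instance_count`, independently of the VM size of the compute, and to my knowledge it defaults to 1 when omitted. The sketch below reads that field from a job spec that has already been loaded into a Python dict. The helper name `node_count` is hypothetical, and the default of 1 is an assumption to verify against the schema reference.

```python
def node_count(job_spec: dict) -> int:
    """Return resources.instance_count from a job spec, assuming a default of 1."""
    # A missing or empty "resources" section is treated as "no override".
    resources = job_spec.get("resources") or {}
    return int(resources.get("instance_count", 1))
```

A spec with no `resources` section, like one equivalent to the Python snippet earlier in this thread, would give `node_count(spec) == 1` under that assumption.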