pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

Error while trying to run on TPU from VM instance. #4896

Open listless-dude opened 1 year ago

listless-dude commented 1 year ago

❓ Questions and Help

I set up XRT_TPU_CONFIG with the IP address of the TPU. This is my test.py script:

import os
import torch
import torch_xla.core.xla_model as xm

os.environ['XRT_TPU_CONFIG'] = "tpu_worker;0;10.128.0.29:8470"

dev = xm.xla_device() ## Error while executing this line
t1 = torch.randn(3,3,device=dev)
t2 = torch.randn(3,3,device=dev)
print(t1 + t2)

Here's the error:

2023-04-17 19:35:38.550666: F    5184 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1362] Non-OK-status: session.Run({tensorflow::Output(result, 0)}, &outputs) status: UNIMPLEMENTED: method "RunStep" not implemented
*** Begin stack trace ***
        tsl::CurrentStackTrace()
        xla::XrtComputationClient::InitializeAndFetchTopology(std::string const&, int, std::string const&, tensorflow::ConfigProto const&)
        xla::XrtComputationClient::InitializeDevices(std::unique_ptr<tensorflow::tpu::TopologyProto, std::default_delete<tensorflow::tpu::TopologyProto> >)
        xla::XrtComputationClient::XrtComputationClient(xla::XrtComputationClient::Options, std::unique_ptr<tensorflow::tpu::TopologyProto, std::default_delete<tensorflow::tpu::TopologyProto> >)
        xla::ComputationClient::Create()
        xla::ComputationClient::Get()
        PyCFunction_Call
        _PyObject_MakeTpCall
        _PyEval_EvalFrameDefault
        _PyFunction_Vectorcall
        _PyObject_GenericGetAttrWithDict
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_Vectorcall
        _PyEval_EvalCodeWithName
        PyEval_EvalCode
        PyRun_SimpleFileExFlags
        Py_BytesMain
        __libc_start_main
*** End stack trace ***

Aborted

I don't know what I'm doing wrong. Can someone suggest a possible fix?

vanbasten23 commented 1 year ago

Do you require a specific PyTorch/XLA version, or is it fine to use the most recent stable version (2.0)? If version 2.0 is fine, can you remove the line os.environ['XRT_TPU_CONFIG'] = "tpu_worker;0;10.128.0.29:8470" and retry?

Also, XRT_TPU_CONFIG, as the name suggests, uses the XRT runtime, whose support we plan to drop in the near future.
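
The suggestion above can be sketched as the following migration (my assumption of the intended setup, based on this comment, not an official recipe: torch_xla 2.0 on a TPU VM, with the XRT selector dropped in favor of PJRT):

```shell
# Sketch of the suggested migration: remove the deprecated XRT selector
# and choose the runtime via the PJRT_DEVICE environment variable instead.
unset XRT_TPU_CONFIG
export PJRT_DEVICE=TPU
# then rerun the script: python test.py
```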

listless-dude commented 1 year ago

I removed it and ran export PJRT_DEVICE=TPU, but still got the same error.
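
One possible gotcha (an assumption on my part, not confirmed in this thread): PJRT_DEVICE has to be visible to the Python process before torch_xla initializes its runtime, so an export done in a different shell session won't take effect. Setting it at the top of the script, before the torch_xla import, sidesteps that:

```python
import os

# Assumption: setting PJRT_DEVICE in-process before torch_xla is imported
# is equivalent to `export PJRT_DEVICE=TPU` in the launching shell.
os.environ['PJRT_DEVICE'] = 'TPU'
# If a stale XRT_TPU_CONFIG is still exported, drop it as well so the
# deprecated XRT path cannot be picked up:
os.environ.pop('XRT_TPU_CONFIG', None)

print(os.environ['PJRT_DEVICE'])       # → TPU
print('XRT_TPU_CONFIG' in os.environ)  # → False

# ...only now import torch_xla, e.g.:
# import torch_xla.core.xla_model as xm
# dev = xm.xla_device()
```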

AdSear commented 1 year ago

Same for me. @mr-oogway, any update?

ManfeiBai commented 1 year ago

Hi @vanbasten23, is it ok to assign this to you?

vanbasten23 commented 1 year ago

I tried your script on https://colab.sandbox.google.com/github/pytorch/xla/blob/master/contrib/colab/getting-started.ipynb and the script runs fine on the colab.