pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

RPC failed with status? #3975

Open Haris-Ali007 opened 2 years ago

Haris-Ali007 commented 2 years ago

RPC failed while running stylegan3 code on tpu-pytorch

Hello everyone. I am trying to port NVLabs' stylegan3 code to TPU to speed up processing. The code runs on Colab. I mostly commented out the parts of the original script that caused problems and were not required at this initial stage. However, I am stuck on this issue and can't figure out a solution.

To Reproduce

Clone https://github.com/NVlabs/stylegan3 and paste the following code into training/training_loop.py

# Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
#
# NVIDIA CORPORATION and its licensors retain all intellectual property
# and proprietary rights in and to this software, related documentation
# and any modifications thereto.  Any use, reproduction, disclosure or
# distribution of this software and related documentation without an express
# license agreement from NVIDIA CORPORATION is strictly prohibited.

"""Generate images using pretrained network pickle."""

import os
import re
from typing import List, Optional, Tuple, Union

import click
import dnnlib
import numpy as np
import PIL.Image
import torch

import legacy

#----------------------------------------------------------------------------

def parse_range(s: Union[str, List]) -> List[int]:
    '''Parse a comma separated list of numbers or ranges and return a list of ints.

    Example: '1,2,5-10' returns [1, 2, 5, 6, 7, 8, 9, 10]
    '''
    if isinstance(s, list): return s
    ranges = []
    range_re = re.compile(r'^(\d+)-(\d+)$')
    for p in s.split(','):
        m = range_re.match(p)
        if m:
            ranges.extend(range(int(m.group(1)), int(m.group(2))+1))
        else:
            ranges.append(int(p))
    return ranges

#----------------------------------------------------------------------------

def parse_vec2(s: Union[str, Tuple[float, float]]) -> Tuple[float, float]:
    '''Parse a floating point 2-vector of syntax 'a,b'.

    Example:
        '0,1' returns (0,1)
    '''
    if isinstance(s, tuple): return s
    parts = s.split(',')
    if len(parts) == 2:
        return (float(parts[0]), float(parts[1]))
    raise ValueError(f'cannot parse 2-vector {s}')

#----------------------------------------------------------------------------

def make_transform(translate: Tuple[float,float], angle: float):
    m = np.eye(3)
    s = np.sin(angle/360.0*np.pi*2)
    c = np.cos(angle/360.0*np.pi*2)
    m[0][0] = c
    m[0][1] = s
    m[0][2] = translate[0]
    m[1][0] = -s
    m[1][1] = c
    m[1][2] = translate[1]
    return m

#----------------------------------------------------------------------------

@click.command()
@click.option('--network', 'network_pkl', help='Network pickle filename', required=True)
@click.option('--seeds', type=parse_range, help='List of random seeds (e.g., \'0,1,4-6\')', required=True)
@click.option('--trunc', 'truncation_psi', type=float, help='Truncation psi', default=1, show_default=True)
@click.option('--class', 'class_idx', type=int, help='Class label (unconditional if not specified)')
@click.option('--noise-mode', help='Noise mode', type=click.Choice(['const', 'random', 'none']), default='const', show_default=True)
@click.option('--translate', help='Translate XY-coordinate (e.g. \'0.3,1\')', type=parse_vec2, default='0,0', show_default=True, metavar='VEC2')
@click.option('--rotate', help='Rotation angle in degrees', type=float, default=0, show_default=True, metavar='ANGLE')
@click.option('--outdir', help='Where to save the output images', type=str, required=True, metavar='DIR')
def generate_images(
    network_pkl: str,
    seeds: List[int],
    truncation_psi: float,
    noise_mode: str,
    outdir: str,
    translate: Tuple[float,float],
    rotate: float,
    class_idx: Optional[int]
):
    """Generate images using pretrained network pickle.

    Examples:

    \b
    # Generate an image using pre-trained AFHQv2 model ("Ours" in Figure 1, left).
    python gen_images.py --outdir=out --trunc=1 --seeds=2 \\
        --network=https://api.ngc.nvidia.com/v2/models/nvidia/research/stylegan3/versions/1/files/stylegan3-r-afhqv2-512x512.pkl

    \b
    # Generate uncurated images with truncation using the MetFaces-U dataset
    python gen_images.py --outdir=out --trunc=0.7 --seeds=600-605 \\
        --network=https://api.ngc.nvidia.com/v2/models/nvidia/research/stylegan3/versions/1/files/stylegan3-t-metfacesu-1024x1024.pkl
    """

    print('Loading networks from "%s"...' % network_pkl)
    device = torch.device('cpu')
    with dnnlib.util.open_url(network_pkl) as f:
        G = legacy.load_network_pkl(f)['G_ema'].to(device) # type: ignore

    os.makedirs(outdir, exist_ok=True)

    # Labels.
    label = torch.zeros([1, G.c_dim], device=device)
    if G.c_dim != 0:
        if class_idx is None:
            raise click.ClickException('Must specify class label with --class when using a conditional network')
        label[:, class_idx] = 1
    else:
        if class_idx is not None:
            print ('warn: --class=lbl ignored when running on an unconditional network')

    # Generate images.
    for seed_idx, seed in enumerate(seeds):
        print('Generating image for seed %d (%d/%d) ...' % (seed, seed_idx, len(seeds)))
        z = torch.from_numpy(np.random.RandomState(seed).randn(1, G.z_dim)).to(device)

        # Construct an inverse rotation/translation matrix and pass to the generator.  The
        # generator expects this matrix as an inverse to avoid potentially failing numerical
        # operations in the network.
        if hasattr(G.synthesis, 'input'):
            m = make_transform(translate, rotate)
            m = np.linalg.inv(m)
            G.synthesis.input.transform.copy_(torch.from_numpy(m))

        img = G(z, label, truncation_psi=truncation_psi, noise_mode=noise_mode)
        img = (img.permute(0, 2, 3, 1) * 127.5 + 128).clamp(0, 255).to(torch.uint8)
        PIL.Image.fromarray(img[0].cpu().numpy(), 'RGB').save(f'{outdir}/seed{seed:04d}.png')

#----------------------------------------------------------------------------

if __name__ == "__main__":
    generate_images() # pylint: disable=no-value-for-parameter

#----------------------------------------------------------------------------

Error

  2022-09-07 07:32:12.549446: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1662535932.549287619","description":"Error received from peer ipv4:10.115.174.50:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
  2022-09-07 07:32:31.725774: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] StackTrace:
  2022-09-07 07:32:31.725906: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] *** Begin stack trace ***
  2022-09-07 07:32:31.725920: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]  xla::util::ReportComputationError(tensorflow::Status const&, absl::lts_20211102::Span<xla::XlaComputation const* const>, absl::lts_20211102::Span<xla::Shape const* const>)
  2022-09-07 07:32:31.725936: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]  xla::XrtComputationClient::CheckCompileStatus(tensorflow::Status const&, std::vector<xla::ComputationClient::CompileInstance, std::allocator<xla::ComputationClient::CompileInstance> > const&, xla::XrtComputationClient::SessionWork const&)
  2022-09-07 07:32:31.725948: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]  
  2022-09-07 07:32:31.725958: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]  xla::util::MultiWait::Complete(std::function<void ()> const&)
  2022-09-07 07:32:31.725967: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]  
  2022-09-07 07:32:31.725977: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]  
  2022-09-07 07:32:31.725986: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]  
  2022-09-07 07:32:31.725995: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]  clone
  2022-09-07 07:32:31.726004: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] *** End stack trace ***
  2022-09-07 07:32:31.726013: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 
  2022-09-07 07:32:31.726023: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] Status: ABORTED: Session 79df9241f7f5cde7 is not found.
  Traceback (most recent call last):
    File "train.py", line 288, in <module>
      main() # pylint: disable=no-value-for-parameter
    File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 829, in __call__
      return self.main(*args, **kwargs)
    File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 782, in main
      rv = self.invoke(ctx)
    File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1066, in invoke
      return ctx.invoke(self.callback, **ctx.params)
    File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 610, in invoke
      return callback(*args, **kwargs)
    File "train.py", line 283, in main
      launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
    File "train.py", line 98, in launch_training
      subprocess_fn(rank=0, c=c, temp_dir=temp_dir)
    File "train.py", line 49, in subprocess_fn
      training_loop.training_loop(rank=rank, **c)
    File "/content/stylegan3/training/training_loop.py", line 375, in training_loop
      snapshot_data[key] = value.cpu()
    File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 738, in cpu
      return self._apply(lambda t: t.cpu())
    File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 579, in _apply
      module._apply(fn)
    File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 579, in _apply
      module._apply(fn)
    File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 579, in _apply
      module._apply(fn)
    File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 602, in _apply
      param_applied = fn(param)
    File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 738, in <lambda>
      return self._apply(lambda t: t.cpu())
  RuntimeError: ABORTED: Session 79df9241f7f5cde7 is not found.
  terminate called after throwing an instance of 'std::runtime_error'
    what():  tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1234 : Check failed: session->session()->Run( feed_inputs, {}, {cached_node.operations[0]}, &outputs) == ::tensorflow::Status::OK() (ABORTED: Session 79df9241f7f5cde7 is not found. vs. OK)
  *** Begin stack trace ***
      tensorflow::CurrentStackTrace()
      xla::XrtComputationClient::ReleaseHandles(std::vector<xla::XrtComputationClient::DeviceHandle, std::allocator<xla::XrtComputationClient::DeviceHandle> >*, std::function<xla::XrtSession::CachedNode const& (xla::XrtSession*, tensorflow::Scope const&, std::string const&)> const&, xla::metrics::Metric*, xla::metrics::Counter*)
      xla::XrtComputationClient::HandleReleaser()
      xla::util::TriggeredTask::Runner()

      clone
  *** End stack trace ***
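
The first warning in the log embeds a JSON payload in `grpc_error_string`, and pulling it apart can make the failure easier to read. The following is a minimal sketch, not part of the original report: the helper name `parse_grpc_error` and the small status-code table are hypothetical, and it assumes the log line keeps the shape shown above. It uses only the Python standard library.

```python
import json
import re

# Subset of gRPC status codes; 14 is UNAVAILABLE in the gRPC specification.
GRPC_STATUS_NAMES = {0: "OK", 4: "DEADLINE_EXCEEDED", 13: "INTERNAL", 14: "UNAVAILABLE"}


def parse_grpc_error(log_line: str) -> dict:
    """Extract the JSON payload that follows 'grpc_error_string = ' and summarize it.

    Hypothetical helper: assumes the payload is wrapped in double quotes and
    is itself valid JSON, as in the log line from this issue.
    """
    match = re.search(r'grpc_error_string = "(\{.*\})"', log_line)
    if match is None:
        raise ValueError("no grpc_error_string found in log line")
    payload = json.loads(match.group(1))
    # The peer address is embedded in the free-text description field.
    peer = re.search(r"peer (\S+)", payload.get("description", ""))
    status = payload.get("grpc_status")
    return {
        "status_code": status,
        "status_name": GRPC_STATUS_NAMES.get(status, "UNKNOWN"),
        "message": payload.get("grpc_message"),
        "peer": peer.group(1) if peer else None,
    }


# The warning line from the log above, copied verbatim.
LOG_LINE = (
    '2022-09-07 07:32:12.549446: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] '
    'RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = '
    '"{"created":"@1662535932.549287619","description":"Error received from peer ipv4:10.115.174.50:8470",'
    '"file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,'
    '"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC'
)

if __name__ == "__main__":
    print(parse_grpc_error(LOG_LINE))
```

Read this way, the client lost its connection to the remote worker on port 8470 (status 14, "Socket closed"). The later "Session ... is not found" errors are consistent with the remote side having gone away, although the log alone does not say why.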

System Info

JackCaoG commented 2 years ago

Sorry for the late reply

  2022-09-07 07:32:12.549446: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1662535932.549287619","description":"Error received from peer ipv4:10.115.174.50:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC

means that there was a crash on the server side, and given the log below it seems to be happening at graph compilation time. Since you are running on Colab, it is using a TPU Node, which is an older architecture, so the error message is really vague. The best way I can see to debug this issue is to run it on a TPU VM (check out https://cloud.google.com/tpu/docs/pytorch-xla-ug-tpu-vm).
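
To check which of the two setups a given session is using, a rough heuristic is to inspect the environment. The sketch below is an assumption-laden illustration, not an official API: `describe_tpu_runtime` is a hypothetical helper, and it assumes the XRT-era conventions where Colab exposes the remote worker through `COLAB_TPU_ADDR` (or an `XRT_TPU_CONFIG` starting with `tpu_worker`), while a TPU VM uses an `XRT_TPU_CONFIG` starting with `localservice`.

```python
import os
from typing import Mapping, Optional


def describe_tpu_runtime(env: Optional[Mapping[str, str]] = None) -> str:
    """Guess the TPU setup from environment variables (heuristic, hypothetical helper)."""
    env = os.environ if env is None else env
    xrt_config = env.get("XRT_TPU_CONFIG", "")
    # Assumed TPU VM convention: the XRT service runs on the same host.
    if xrt_config.startswith("localservice"):
        return "TPU VM (local XRT service)"
    # Assumed TPU Node convention: the client talks to a remote worker over gRPC.
    if env.get("COLAB_TPU_ADDR") or xrt_config.startswith("tpu_worker"):
        return "TPU Node (remote gRPC worker)"
    return "no TPU configuration found"


if __name__ == "__main__":
    print(describe_tpu_runtime())
```

If this reports a TPU Node, the server-side crash happens on a machine you cannot inspect, which is why the client only sees "Socket closed". On a TPU VM the runtime is local, so its logs are directly accessible.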