pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

Training works with nprocs = 1 but not nprocs = 8 #2937

Closed gabrielwong1991 closed 3 years ago

gabrielwong1991 commented 3 years ago

❓ Questions and Help

Hi all, I am training a transformer model on a custom dataset (~800 MB of data) on a Colab (or Kaggle) TPU. My code works when setting nprocs = 1, but with nprocs = 8 it hangs after reaching xm.mark_step() in the first step and entering the second step.

Unfortunately, I cannot run met.metrics_report() since the run never gets past the first step, though I could produce a report when nprocs = 1.
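
(For reference, the report mentioned above comes from the standard debug module, along these lines:)

import torch_xla.debug.metrics as met
print(met.metrics_report())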

I tried troubleshooting by checking whether my computation graph changes from step to step, which would force XLA to recompile before executing on the TPU devices. The model is pretty standard, and other than the dropout layer I don't see any problem at all. I also pad all sentences to the same length so the input tensor shapes don't change across batches, and I make sure I do not use .item(), if statements on tensors, or anything else that would force tensors to be passed back and forth between the host and the device.
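
For example, the padding I mean looks roughly like this, where text stands for a single tweet string (a simplified sketch, not the exact code inside my dataset class):

# Sketch: pad/truncate every sentence to a fixed MAX_LEN so the input
# tensor shapes never change from batch to batch (no recompilation needed).
encoded = tokenizer(
    text,
    padding='max_length',   # always pad to max_length, never to the longest in the batch
    truncation=True,
    max_length=MAX_LEN,
    return_tensors='pt',
)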

Below is the code related to the problem at hand. I believe the data loading part is fine, since the loop does start training even when nprocs = 8.

The model definition, pretty standard; I just modified it from some tutorials:

class TweetsClass(torch.nn.Module):
    def __init__(self):
        super(TweetsClass, self).__init__()
        self.l1 = tfm.AutoModel.from_pretrained("vinai/bertweet-base")
        self.pre_classifier = torch.nn.Linear(768, 768)
        self.dropout = torch.nn.Dropout(0.1)
        self.classifier = torch.nn.Linear(768, NUM_LABELS)

    def forward(self, input_ids, attention_mask, token_type_ids):
        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        hidden_state = output_1[0]
        pooler = hidden_state[:, 0]
        pooler = self.pre_classifier(pooler)
        pooler = torch.nn.ReLU()(pooler)
        pooler = self.dropout(pooler)
        output = self.classifier(pooler)
        return output

The training loop, also pretty standard (I print each step name to see how far the loop gets during training):

## Define how the loss is averaged across the 8 TPU cores
def reduce_fn(vals):
  # take average
  return sum(vals) / len(vals)   

# Define training loop function
def train_loop_fn(data_loader, model, optimizer, device, scheduler = None):
    tracker = xm.RateTracker()
    model.train() # Put model to training mode
    for bi, data in enumerate(data_loader):
        print("Start")
        start_time = time.time()

        print("Extract data")
        # Extract data
        ids = data['ids'].to(device, dtype = torch.long)
        mask = data['mask'].to(device, dtype = torch.long)
        token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
        targets = data['targets'].to(device, dtype = torch.long)
        # Reset the gradients
        print("Zero Grad")
        optimizer.zero_grad()

        print("Model")
        # Pass ids, mask, token_type_ids to model 
        outputs = model(ids, mask, token_type_ids)

        print("Loss")
        # Create the loss function (cross-entropy loss for multi-class classification)
        loss_function = torch.nn.CrossEntropyLoss()
        loss = loss_function(outputs, targets)   

        # Print every 20 steps
        if bi%20==0:
           # since the loss is on all 8 cores, reduce the loss values and print the average (as defined in reduce_fn)
           print('[xla:{}]({}) Loss={:.5f} Rate={:.2f} GlobalRate={:.2f} Time={}'.format(
           xm.get_ordinal(), bi, loss.item(), tracker.rate(),
           tracker.global_rate(), time.asctime()), flush=True)

        # loss_reduced = xm.mesh_reduce('loss_reduce',loss,reduce_fn) 
        # # master_print will only print once (not from all 8 cores)
        # xm.master_print(f'bi={bi}, loss={loss_reduced}')

        # Backprop ninja
        print("Backward")
        loss.backward()

        print("Step Optimizer")
        # Use PyTorch XLA optimizer stepping
        xm.optimizer_step(optimizer)

        print("Scheduler")
        if scheduler is not None:
            scheduler.step()

        # Update model params in XLA tensor
        print("Mark Step")
        xm.mark_step()

        end_time = time.time()
        print(f"Time for steps {bi}: {end_time - start_time}")

    # Set model to evaluation mode
    model.eval()
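
For reference, reduce_fn above is meant to be used with xm.mesh_reduce, as in the commented-out lines inside the loop; roughly like this (a sketch passing a plain Python float):

# Average the per-core loss value across all cores and print it once from the master.
loss_reduced = xm.mesh_reduce('loss_reduce', loss.item(), reduce_fn)
xm.master_print(f'bi={bi}, loss={loss_reduced}')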

The main map function:

def map_fn(index, flags):
    torch.set_default_tensor_type('torch.FloatTensor')
    # Sets a common random seed - both for initialization and ensuring graph is the same
    torch.manual_seed(TORCH_SEED)

    # Acquires the (unique) Cloud TPU core corresponding to this process's index
    device = xm.xla_device()  

    # Use one instance to download datasets 
    if not xm.is_master_ordinal():
        xm.rendezvous('download_only_once')

    train_dataset = pd.read_csv("train_set.csv")
    val_dataset =  pd.read_csv("dev_set.csv")

    if not xm.is_master_ordinal():
        xm.rendezvous('download_only_once')

    tokenizer = tfm.AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False, return_tensors='pt')
    # Use dataloader #
    train_set = TweetsData(train_dataset, tokenizer, MAX_LEN)
    val_set = TweetsData(val_dataset, tokenizer, MAX_LEN)

    # Training dataset loader #
    # Wrap our Class imbalance Sampler with DistributedSamplerWrapper
    train_sampler = DistributedSamplerWrapper(
        sampler=BalanceClassSampler(labels=train_dataset.label.values, mode='upsampling'),
        num_replicas=xm.xrt_world_size(),
        rank=xm.get_ordinal(),
        shuffle=True
    )

    train_loader = DataLoader(train_set,
        batch_size=TRAIN_BATCH_SIZE,
        sampler=train_sampler,
        num_workers=NUM_WORKERS_DATA,
        drop_last=True)

    val_loader = DataLoader(val_set,
        batch_size=VALID_BATCH_SIZE,
        sampler=val_sampler,
        num_workers=NUM_WORKERS_DATA,
        drop_last=True)

    # Push our neural network to TPU 
    model = TweetsClass()
    model.to(device)

    # Don't decay normalized layer
    param_optimizer = list(model.named_parameters()) # model parameters to optimize
    no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
    # apply to weight decay
    optimizer_grouped_parameters = [
      {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.001},
      {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}]

    # Create the loss function (cross-entropy loss for multi-class classification) and the optimizer (AdamW)
    loss_function = torch.nn.CrossEntropyLoss()
    optimizer = AdamW(params = optimizer_grouped_parameters , lr = LEARNING_RATE * xm.xrt_world_size())

    # Create number of training steps
    num_train_steps = int(len(train_dataset) / TRAIN_BATCH_SIZE / xm.xrt_world_size() * EPOCHS) 

    # Scheduler for optimizer for learning decay
    scheduler = get_linear_schedule_with_warmup(
      optimizer,
      num_warmup_steps=0,
      num_training_steps=num_train_steps
    )

    xm.master_print(f"Train for {len(train_dataset)} steps per epoch")
    xm.master_print(f'num_training_steps = {num_train_steps}, world_size={xm.xrt_world_size()}')

    for epoch in range(EPOCHS):
        gc.collect()
        xm.master_print(f"Starting training in epoch: {epoch}")

        ## Training Part ##
        xm.master_print("Entering training loop")
        para_train_loader = pl.ParallelLoader(train_loader, [device]).per_device_loader(device)
        gc.collect()
        # Call Training Loop
        train_loop_fn(para_train_loader, model, optimizer, device, scheduler=scheduler)
        del para_train_loader
        gc.collect()

Calling xmp.spawn:

NUM_LABELS = 3
MAX_LEN = 128
TRAIN_BATCH_SIZE = 128
VALID_BATCH_SIZE = 128
EPOCHS = 1
LEARNING_RATE = 3e-05
NUM_WORKERS_DATA = 0
TORCH_SEED = 1234
flags = {}

xmp.spawn(map_fn, args=(flags,), nprocs=8, start_method='fork')

The outputs:

num_training_steps = 49368, world_size=8
Starting training in epoch: 0
Entering training loop
Start
Extract data
Zero Grad
Model
Loss
xla:0 Loss=1.10938 Rate=0.00 GlobalRate=0.00 Time=Mon May 10 16:42:54 2021
Backward
Step Optimizer
Scheduler
Mark Step
Time for steps 0: 63.811904430389404

It hangs and never enters the second step, i.e. bi = 1. Any ideas?

Many thanks.

gabrielwong1991 commented 3 years ago

In addition, I generated a metrics report when using nprocs = 1, pausing at step 8000. Most metrics look okay (consistent with the TROUBLESHOOTING.md page); nothing there seems serious enough to make the 8-core run hang forever...

Metric: CompileTime
  TotalSamples: 4
  Accumulator: 01m11s310ms877.890us
  ValueRate: 587ms595.552us / second
  Rate: 0.032904 / second
  Percentiles: 1%=039ms677.169us; 5%=039ms677.169us; 10%=039ms677.169us; 20%=039ms677.169us; 50%=28s799ms018.834us; 80%=30s741ms459.477us; 90%=30s741ms459.477us; 95%=30s741ms459.477us; 99%=30s741ms459.477us
Metric: DeviceLockWait
  TotalSamples: 16628
  Accumulator: 47m09s263ms336.420us
  ValueRate: 713ms147.769us / second
  Rate: 4.18753 / second
  Percentiles: 1%=004.970us; 5%=005.394us; 10%=005.771us; 20%=006.294us; 50%=094.422us; 80%=343ms619.707us; 90%=343ms229.721us; 95%=344ms586.823us; 99%=344ms328.985us
Metric: ExecuteTime
  TotalSamples: 8356
  Accumulator: 48m48s278ms484.968us
  ValueRate: 728ms624.441us / second
  Rate: 2.11861 / second
  Percentiles: 1%=344ms692.878us; 5%=344ms395.605us; 10%=345ms701.419us; 20%=345ms094.356us; 50%=346ms876.655us; 80%=346ms369.038us; 90%=347ms636.245us; 95%=347ms816.508us; 99%=347ms184.226us
Metric: InboundData
  TotalSamples: 83
  Accumulator: 166.00B
  ValueRate: 0.04B / second
  Rate: 0.0211099 / second
  Percentiles: 1%=2.00B; 5%=2.00B; 10%=2.00B; 20%=2.00B; 50%=2.00B; 80%=2.00B; 90%=2.00B; 95%=2.00B; 99%=2.00B
Metric: InputOutputAliasCount
  TotalSamples: 4
  Accumulator: 1819.00
  ValueRate: 19.81 / second
  Rate: 0.0435521 / second
  Percentiles: 1%=1.00; 5%=1.00; 10%=1.00; 20%=1.00; 50%=807.00; 80%=807.00; 90%=807.00; 95%=807.00; 99%=807.00
Metric: IrValueTensorToXlaData
  TotalSamples: 204
  Accumulator: 01s403ms336.997us
  ValueRate: 02s642ms938.688us / second
  Rate: 238.685 / second
  Percentiles: 1%=991.060us; 5%=001ms052.837us; 10%=001ms096.129us; 20%=001ms280.126us; 50%=002ms672.283us; 80%=005ms914.610us; 90%=013ms700.124us; 95%=013ms052.544us; 99%=019ms185.107us
Metric: OutboundData
  TotalSamples: 27092
  Accumulator: 3.29GB
  ValueRate: 797.07KB / second
  Rate: 6.72177 / second
  Percentiles: 1%=2.00B; 5%=2.00B; 10%=2.00B; 20%=2.00B; 50%=2.00B; 80%=8.00B; 90%=8.00B; 95%=1.50MB; 99%=1.50MB
Metric: ReleaseDataHandlesTime
  TotalSamples: 636100
  Accumulator: 08m28s990ms393.613us
  ValueRate: 137ms796.870us / second
  Rate: 161.519 / second
  Percentiles: 1%=410.937us; 5%=472.362us; 10%=520.272us; 20%=574.704us; 50%=751.735us; 80%=001ms049.580us; 90%=001ms281.355us; 95%=002ms530.905us; 99%=002ms304.634us
Metric: TensorsGraphSize
  TotalSamples: 8356
  Accumulator: 88268384.00
  ValueRate: 22385.64 / second
  Rate: 2.11862 / second
  Percentiles: 1%=10648.00; 5%=10648.00; 10%=10648.00; 20%=10648.00; 50%=10648.00; 80%=10648.00; 90%=10648.00; 95%=10648.00; 99%=10648.00
Metric: TransferFromServerTime
  TotalSamples: 83
  Accumulator: 200ms838.969us
  ValueRate: 050.826us / second
  Rate: 0.0211099 / second
  Percentiles: 1%=001ms295.281us; 5%=001ms474.373us; 10%=002ms525.212us; 20%=002ms704.043us; 50%=002ms209.365us; 80%=002ms476.037us; 90%=003ms707.637us; 95%=006ms506.653us; 99%=006ms356.241us
Metric: TransferToServerTime
  TotalSamples: 27092
  Accumulator: 01m02s637ms749.132us
  ValueRate: 016ms541.418us / second
  Rate: 6.72181 / second
  Percentiles: 1%=969.059us; 5%=001ms117.791us; 10%=001ms227.731us; 20%=001ms402.467us; 50%=002ms867.552us; 80%=002ms271.086us; 90%=003ms549.381us; 95%=008ms922.102us; 99%=009ms467.359us
Metric: TransferToServerTransformTime
  TotalSamples: 27092
  Accumulator: 06s347ms777.563us
  ValueRate: 002ms618.820us / second
  Rate: 6.72177 / second
  Percentiles: 1%=041.313us; 5%=055.183us; 10%=061.279us; 20%=079.854us; 50%=154.392us; 80%=229.901us; 90%=307.764us; 95%=960.086us; 99%=002ms902.872us
Counter: CachedCompile
  Value: 8352
Counter: CreateCompileHandles
  Value: 4
Counter: CreateDataHandles
  Value: 6733933
Counter: CreateXlaTensor
  Value: 26457248
Counter: DestroyDataHandles
  Value: 6733166
Counter: DestroyXlaTensor
  Value: 26456405
Counter: DeviceDataCacheMiss
  Value: 16545
Counter: MarkStep
  Value: 16545
Counter: ReleaseDataHandles
  Value: 6733166
Counter: UncachedCompile
  Value: 4
Counter: XRTAllocateFromTensor_Empty
  Value: 29
Counter: XrtCompile_Empty
  Value: 128
Counter: XrtExecuteChained_Empty
  Value: 128
Counter: XrtExecute_Empty
  Value: 128
Counter: XrtMemoryInfo_Empty
  Value: 128
Counter: XrtRead_Empty
  Value: 128
Counter: XrtReleaseAllocationHandle_Empty
  Value: 128
Counter: XrtReleaseCompileHandle_Empty
  Value: 128
Counter: XrtSessionCount
  Value: 9
Counter: XrtSubTuple_Empty
  Value: 128
Counter: aten::_local_scalar_dense
  Value: 83
Counter: xla::_copy_from
  Value: 49842
Counter: xla::_log_softmax
  Value: 8273
Counter: xla::_log_softmax_backward_data
  Value: 8273
Counter: xla::_softmax
  Value: 99276
Counter: xla::_softmax_backward_data
  Value: 99276
Counter: xla::_unsafe_view
  Value: 794208
Counter: xla::add
  Value: 719751
Counter: xla::add_
  Value: 6229008
Counter: xla::addcdiv_
  Value: 1662712
Counter: xla::addcmul
  Value: 206825
Counter: xla::addcmul_
  Value: 1662712
Counter: xla::addmm
  Value: 24819
Counter: xla::as_strided
  Value: 204
Counter: xla::bernoulli_
  Value: 314374
Counter: xla::bmm
  Value: 595656
Counter: xla::cumsum
  Value: 8273
Counter: xla::div
  Value: 198552
Counter: xla::div_
  Value: 314374
Counter: xla::embedding
  Value: 24819
Counter: xla::embedding_dense_backward
  Value: 24819
Counter: xla::empty
  Value: 372891
Counter: xla::empty_strided
  Value: 204
Counter: xla::expand
  Value: 397104
Counter: xla::fill_
  Value: 8273
Counter: xla::gelu
  Value: 99276
Counter: xla::gelu_backward
  Value: 99276
Counter: xla::index_select
  Value: 24819
Counter: xla::mm
  Value: 1820060
Counter: xla::mul
  Value: 1472594
Counter: xla::mul_
  Value: 3325424
Counter: xla::native_batch_norm
  Value: 206825
Counter: xla::native_batch_norm_backward
  Value: 206825
Counter: xla::ne
  Value: 8273
Counter: xla::nll_loss_backward
  Value: 8273
Counter: xla::nll_loss_forward
  Value: 8273
Counter: xla::permute
  Value: 794208
Counter: xla::relu
  Value: 8273
Counter: xla::rsub
  Value: 8273
Counter: xla::select
  Value: 24819
Counter: xla::slice
  Value: 41365
Counter: xla::sqrt
  Value: 1662712
Counter: xla::sum
  Value: 1025852
Counter: xla::t
  Value: 2457081
Counter: xla::tanh
  Value: 8273
Counter: xla::threshold_backward
  Value: 8273
Counter: xla::transpose
  Value: 595656
Counter: xla::unsqueeze
  Value: 16546
Counter: xla::view
  Value: 6684584
Counter: xla::zero_
  Value: 1679620
Metric: XrtAllocateFromTensor
  TotalSamples: 62932
  Accumulator: 01m19s119ms463.458us
  Mean: 001ms262.476us
  StdDev: 645.279us
  Rate: 14.684 / second
  Percentiles: 25%=519.138us; 50%=001ms405.154us; 80%=002ms898.731us; 90%=002ms020.795us; 95%=002ms111.842us; 99%=002ms285.579us
Metric: XrtCompile
  TotalSamples: 8
  Accumulator: 02m13s859ms685.652us
  Mean: 17s607ms335.706us
  StdDev: 11s120ms475.388us
  Rate: 0.0102895 / second
  Percentiles: 25%=11s993ms628.971us; 50%=24s176ms871.533us; 80%=28s707ms958.173us; 90%=30s659ms194.501us; 95%=30s659ms194.501us; 99%=30s659ms194.501us
Metric: XrtExecute
  TotalSamples: 9240
  Accumulator: 49m34s163ms320.514us
  Mean: 341ms831.782us
  StdDev: 023ms447.833us
  Rate: 2.1186 / second
  Percentiles: 25%=343ms699.161us; 50%=343ms312.196us; 80%=344ms698.420us; 90%=344ms864.972us; 95%=344ms998.002us; 99%=344ms273.679us
Metric: XrtExecutorEvict
  TotalSamples: 0
  Accumulator: nanB
  Mean: nanB
  StdDev: nanB
  Percentiles: 
Metric: XrtReadLiteral
  TotalSamples: 380
  Accumulator: 178ms350.316us
  Mean: 469.343us
  StdDev: 103.655us
  Rate: 0.0828029 / second
  Percentiles: 25%=384.554us; 50%=470.285us; 80%=554.894us; 90%=604.092us; 95%=642.591us; 99%=730.619us
Metric: XrtReleaseAllocation
  TotalSamples: 675400
  Accumulator: 01m06s013ms726.930us
  Mean: 106.214us
  StdDev: 217.226us
  Rate: 161.513 / second
  Percentiles: 25%=016.334us; 50%=029.779us; 80%=105.503us; 90%=255.405us; 95%=495.802us; 99%=001ms177.625us
jysohn23 commented 3 years ago

Like you say, on a single core it looks like training is working fine and performing well.

On the 8 core setup, could you try re-running with the following environment variables?

PT_XLA_DEBUG=1 TF_CPP_LOG_THREAD_ID=1 TF_CPP_MIN_LOG_LEVEL=0 TF_CPP_VMODULE=tensor=5,computation_client=5,xrt_computation_client=5,aten_xla_type=1

That should tell us exactly what is happening within our framework when it's hanging.
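
If you're setting these from a notebook, here is a sketch of doing it before torch_xla gets imported (same values as the line above):

import os
os.environ['PT_XLA_DEBUG'] = '1'
os.environ['TF_CPP_LOG_THREAD_ID'] = '1'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '0'
os.environ['TF_CPP_VMODULE'] = 'tensor=5,computation_client=5,xrt_computation_client=5,aten_xla_type=1'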

Also, I noticed that the data_loader in your train_loop_fn is already a pl.ParallelLoader per-device loader, so you don't need to manually call xm.mark_step().
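
A minimal sketch of the end of the step without the manual barrier (assuming the loader comes from pl.ParallelLoader(...).per_device_loader(device) as in your map_fn):

loss.backward()
xm.optimizer_step(optimizer)   # all-reduce the gradients and apply the optimizer update
if scheduler is not None:
    scheduler.step()
# no explicit xm.mark_step() here: the per-device loader issues one after each batch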

gabrielwong1991 commented 3 years ago

Many thanks, @jysohn23! I implemented your suggestion, and as a baseline I confirmed it works when nprocs = 1. However, nprocs = 8 still hangs. This time, removing xm.mark_step() allows it to move on to the second step, but it gets stuck when calling the model: outputs = model(ids, mask, token_type_ids)

I expected that enabling the environment variables would produce some logs from the interpreter?

os.environ['PT_XLA_DEBUG']="1"
os.environ['TF_CPP_LOG_THREAD_ID']="1"
os.environ['TF_CPP_MIN_LOG_LEVEL']="0"
os.environ['TF_CPP_VMODULE']="tensor=5"
os.environ['computation_client']="5"
os.environ['xrt_computation_client']="5"
os.environ['aten_xla_type']="1"

(screenshot attached)

Thanks!

gabrielwong1991 commented 3 years ago

Hi! I have finally been able to pull information out of the enabled environment variables by running the script as a .py instead of in an .ipynb, like: !PT_XLA_DEBUG=1 TF_CPP_LOG_THREAD_ID=1 TF_CPP_MIN_LOG_LEVEL=0 TF_CPP_VMODULE=tensor=5 computation_client=5 xrt_computation_client=5 aten_xla_type=1 python -u train.py

Here are the results, though they don't seem very informative:

2021-05-17 23:29:57.812187: I    3547 tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:root:TPU has started up successfully with version pytorch-1.8.1
2021-05-17 23:30:00.692445: I    3560 tensorflow/compiler/xla/xla_client/mesh_service.cc:312] Waiting to connect to client mesh master (300 seconds) b0f6c9bce863:53413
2021-05-17 23:30:00.706905: I    3564 tensorflow/compiler/xla/xla_client/mesh_service.cc:312] Waiting to connect to client mesh master (300 seconds) b0f6c9bce863:53413
2021-05-17 23:30:00.713162: I    3570 tensorflow/compiler/xla/xla_client/mesh_service.cc:312] Waiting to connect to client mesh master (300 seconds) b0f6c9bce863:53413
2021-05-17 23:30:00.724224: I    3576 tensorflow/compiler/xla/xla_client/mesh_service.cc:312] Waiting to connect to client mesh master (300 seconds) b0f6c9bce863:53413
2021-05-17 23:30:00.734355: I    3580 tensorflow/compiler/xla/xla_client/mesh_service.cc:312] Waiting to connect to client mesh master (300 seconds) b0f6c9bce863:53413
2021-05-17 23:30:00.744493: I    3584 tensorflow/compiler/xla/xla_client/mesh_service.cc:312] Waiting to connect to client mesh master (300 seconds) b0f6c9bce863:53413
2021-05-17 23:30:00.754056: I    3588 tensorflow/compiler/xla/xla_client/mesh_service.cc:312] Waiting to connect to client mesh master (300 seconds) b0f6c9bce863:53413
2021-05-17 23:30:25.099310: I    3576 tensorflow/compiler/xla/xla_client/computation_client.cc:203] Fetching mesh configuration for worker tpu_worker (host_ordinal=0):0 from mesh service at b0f6c9bce863:53413
2021-05-17 23:30:26.269124: I    3570 tensorflow/compiler/xla/xla_client/computation_client.cc:203] Fetching mesh configuration for worker tpu_worker (host_ordinal=0):0 from mesh service at b0f6c9bce863:53413
2021-05-17 23:30:26.974932: I    3564 tensorflow/compiler/xla/xla_client/computation_client.cc:203] Fetching mesh configuration for worker tpu_worker (host_ordinal=0):0 from mesh service at b0f6c9bce863:53413
2021-05-17 23:30:27.845523: I    3584 tensorflow/compiler/xla/xla_client/computation_client.cc:203] Fetching mesh configuration for worker tpu_worker (host_ordinal=0):0 from mesh service at b0f6c9bce863:53413
2021-05-17 23:30:28.775142: I    3588 tensorflow/compiler/xla/xla_client/computation_client.cc:203] Fetching mesh configuration for worker tpu_worker (host_ordinal=0):0 from mesh service at b0f6c9bce863:53413
2021-05-17 23:30:28.874595: I    3560 tensorflow/compiler/xla/xla_client/computation_client.cc:203] Fetching mesh configuration for worker tpu_worker (host_ordinal=0):0 from mesh service at b0f6c9bce863:53413
2021-05-17 23:30:29.900606: I    3580 tensorflow/compiler/xla/xla_client/computation_client.cc:203] Fetching mesh configuration for worker tpu_worker (host_ordinal=0):0 from mesh service at b0f6c9bce863:53413
2021-05-17 23:30:37.928487: I    3559 tensorflow/core/profiler/rpc/profiler_server.cc:46] Profiler server listening on [::]:9012 selected port:9012
2021-05-17 23:30:46.457900: I    3559 torch_xla/csrc/tensor_util.cpp:28] Using BF16 data type for floating point values
Train for 12638343 steps per epoch
num_training_steps = 49368, world_size=8
Starting training in epoch: 0
Entering training loop
2021-05-17 23:30:51.221439: I    3559 torch_xla/csrc/tensor.cpp:1373] 204 live tensors: devices=()
2021-05-17 23:30:51.221548: I    3559 torch_xla/csrc/tensor.cpp:1122] Waiting on device barrier for device TPU:0 ...
2021-05-17 23:30:51.221591: I    3559 torch_xla/csrc/tensor.cpp:1128] Waiting on device barrier for device TPU:0 done!
2021-05-17 23:30:51.272911: I    3559 torch_xla/csrc/tensor.cpp:1168] Tensors graph hash f8539b4c16d32784d76c0592dedb929d on device TPU:0
2021-05-17 23:30:51.274962: I    3559 torch_xla/csrc/tensor.cpp:1592] Parameter sequence graph hash e9be56370f81f759e99c0d5c7da37fa2
2021-05-17 23:30:51.357767: I    3559 torch_xla/csrc/tensor.cpp:1556] Compiling IR graph hash e9be56370f81f759e99c0d5c7da37fa2 on device TPU:0 ...
2021-05-17 23:30:51.404462: I    3559 torch_xla/csrc/tensor.cpp:1561] Compiling IR graph hash e9be56370f81f759e99c0d5c7da37fa2 on device TPU:0 done!
2021-05-17 23:30:51.405821: I    3559 torch_xla/csrc/tensor.cpp:1563] Graph hash e9be56370f81f759e99c0d5c7da37fa2 is computation hash e622f901bc78b50e06778eb7026a41da
2021-05-17 23:30:51.406392: I    3559 torch_xla/csrc/tensor.cpp:1602] TensorsGraphSize=1020
2021-05-17 23:30:51.410112: I    3693 torch_xla/csrc/tensor.cpp:1298] Executing IR graph hash e9be56370f81f759e99c0d5c7da37fa2 on device TPU:0 ...
2021-05-17 23:30:51.419282: I    3693 torch_xla/csrc/tensor.cpp:1303] Executing IR graph hash e9be56370f81f759e99c0d5c7da37fa2 on device TPU:0 done!
Start
Extract data
Zero Grad
Model
Loss
2021-05-17 23:31:24.800329: I    3559 torch_xla/csrc/tensor.cpp:1122] Waiting on device barrier for device TPU:0 ...
2021-05-17 23:31:24.800419: I    3559 torch_xla/csrc/tensor.cpp:1128] Waiting on device barrier for device TPU:0 done!
2021-05-17 23:31:24.800470: I    3559 torch_xla/csrc/tensor.cpp:1168] Tensors graph hash 2c69fbdc154c900d1b019feaa507cd74 on device TPU:0
2021-05-17 23:31:24.801801: I    3559 torch_xla/csrc/tensor.cpp:1592] Parameter sequence graph hash 8e5c6ca6d938cff3d4dcb32fbcffb22e
2021-05-17 23:31:24.914196: I    3559 torch_xla/csrc/tensor.cpp:1556] Compiling IR graph hash 8e5c6ca6d938cff3d4dcb32fbcffb22e on device TPU:0 ...
2021-05-17 23:31:24.959115: I    3559 torch_xla/csrc/tensor.cpp:1561] Compiling IR graph hash 8e5c6ca6d938cff3d4dcb32fbcffb22e on device TPU:0 done!
2021-05-17 23:31:24.961928: I    3559 torch_xla/csrc/tensor.cpp:1563] Graph hash 8e5c6ca6d938cff3d4dcb32fbcffb22e is computation hash 986ddacef03810dea5d00e9ec233f56
2021-05-17 23:31:24.962524: I    3559 torch_xla/csrc/tensor.cpp:1602] TensorsGraphSize=2273
2021-05-17 23:31:24.962581: I    3693 torch_xla/csrc/tensor.cpp:1298] Executing IR graph hash 8e5c6ca6d938cff3d4dcb32fbcffb22e on device TPU:0 ...
2021-05-17 23:31:25.021284: I    3693 torch_xla/csrc/tensor.cpp:1303] Executing IR graph hash 8e5c6ca6d938cff3d4dcb32fbcffb22e on device TPU:0 done!
[xla:0](0) Loss=1.28125 Rate=0.00 GlobalRate=0.00 Time=Mon May 17 23:31:25 2021
Backward
2021-05-17 23:31:25.059904: I    3559 torch_xla/csrc/tensor.cpp:1373] 439 live tensors: devices=()
2021-05-17 23:31:25.059993: I    3559 torch_xla/csrc/tensor.cpp:1122] Waiting on device barrier for device TPU:0 ...
2021-05-17 23:31:25.060011: I    3559 torch_xla/csrc/tensor.cpp:1128] Waiting on device barrier for device TPU:0 done!
2021-05-17 23:31:25.061081: I    3559 torch_xla/csrc/tensor.cpp:1168] Tensors graph hash bd01710a26633b903f90743d92f0327d on device TPU:0
2021-05-17 23:31:25.062721: I    3559 torch_xla/csrc/tensor.cpp:1592] Parameter sequence graph hash dd5421c93af5a320b6606bceae91e861
2021-05-17 23:31:25.132474: I    3559 torch_xla/csrc/tensor.cpp:1556] Compiling IR graph hash dd5421c93af5a320b6606bceae91e861 on device TPU:0 ...
2021-05-17 23:31:25.206239: I    3559 torch_xla/csrc/tensor.cpp:1561] Compiling IR graph hash dd5421c93af5a320b6606bceae91e861 on device TPU:0 done!
2021-05-17 23:31:25.211813: I    3559 torch_xla/csrc/tensor.cpp:1563] Graph hash dd5421c93af5a320b6606bceae91e861 is computation hash ae93d9c8850271bb1f23a071977519e0
2021-05-17 23:31:25.213132: I    3559 torch_xla/csrc/tensor.cpp:1602] TensorsGraphSize=4522
2021-05-17 23:31:25.216170: I    3693 torch_xla/csrc/tensor.cpp:1298] Executing IR graph hash dd5421c93af5a320b6606bceae91e861 on device TPU:0 ...
Step Optimizer
2021-05-17 23:31:25.384339: I    3693 torch_xla/csrc/tensor.cpp:1303] Executing IR graph hash dd5421c93af5a320b6606bceae91e861 on device TPU:0 done!
Scheduler
Start
Extract data
Zero Grad
Model
Loss
2021-05-17 23:31:25.448908: I    3559 torch_xla/csrc/tensor.cpp:1122] Waiting on device barrier for device TPU:0 ...
2021-05-17 23:31:25.448985: I    3559 torch_xla/csrc/tensor.cpp:1128] Waiting on device barrier for device TPU:0 done!
2021-05-17 23:31:25.449014: I    3559 torch_xla/csrc/tensor.cpp:1168] Tensors graph hash c87bb48399127cedece4ca286d0e1ed2 on device TPU:0
2021-05-17 23:31:25.452055: I    3559 torch_xla/csrc/tensor.cpp:1592] Parameter sequence graph hash 83f75c15d76c2047dda16e371c8f7416
2021-05-17 23:31:25.520943: I    3559 torch_xla/csrc/tensor.cpp:1556] Compiling IR graph hash 83f75c15d76c2047dda16e371c8f7416 on device TPU:0 ...
2021-05-17 23:31:25.590626: I    3559 torch_xla/csrc/tensor.cpp:1561] Compiling IR graph hash 83f75c15d76c2047dda16e371c8f7416 on device TPU:0 done!
2021-05-17 23:31:25.597732: I    3559 torch_xla/csrc/tensor.cpp:1563] Graph hash 83f75c15d76c2047dda16e371c8f7416 is computation hash de01864c3fcb3971fbaa97117e5af3c9
2021-05-17 23:31:25.599633: I    3559 torch_xla/csrc/tensor.cpp:1602] TensorsGraphSize=7809
2021-05-17 23:31:25.599766: I    3693 torch_xla/csrc/tensor.cpp:1298] Executing IR graph hash 83f75c15d76c2047dda16e371c8f7416 on device TPU:0 ...

It is stuck at the second step, at the loss computation... Is there anything more I can provide? Many thanks.

jysohn23 commented 3 years ago

OK, thanks for the info! It looks like the graph execution is hanging, as you can see in the last log line. So that we can inspect it, could you dump the graph that is hanging on execution by setting XLA_SAVE_TENSORS_FILE=path/to/file.hlo XLA_SAVE_TENSORS_FMT=hlo XLA_IR_DEBUG=1 XLA_HLO_DEBUG=1?

What is weird, though, is that I don't see logs printed after "Loss"; I would expect execution logs after "Backward" and "Mark Step" are logged as well.

gabrielwong1991 commented 3 years ago

Hi, this is the graph dump for the first step; since the run never finishes the second step, I think that graph was not saved.

tensor_dump.hlo.zip

I am trying to make the script accessible so that someone can reproduce the problem.

Thanks!

gabrielwong1991 commented 3 years ago

Hi, I have made the notebook accessible; it should work straight away. Thanks.

https://colab.research.google.com/drive/1cOiQjaElTXsdxNVDhSMfvauibyNzkImW?usp=sharing

gabrielwong1991 commented 3 years ago

Hi... Is there anything I can do to understand the issue more? Thanks

jysohn23 commented 3 years ago

Hi sorry for the slow response, running it myself currently.

You can also view the debug logging you turned on under Runtime > View Runtime Logs (screenshot attached).

Also, given that you're training with the HF transformers library, have you given the trainer.py-based scripts a try? We test the trainer-based examples nightly, so those should be less likely to break. Here is an example command we run in nightly testing for a classification fine-tuning task:

python examples/pytorch/xla_spawn.py \
  --num_cores 8 \
  examples/pytorch/text-classification/run_glue.py \
    --logging_dir=./tensorboard-metrics \
    --task_name MNLI \
    --cache_dir ./cache_dir \
    --do_train \
    --do_eval \
    --num_train_epochs 3 \
    --max_seq_length 128 \
    --learning_rate 3e-5 \
    --output_dir MNLI \
    --overwrite_output_dir \
    --logging_steps 30 \
    --save_steps 3000 \
    --overwrite_cache \
    --tpu_metrics_debug \
    --model_name_or_path roberta-large \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16
gabrielwong1991 commented 3 years ago

Thanks, Daniel, I will have a look at the trainer.py script. The reason for writing this script myself is to learn the underlying nuts and bolts of the task at hand rather than calling the higher-level trainer.py to do the job.

Anyhow, apparently there is something going on inside the TPU setup that is beyond my expertise, since this script works with 1 core.

Please look into this issue when you have time, as I am curious why this happens.

I shall try trainer.py now. Thanks again.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.