In addition, I generated a metrics report with nprocs = 1, pausing at step 8000. Most metrics look okay (they coincide with the TROUBLESHOOTING.md page) - nothing that would explain hanging forever when using 8 TPU cores...
Metric: CompileTime
TotalSamples: 4
Accumulator: 01m11s310ms877.890us
ValueRate: 587ms595.552us / second
Rate: 0.032904 / second
Percentiles: 1%=039ms677.169us; 5%=039ms677.169us; 10%=039ms677.169us; 20%=039ms677.169us; 50%=28s799ms018.834us; 80%=30s741ms459.477us; 90%=30s741ms459.477us; 95%=30s741ms459.477us; 99%=30s741ms459.477us
Metric: DeviceLockWait
TotalSamples: 16628
Accumulator: 47m09s263ms336.420us
ValueRate: 713ms147.769us / second
Rate: 4.18753 / second
Percentiles: 1%=004.970us; 5%=005.394us; 10%=005.771us; 20%=006.294us; 50%=094.422us; 80%=343ms619.707us; 90%=343ms229.721us; 95%=344ms586.823us; 99%=344ms328.985us
Metric: ExecuteTime
TotalSamples: 8356
Accumulator: 48m48s278ms484.968us
ValueRate: 728ms624.441us / second
Rate: 2.11861 / second
Percentiles: 1%=344ms692.878us; 5%=344ms395.605us; 10%=345ms701.419us; 20%=345ms094.356us; 50%=346ms876.655us; 80%=346ms369.038us; 90%=347ms636.245us; 95%=347ms816.508us; 99%=347ms184.226us
Metric: InboundData
TotalSamples: 83
Accumulator: 166.00B
ValueRate: 0.04B / second
Rate: 0.0211099 / second
Percentiles: 1%=2.00B; 5%=2.00B; 10%=2.00B; 20%=2.00B; 50%=2.00B; 80%=2.00B; 90%=2.00B; 95%=2.00B; 99%=2.00B
Metric: InputOutputAliasCount
TotalSamples: 4
Accumulator: 1819.00
ValueRate: 19.81 / second
Rate: 0.0435521 / second
Percentiles: 1%=1.00; 5%=1.00; 10%=1.00; 20%=1.00; 50%=807.00; 80%=807.00; 90%=807.00; 95%=807.00; 99%=807.00
Metric: IrValueTensorToXlaData
TotalSamples: 204
Accumulator: 01s403ms336.997us
ValueRate: 02s642ms938.688us / second
Rate: 238.685 / second
Percentiles: 1%=991.060us; 5%=001ms052.837us; 10%=001ms096.129us; 20%=001ms280.126us; 50%=002ms672.283us; 80%=005ms914.610us; 90%=013ms700.124us; 95%=013ms052.544us; 99%=019ms185.107us
Metric: OutboundData
TotalSamples: 27092
Accumulator: 3.29GB
ValueRate: 797.07KB / second
Rate: 6.72177 / second
Percentiles: 1%=2.00B; 5%=2.00B; 10%=2.00B; 20%=2.00B; 50%=2.00B; 80%=8.00B; 90%=8.00B; 95%=1.50MB; 99%=1.50MB
Metric: ReleaseDataHandlesTime
TotalSamples: 636100
Accumulator: 08m28s990ms393.613us
ValueRate: 137ms796.870us / second
Rate: 161.519 / second
Percentiles: 1%=410.937us; 5%=472.362us; 10%=520.272us; 20%=574.704us; 50%=751.735us; 80%=001ms049.580us; 90%=001ms281.355us; 95%=002ms530.905us; 99%=002ms304.634us
Metric: TensorsGraphSize
TotalSamples: 8356
Accumulator: 88268384.00
ValueRate: 22385.64 / second
Rate: 2.11862 / second
Percentiles: 1%=10648.00; 5%=10648.00; 10%=10648.00; 20%=10648.00; 50%=10648.00; 80%=10648.00; 90%=10648.00; 95%=10648.00; 99%=10648.00
Metric: TransferFromServerTime
TotalSamples: 83
Accumulator: 200ms838.969us
ValueRate: 050.826us / second
Rate: 0.0211099 / second
Percentiles: 1%=001ms295.281us; 5%=001ms474.373us; 10%=002ms525.212us; 20%=002ms704.043us; 50%=002ms209.365us; 80%=002ms476.037us; 90%=003ms707.637us; 95%=006ms506.653us; 99%=006ms356.241us
Metric: TransferToServerTime
TotalSamples: 27092
Accumulator: 01m02s637ms749.132us
ValueRate: 016ms541.418us / second
Rate: 6.72181 / second
Percentiles: 1%=969.059us; 5%=001ms117.791us; 10%=001ms227.731us; 20%=001ms402.467us; 50%=002ms867.552us; 80%=002ms271.086us; 90%=003ms549.381us; 95%=008ms922.102us; 99%=009ms467.359us
Metric: TransferToServerTransformTime
TotalSamples: 27092
Accumulator: 06s347ms777.563us
ValueRate: 002ms618.820us / second
Rate: 6.72177 / second
Percentiles: 1%=041.313us; 5%=055.183us; 10%=061.279us; 20%=079.854us; 50%=154.392us; 80%=229.901us; 90%=307.764us; 95%=960.086us; 99%=002ms902.872us
Counter: CachedCompile
Value: 8352
Counter: CreateCompileHandles
Value: 4
Counter: CreateDataHandles
Value: 6733933
Counter: CreateXlaTensor
Value: 26457248
Counter: DestroyDataHandles
Value: 6733166
Counter: DestroyXlaTensor
Value: 26456405
Counter: DeviceDataCacheMiss
Value: 16545
Counter: MarkStep
Value: 16545
Counter: ReleaseDataHandles
Value: 6733166
Counter: UncachedCompile
Value: 4
Counter: XRTAllocateFromTensor_Empty
Value: 29
Counter: XrtCompile_Empty
Value: 128
Counter: XrtExecuteChained_Empty
Value: 128
Counter: XrtExecute_Empty
Value: 128
Counter: XrtMemoryInfo_Empty
Value: 128
Counter: XrtRead_Empty
Value: 128
Counter: XrtReleaseAllocationHandle_Empty
Value: 128
Counter: XrtReleaseCompileHandle_Empty
Value: 128
Counter: XrtSessionCount
Value: 9
Counter: XrtSubTuple_Empty
Value: 128
Counter: aten::_local_scalar_dense
Value: 83
Counter: xla::_copy_from
Value: 49842
Counter: xla::_log_softmax
Value: 8273
Counter: xla::_log_softmax_backward_data
Value: 8273
Counter: xla::_softmax
Value: 99276
Counter: xla::_softmax_backward_data
Value: 99276
Counter: xla::_unsafe_view
Value: 794208
Counter: xla::add
Value: 719751
Counter: xla::add_
Value: 6229008
Counter: xla::addcdiv_
Value: 1662712
Counter: xla::addcmul
Value: 206825
Counter: xla::addcmul_
Value: 1662712
Counter: xla::addmm
Value: 24819
Counter: xla::as_strided
Value: 204
Counter: xla::bernoulli_
Value: 314374
Counter: xla::bmm
Value: 595656
Counter: xla::cumsum
Value: 8273
Counter: xla::div
Value: 198552
Counter: xla::div_
Value: 314374
Counter: xla::embedding
Value: 24819
Counter: xla::embedding_dense_backward
Value: 24819
Counter: xla::empty
Value: 372891
Counter: xla::empty_strided
Value: 204
Counter: xla::expand
Value: 397104
Counter: xla::fill_
Value: 8273
Counter: xla::gelu
Value: 99276
Counter: xla::gelu_backward
Value: 99276
Counter: xla::index_select
Value: 24819
Counter: xla::mm
Value: 1820060
Counter: xla::mul
Value: 1472594
Counter: xla::mul_
Value: 3325424
Counter: xla::native_batch_norm
Value: 206825
Counter: xla::native_batch_norm_backward
Value: 206825
Counter: xla::ne
Value: 8273
Counter: xla::nll_loss_backward
Value: 8273
Counter: xla::nll_loss_forward
Value: 8273
Counter: xla::permute
Value: 794208
Counter: xla::relu
Value: 8273
Counter: xla::rsub
Value: 8273
Counter: xla::select
Value: 24819
Counter: xla::slice
Value: 41365
Counter: xla::sqrt
Value: 1662712
Counter: xla::sum
Value: 1025852
Counter: xla::t
Value: 2457081
Counter: xla::tanh
Value: 8273
Counter: xla::threshold_backward
Value: 8273
Counter: xla::transpose
Value: 595656
Counter: xla::unsqueeze
Value: 16546
Counter: xla::view
Value: 6684584
Counter: xla::zero_
Value: 1679620
Metric: XrtAllocateFromTensor
TotalSamples: 62932
Accumulator: 01m19s119ms463.458us
Mean: 001ms262.476us
StdDev: 645.279us
Rate: 14.684 / second
Percentiles: 25%=519.138us; 50%=001ms405.154us; 80%=002ms898.731us; 90%=002ms020.795us; 95%=002ms111.842us; 99%=002ms285.579us
Metric: XrtCompile
TotalSamples: 8
Accumulator: 02m13s859ms685.652us
Mean: 17s607ms335.706us
StdDev: 11s120ms475.388us
Rate: 0.0102895 / second
Percentiles: 25%=11s993ms628.971us; 50%=24s176ms871.533us; 80%=28s707ms958.173us; 90%=30s659ms194.501us; 95%=30s659ms194.501us; 99%=30s659ms194.501us
Metric: XrtExecute
TotalSamples: 9240
Accumulator: 49m34s163ms320.514us
Mean: 341ms831.782us
StdDev: 023ms447.833us
Rate: 2.1186 / second
Percentiles: 25%=343ms699.161us; 50%=343ms312.196us; 80%=344ms698.420us; 90%=344ms864.972us; 95%=344ms998.002us; 99%=344ms273.679us
Metric: XrtExecutorEvict
TotalSamples: 0
Accumulator: nanB
Mean: nanB
StdDev: nanB
Percentiles:
Metric: XrtReadLiteral
TotalSamples: 380
Accumulator: 178ms350.316us
Mean: 469.343us
StdDev: 103.655us
Rate: 0.0828029 / second
Percentiles: 25%=384.554us; 50%=470.285us; 80%=554.894us; 90%=604.092us; 95%=642.591us; 99%=730.619us
Metric: XrtReleaseAllocation
TotalSamples: 675400
Accumulator: 01m06s013ms726.930us
Mean: 106.214us
StdDev: 217.226us
Rate: 161.513 / second
Percentiles: 25%=016.334us; 50%=029.779us; 80%=105.503us; 90%=255.405us; 95%=495.802us; 99%=001ms177.625us
Like you say, on a single core it looks like training is working fine and performing well.
On the 8 core setup, could you try re-running with the following environment variables?
PT_XLA_DEBUG=1 TF_CPP_LOG_THREAD_ID=1 TF_CPP_MIN_LOG_LEVEL=0 TF_CPP_VMODULE=tensor=5,computation_client=5,xrt_computation_client=5,aten_xla_type=1
That should tell us exactly what is happening within our framework when it's hanging.
Also, I noticed that data_loader in your train loop fn is already a pl.ParallelLoader, so you don't need to manually do:
ids = data['ids'].to(device, dtype = torch.long)
and similar for the other data too. xm.mark_step() is not needed either, as pl.ParallelLoader implicitly does this for every batch that it yields, as described in our API_GUIDE.md:
https://github.com/pytorch/xla/blob/6d91a3b72b731a304d639562724eaf4968655728/torch_xla/distributed/parallel_loader.py#L46
Instead of pl.ParallelLoader, you can simply use:
https://github.com/pytorch/xla/blob/6d91a3b72b731a304d639562724eaf4968655728/torch_xla/distributed/parallel_loader.py#L171
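For illustration, here is a minimal sketch of that usage (assuming the second link points at the MpDeviceLoader wrapper; train_loader, model, optimizer, and loss_fn are placeholder names):
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

device = xm.xla_device()
mp_loader = pl.MpDeviceLoader(train_loader, device)  # wraps a regular DataLoader
for data in mp_loader:
    # batches already live on the XLA device, so no manual .to(device) calls
    optimizer.zero_grad()
    outputs = model(data['ids'], data['mask'], data['token_type_ids'])
    loss = loss_fn(outputs, data['targets'])
    loss.backward()
    xm.optimizer_step(optimizer)
    # MpDeviceLoader issues xm.mark_step() for every batch it yields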
Many thanks, @jyshon23! I implemented your suggestions and, as a baseline, confirmed training works when nprocs = 1. However, nprocs = 8 still hangs. This time, removing xm.mark_step() lets it move on to the second step, but it gets stuck at the model call, outputs = model(ids, mask, token_type_ids).
I expect that enabling the environment variables should produce some logs from the interpreter?
import os

# TF_CPP_VMODULE takes a single comma-separated list; note these variables
# must be set before torch_xla is imported in order to take effect.
os.environ['PT_XLA_DEBUG'] = "1"
os.environ['TF_CPP_LOG_THREAD_ID'] = "1"
os.environ['TF_CPP_MIN_LOG_LEVEL'] = "0"
os.environ['TF_CPP_VMODULE'] = "tensor=5,computation_client=5,xrt_computation_client=5,aten_xla_type=1"
Thanks!
Hi! I was finally able to pull information out of the environment variables by running the script as a .py instead of in an .ipynb, like: !PT_XLA_DEBUG=1 TF_CPP_LOG_THREAD_ID=1 TF_CPP_MIN_LOG_LEVEL=0 TF_CPP_VMODULE=tensor=5,computation_client=5,xrt_computation_client=5,aten_xla_type=1 python -u train.py
Here are the results, though they may not be very informative:
2021-05-17 23:29:57.812187: I 3547 tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:root:TPU has started up successfully with version pytorch-1.8.1
2021-05-17 23:30:00.692445: I 3560 tensorflow/compiler/xla/xla_client/mesh_service.cc:312] Waiting to connect to client mesh master (300 seconds) b0f6c9bce863:53413
2021-05-17 23:30:00.706905: I 3564 tensorflow/compiler/xla/xla_client/mesh_service.cc:312] Waiting to connect to client mesh master (300 seconds) b0f6c9bce863:53413
2021-05-17 23:30:00.713162: I 3570 tensorflow/compiler/xla/xla_client/mesh_service.cc:312] Waiting to connect to client mesh master (300 seconds) b0f6c9bce863:53413
2021-05-17 23:30:00.724224: I 3576 tensorflow/compiler/xla/xla_client/mesh_service.cc:312] Waiting to connect to client mesh master (300 seconds) b0f6c9bce863:53413
2021-05-17 23:30:00.734355: I 3580 tensorflow/compiler/xla/xla_client/mesh_service.cc:312] Waiting to connect to client mesh master (300 seconds) b0f6c9bce863:53413
2021-05-17 23:30:00.744493: I 3584 tensorflow/compiler/xla/xla_client/mesh_service.cc:312] Waiting to connect to client mesh master (300 seconds) b0f6c9bce863:53413
2021-05-17 23:30:00.754056: I 3588 tensorflow/compiler/xla/xla_client/mesh_service.cc:312] Waiting to connect to client mesh master (300 seconds) b0f6c9bce863:53413
2021-05-17 23:30:25.099310: I 3576 tensorflow/compiler/xla/xla_client/computation_client.cc:203] Fetching mesh configuration for worker tpu_worker (host_ordinal=0):0 from mesh service at b0f6c9bce863:53413
2021-05-17 23:30:26.269124: I 3570 tensorflow/compiler/xla/xla_client/computation_client.cc:203] Fetching mesh configuration for worker tpu_worker (host_ordinal=0):0 from mesh service at b0f6c9bce863:53413
2021-05-17 23:30:26.974932: I 3564 tensorflow/compiler/xla/xla_client/computation_client.cc:203] Fetching mesh configuration for worker tpu_worker (host_ordinal=0):0 from mesh service at b0f6c9bce863:53413
2021-05-17 23:30:27.845523: I 3584 tensorflow/compiler/xla/xla_client/computation_client.cc:203] Fetching mesh configuration for worker tpu_worker (host_ordinal=0):0 from mesh service at b0f6c9bce863:53413
2021-05-17 23:30:28.775142: I 3588 tensorflow/compiler/xla/xla_client/computation_client.cc:203] Fetching mesh configuration for worker tpu_worker (host_ordinal=0):0 from mesh service at b0f6c9bce863:53413
2021-05-17 23:30:28.874595: I 3560 tensorflow/compiler/xla/xla_client/computation_client.cc:203] Fetching mesh configuration for worker tpu_worker (host_ordinal=0):0 from mesh service at b0f6c9bce863:53413
2021-05-17 23:30:29.900606: I 3580 tensorflow/compiler/xla/xla_client/computation_client.cc:203] Fetching mesh configuration for worker tpu_worker (host_ordinal=0):0 from mesh service at b0f6c9bce863:53413
2021-05-17 23:30:37.928487: I 3559 tensorflow/core/profiler/rpc/profiler_server.cc:46] Profiler server listening on [::]:9012 selected port:9012
2021-05-17 23:30:46.457900: I 3559 torch_xla/csrc/tensor_util.cpp:28] Using BF16 data type for floating point values
Train for 12638343 steps per epoch
num_training_steps = 49368, world_size=8
Starting training in epoch: 0
Entering training loop
2021-05-17 23:30:51.221439: I 3559 torch_xla/csrc/tensor.cpp:1373] 204 live tensors: devices=()
2021-05-17 23:30:51.221548: I 3559 torch_xla/csrc/tensor.cpp:1122] Waiting on device barrier for device TPU:0 ...
2021-05-17 23:30:51.221591: I 3559 torch_xla/csrc/tensor.cpp:1128] Waiting on device barrier for device TPU:0 done!
2021-05-17 23:30:51.272911: I 3559 torch_xla/csrc/tensor.cpp:1168] Tensors graph hash f8539b4c16d32784d76c0592dedb929d on device TPU:0
2021-05-17 23:30:51.274962: I 3559 torch_xla/csrc/tensor.cpp:1592] Parameter sequence graph hash e9be56370f81f759e99c0d5c7da37fa2
2021-05-17 23:30:51.357767: I 3559 torch_xla/csrc/tensor.cpp:1556] Compiling IR graph hash e9be56370f81f759e99c0d5c7da37fa2 on device TPU:0 ...
2021-05-17 23:30:51.404462: I 3559 torch_xla/csrc/tensor.cpp:1561] Compiling IR graph hash e9be56370f81f759e99c0d5c7da37fa2 on device TPU:0 done!
2021-05-17 23:30:51.405821: I 3559 torch_xla/csrc/tensor.cpp:1563] Graph hash e9be56370f81f759e99c0d5c7da37fa2 is computation hash e622f901bc78b50e06778eb7026a41da
2021-05-17 23:30:51.406392: I 3559 torch_xla/csrc/tensor.cpp:1602] TensorsGraphSize=1020
2021-05-17 23:30:51.410112: I 3693 torch_xla/csrc/tensor.cpp:1298] Executing IR graph hash e9be56370f81f759e99c0d5c7da37fa2 on device TPU:0 ...
2021-05-17 23:30:51.419282: I 3693 torch_xla/csrc/tensor.cpp:1303] Executing IR graph hash e9be56370f81f759e99c0d5c7da37fa2 on device TPU:0 done!
Start
Extract data
Zero Grad
Model
Loss
2021-05-17 23:31:24.800329: I 3559 torch_xla/csrc/tensor.cpp:1122] Waiting on device barrier for device TPU:0 ...
2021-05-17 23:31:24.800419: I 3559 torch_xla/csrc/tensor.cpp:1128] Waiting on device barrier for device TPU:0 done!
2021-05-17 23:31:24.800470: I 3559 torch_xla/csrc/tensor.cpp:1168] Tensors graph hash 2c69fbdc154c900d1b019feaa507cd74 on device TPU:0
2021-05-17 23:31:24.801801: I 3559 torch_xla/csrc/tensor.cpp:1592] Parameter sequence graph hash 8e5c6ca6d938cff3d4dcb32fbcffb22e
2021-05-17 23:31:24.914196: I 3559 torch_xla/csrc/tensor.cpp:1556] Compiling IR graph hash 8e5c6ca6d938cff3d4dcb32fbcffb22e on device TPU:0 ...
2021-05-17 23:31:24.959115: I 3559 torch_xla/csrc/tensor.cpp:1561] Compiling IR graph hash 8e5c6ca6d938cff3d4dcb32fbcffb22e on device TPU:0 done!
2021-05-17 23:31:24.961928: I 3559 torch_xla/csrc/tensor.cpp:1563] Graph hash 8e5c6ca6d938cff3d4dcb32fbcffb22e is computation hash 986ddacef03810dea5d00e9ec233f56
2021-05-17 23:31:24.962524: I 3559 torch_xla/csrc/tensor.cpp:1602] TensorsGraphSize=2273
2021-05-17 23:31:24.962581: I 3693 torch_xla/csrc/tensor.cpp:1298] Executing IR graph hash 8e5c6ca6d938cff3d4dcb32fbcffb22e on device TPU:0 ...
2021-05-17 23:31:25.021284: I 3693 torch_xla/csrc/tensor.cpp:1303] Executing IR graph hash 8e5c6ca6d938cff3d4dcb32fbcffb22e on device TPU:0 done!
[xla:0](0) Loss=1.28125 Rate=0.00 GlobalRate=0.00 Time=Mon May 17 23:31:25 2021
Backward
2021-05-17 23:31:25.059904: I 3559 torch_xla/csrc/tensor.cpp:1373] 439 live tensors: devices=()
2021-05-17 23:31:25.059993: I 3559 torch_xla/csrc/tensor.cpp:1122] Waiting on device barrier for device TPU:0 ...
2021-05-17 23:31:25.060011: I 3559 torch_xla/csrc/tensor.cpp:1128] Waiting on device barrier for device TPU:0 done!
2021-05-17 23:31:25.061081: I 3559 torch_xla/csrc/tensor.cpp:1168] Tensors graph hash bd01710a26633b903f90743d92f0327d on device TPU:0
2021-05-17 23:31:25.062721: I 3559 torch_xla/csrc/tensor.cpp:1592] Parameter sequence graph hash dd5421c93af5a320b6606bceae91e861
2021-05-17 23:31:25.132474: I 3559 torch_xla/csrc/tensor.cpp:1556] Compiling IR graph hash dd5421c93af5a320b6606bceae91e861 on device TPU:0 ...
2021-05-17 23:31:25.206239: I 3559 torch_xla/csrc/tensor.cpp:1561] Compiling IR graph hash dd5421c93af5a320b6606bceae91e861 on device TPU:0 done!
2021-05-17 23:31:25.211813: I 3559 torch_xla/csrc/tensor.cpp:1563] Graph hash dd5421c93af5a320b6606bceae91e861 is computation hash ae93d9c8850271bb1f23a071977519e0
2021-05-17 23:31:25.213132: I 3559 torch_xla/csrc/tensor.cpp:1602] TensorsGraphSize=4522
2021-05-17 23:31:25.216170: I 3693 torch_xla/csrc/tensor.cpp:1298] Executing IR graph hash dd5421c93af5a320b6606bceae91e861 on device TPU:0 ...
Step Optimizer
2021-05-17 23:31:25.384339: I 3693 torch_xla/csrc/tensor.cpp:1303] Executing IR graph hash dd5421c93af5a320b6606bceae91e861 on device TPU:0 done!
Scheduler
Start
Extract data
Zero Grad
Model
Loss
2021-05-17 23:31:25.448908: I 3559 torch_xla/csrc/tensor.cpp:1122] Waiting on device barrier for device TPU:0 ...
2021-05-17 23:31:25.448985: I 3559 torch_xla/csrc/tensor.cpp:1128] Waiting on device barrier for device TPU:0 done!
2021-05-17 23:31:25.449014: I 3559 torch_xla/csrc/tensor.cpp:1168] Tensors graph hash c87bb48399127cedece4ca286d0e1ed2 on device TPU:0
2021-05-17 23:31:25.452055: I 3559 torch_xla/csrc/tensor.cpp:1592] Parameter sequence graph hash 83f75c15d76c2047dda16e371c8f7416
2021-05-17 23:31:25.520943: I 3559 torch_xla/csrc/tensor.cpp:1556] Compiling IR graph hash 83f75c15d76c2047dda16e371c8f7416 on device TPU:0 ...
2021-05-17 23:31:25.590626: I 3559 torch_xla/csrc/tensor.cpp:1561] Compiling IR graph hash 83f75c15d76c2047dda16e371c8f7416 on device TPU:0 done!
2021-05-17 23:31:25.597732: I 3559 torch_xla/csrc/tensor.cpp:1563] Graph hash 83f75c15d76c2047dda16e371c8f7416 is computation hash de01864c3fcb3971fbaa97117e5af3c9
2021-05-17 23:31:25.599633: I 3559 torch_xla/csrc/tensor.cpp:1602] TensorsGraphSize=7809
2021-05-17 23:31:25.599766: I 3693 torch_xla/csrc/tensor.cpp:1298] Executing IR graph hash 83f75c15d76c2047dda16e371c8f7416 on device TPU:0 ...
Stuck at the second step, in the loss computation... Is there anything more I can provide? Many thanks.
OK, thanks for the info! It looks like the graph execution is hanging, as you can see in the last log line. To let us look at the graph that is hanging on execution, could you dump it by setting XLA_SAVE_TENSORS_FILE=path/to/file.hlo XLA_SAVE_TENSORS_FMT=hlo XLA_IR_DEBUG=1 XLA_HLO_DEBUG=1?
What is weird, though, is that I don't see log lines printed after "Loss"; I'd expect execution only after "Backward" and "Mark Step" are logged.
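For example, relaunching with those variables set (the file path is just a placeholder):
XLA_SAVE_TENSORS_FILE=/tmp/graphs.hlo XLA_SAVE_TENSORS_FMT=hlo XLA_IR_DEBUG=1 XLA_HLO_DEBUG=1 python -u train.py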
Hi, this is the graph for the first step; since the second step never finishes, I think its graph is not saved.
I am trying to make the script accessible so that someone could reproduce the problem.
Thanks!
Hi, I have made the notebook accessible; it should work straight away. Thanks.
https://colab.research.google.com/drive/1cOiQjaElTXsdxNVDhSMfvauibyNzkImW?usp=sharing
Hi... Is there anything I can do to help diagnose the issue further? Thanks
Hi, sorry for the slow response; I'm running it myself currently.
You can also view the debug logging you turned on under Runtime > View Runtime Logs.
Also, given that it looks like you're trying to train using the HF transformers library, have you given the trainer.py based scripts a try? We test the trainer-based examples on a nightly basis, so those should be less likely to have breakages. Here is an example command we run in nightly testing for a classification fine-tuning task:
python examples/pytorch/xla_spawn.py \
--num_cores 8 \
examples/pytorch/text-classification/run_glue.py \
--logging_dir=./tensorboard-metrics \
--task_name MNLI \
--cache_dir ./cache_dir \
--do_train \
--do_eval \
--num_train_epochs 3 \
--max_seq_length 128 \
--learning_rate 3e-5 \
--output_dir MNLI \
--overwrite_output_dir \
--logging_steps 30 \
--save_steps 3000 \
--overwrite_cache \
--tpu_metrics_debug \
--model_name_or_path roberta-large \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 16
Thanks, Daniel, I will have a look at the trainer.py script. The reason for writing this script is to learn the underlying nuts and bolts of the task at hand, rather than calling the higher-level trainer.py to do the job.
Anyhow, there is apparently something inside the TPU stack that is beyond my expertise, as this script works with 1 core.
Please look into this issue when you have time, as I am curious why this happens.
I shall try trainer.py now. Thanks again.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
❓ Questions and Help
Hi all, I am training a transformer model on a custom dataset (~800 MB) on a Colab (or Kaggle) TPU. My code works when setting nprocs = 1, but hangs after reaching xm.mark_step() in the first step and entering the second step. Unfortunately, I cannot run met.metrics_report(), since it hangs before the report can be produced, though I could produce one when nprocs = 1.
I tried troubleshooting by checking whether my computation graph randomly changes, such that XLA would need to recompile before feeding the TPU devices. The model is pretty standard, and other than the dropout layer I don't see any problem at all. I also pad all sentences to the same length, so the input tensor shape doesn't change from batch to batch. I also make sure I do not use .item(), if, any, or similar, so tensors don't need to be passed back and forth, etc.
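(For reference, the single-core metrics report earlier in this thread was produced with torch_xla's metrics API; a minimal way to print it:)
import torch_xla.debug.metrics as met
print(met.metrics_report())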
Below is the code related to the problem at hand. I believe the data loading part is fine, as I am able to train when nprocs = 1.
The computation graph is pretty standard; I just modified it from some tutorials.
The training loop is also pretty standard; it prints each step so I can see where the loop reaches during training (a sketch of its shape follows).
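The notebook code itself is not reproduced here, so the following is a minimal sketch of the loop's shape, reconstructed from the step names printed in the logs above (loss_fn and the tensor names are placeholders, not my exact code; the explicit xm.mark_step() is the one discussed above):
import torch
import torch_xla.core.xla_model as xm

def train_loop_fn(data_loader, model, optimizer, scheduler, loss_fn, device):
    model.train()
    for bi, data in enumerate(data_loader):
        print('Start')
        print('Extract data')
        ids = data['ids'].to(device, dtype=torch.long)
        mask = data['mask'].to(device, dtype=torch.long)
        token_type_ids = data['token_type_ids'].to(device, dtype=torch.long)
        targets = data['targets'].to(device, dtype=torch.long)
        print('Zero Grad')
        optimizer.zero_grad()
        print('Model')
        outputs = model(ids, mask, token_type_ids)
        print('Loss')
        loss = loss_fn(outputs, targets)
        print('Backward')
        loss.backward()
        print('Step Optimizer')
        xm.optimizer_step(optimizer)  # all-reduce gradients across cores, then step
        print('Scheduler')
        scheduler.step()
        print('Mark Step')
        xm.mark_step()  # cuts the graph and triggers execution on the device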
The main map function and the call to xmp.spawn follow the usual multiprocessing pattern; a self-contained sketch follows.
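This is a toy stand-in, not my actual notebook code: a linear model and random tensors replace the transformer and the real dataset, and real code would shard data per core with a DistributedSampler.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # Each spawned process drives one TPU core.
    device = xm.xla_device()
    model = nn.Linear(16, 2).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
    loader = DataLoader(dataset, batch_size=8)
    para_loader = pl.ParallelLoader(loader, [device])
    model.train()
    for x, y in para_loader.per_device_loader(device):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        xm.optimizer_step(optimizer)

if __name__ == '__main__':
    xmp.spawn(_mp_fn, args=(), nprocs=8, start_method='fork')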
The outputs show it hangs and will not enter the second step (i.e., bi = 2). Any ideas?
Many thanks.