graphcore / poptorch

PyTorch interface for the IPU
https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/
MIT License

Errors when testing /examples/mnist.py #3

Closed by Blair-Johnson 2 years ago

Blair-Johnson commented 2 years ago

Hello, I have installed the Poplar SDK v2.4.0 and the accompanying PopTorch wheel. I'm using an IPU-M2000, and the mnist.py example in this repository fails during the testing phase with the following output:

Graph compilation:
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:51<00:00]
2022-01-18T21:18:39.062713Z popart:session 51249.51249 W: Trying to set the random seed, but this session has no random behaviour. Doing nothing.
PoptorchIPU loss at batch: 0 is tensor([2.2745, 2.3198, 2.2587, 2.3063])
Training accuracy: 11.25% from batch of size 80
PoptorchIPU loss at batch: 10 is tensor([0.5195, 0.3766, 0.4090, 0.4410])
Training accuracy: 87.5% from batch of size 80
PoptorchIPU loss at batch: 20 is tensor([0.3463, 0.7661, 0.6643, 0.3699])
Training accuracy: 76.25% from batch of size 80
PoptorchIPU loss at batch: 30 is tensor([0.1408, 0.4287, 0.4940, 0.1002])
Training accuracy: 90.0% from batch of size 80
Done training
[16:18:46.542] [poptorch::python] [critical] poptorch.poptorch_core.Error: In poptorch/popart_compiler/source/CompilerImpl.cpp:627: 'poptorch_cpp_error': Failed to acquire 1 IPU(s)
Error raised in:
  [0] Compiler::initSession
  [1] LowerToPopart::compile

Traceback (most recent call last):
  File "mnist.py", line 182, in <module>
    example()
  File "mnist.py", line 176, in example
    test()
  File "mnist.py", line 150, in test
    output = inference_model(data)
  File "/root/gcore/lib/python3.6/site-packages/poptorch/_poplar_executor.py", line 761, in __call__
    self._compile(in_tensors)
  File "/root/gcore/lib/python3.6/site-packages/poptorch/_poplar_executor.py", line 505, in _compile
    *trace_args)
poptorch.poptorch_core.Error: In poptorch/popart_compiler/source/CompilerImpl.cpp:627: 'poptorch_cpp_error': Failed to acquire 1 IPU(s)
Error raised in:
  [0] Compiler::initSession
  [1] LowerToPopart::compile

I was surprised that the IPU was accessible during training, but unavailable during testing. Any ideas would be greatly appreciated!

AnthonyBarbier commented 2 years ago

Hi, how many IPUs are available on your system? You might not have enough IPUs to keep both the training and validation graphs loaded at the same time. If so, you need to detach the training graph before running the validation one by calling training_model.detachFromDevice()
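
A minimal sketch of that ordering, assuming training_model and inference_model are the PopTorch-wrapped models from mnist.py. The helper name run_inference_after_training and its batches parameter are hypothetical and only illustrate the order of the calls. The only PopTorch call used is detachFromDevice(), as suggested above.

```python
def run_inference_after_training(training_model, inference_model, batches):
    """Release the IPU held by the training graph, then run inference.

    Both models are assumed to be PopTorch-wrapped executors. This helper
    is a hypothetical illustration, not part of mnist.py.
    """
    # Detach the training graph so the inference graph can acquire the IPU.
    training_model.detachFromDevice()

    outputs = []
    for data in batches:
        # On a single-IPU system, this call would fail with
        # "Failed to acquire 1 IPU(s)" if the training graph were still attached.
        outputs.append(inference_model(data))
    return outputs
```

In mnist.py this would amount to calling training_model.detachFromDevice() once after the training loop finishes and before the first inference_model(data) call in test().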

Blair-Johnson commented 2 years ago

You're correct; we only have a single IPU. This fixed it, thank you!