graphcore / poptorch

PyTorch interface for the IPU
https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/
MIT License

Error: In include/poptorch_err/ExceptionHandling.hpp:76: 'poplar_stream_memory_allocation_error' #9

Open Lime-Cakes opened 1 year ago

Lime-Cakes commented 1 year ago

Is there an explanation of this error?

/usr/local/lib/python3.8/dist-packages/poptorch/experimental.py in __exit__(self, exc_type, value, traceback)
    253         if self._compile_using == enums.Compiler.PopART:
    254             # Compile the captured graph using PopART.
--> 255             self._executable = poptorch_core.compileWithManualTracing(
    256                 self._options.toDict(), accessAttributes, self._training,
    257                 self._dict_optimizer,

Error: In include/poptorch_err/ExceptionHandling.hpp:76: 'poplar_stream_memory_allocation_error': /opt/jenkins/workspace/poplar/poplar_ci_ubuntu_20_04_unprivileged/popart/willow/src/popx/irlowering.cpp:3516 Out of memory. Single stream of length 149766584 in bufferIndex 1
Error raised in:
  [0] popart::Session::prepareDevice: Poplar compilation
  [1] Compiler::compileAndPrepareDevice
  [2] LowerToPopart::compile
  [3] compileWithManualTracing

ariannas-graphcore commented 1 year ago

Hi,

The exception you're hitting is thrown when the allocation of a stream buffer fails. In this case, a single data stream is using too much memory ("Single stream of length 149766584", as per the error), which can happen, for example, when very large outputs are streamed back from the IPU to the host.
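To get a feel for the scale of the failing stream, the reported length can be converted to a more readable unit and compared against the size of a candidate tensor. This is only a rough sketch: it assumes the reported length is in bytes (the error message does not say so), and both helper names are hypothetical, not part of PopTorch or Poplar.

```python
from math import prod


def stream_size_mib(length_bytes: int) -> float:
    """Convert a length in bytes to MiB (assumes the reported length is in bytes)."""
    return length_bytes / (1024 ** 2)


def tensor_stream_bytes(shape, bytes_per_element: int = 4) -> int:
    """Estimate the bytes needed to stream one tensor of the given shape.

    bytes_per_element defaults to 4 (float32); use 2 for float16.
    """
    return prod(shape) * bytes_per_element


# Length taken from the error message in this issue.
reported_length = 149766584
print(f"Reported stream: {stream_size_mib(reported_length):.1f} MiB")

# Hypothetical output shape, purely for illustration.
example_shape = (16, 1024, 1024)
print(f"Example tensor: {stream_size_mib(tensor_stream_bytes(example_shape)):.1f} MiB")
```

Comparing such estimates for each model input and output against the reported length may help narrow down which tensor the failing stream belongs to.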

It would be great if you could share instructions to reproduce the error and the SDK version you've been using: this would enable us to share specific remedies for the issue. In the meantime, here are some more generic suggestions that might help:

Hope this helps for now, I encourage you to share a reproducer for us to investigate this further.

Best, Arianna

Lime-Cakes commented 1 year ago

This is for training, so I left outputMode at default, which should already be last. What values would be streamed from IPU to host when training? The loss calculation would be done on the IPU since it's part of the model.

Could the stream error be related to data streaming between IPUs?

payoto commented 1 year ago

No, those streams are for IPU/host communication. Have you had a chance to try any of the other suggestions @ariannas-graphcore provided?

Lime-Cakes commented 1 year ago

Yeah. Those either don't work or can't be used (the values can't be lowered any further).

payoto commented 1 year ago

By any chance, are you trying to profile the Poplar executable by generating a PopVision profile (by setting the environment variable POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true"}')? I have seen that occasionally cause host stream issues in the past.
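One quick way to rule this out is to inspect the environment of the process before compiling. The sketch below assumes POPLAR_ENGINE_OPTIONS holds a JSON object and treats any `autoReport.*` key set to "true" as profiling being enabled; the function name `profiling_enabled` is hypothetical and not part of any Graphcore library.

```python
import json
import os
from typing import Optional


def profiling_enabled(raw_options: Optional[str]) -> bool:
    """Return True if the options string switches on any autoReport.* option."""
    if not raw_options:
        return False
    try:
        options = json.loads(raw_options)
    except json.JSONDecodeError:
        # Not valid JSON, so this sketch cannot interpret it.
        return False
    if not isinstance(options, dict):
        return False
    return any(
        key.startswith("autoReport.") and str(value).lower() == "true"
        for key, value in options.items()
    )


# Check the environment of the current process.
print(profiling_enabled(os.environ.get("POPLAR_ENGINE_OPTIONS")))
```

If this prints `True`, unsetting the variable (or removing the `autoReport` entries) and compiling again would show whether profiling is contributing to the stream allocation failure.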

We can try and provide more specific support but we would need additional information on the system, software and model you are encountering the error in:

Ideally, if you can send us a code sample that reproduces the error, we can provide more specific advice to help fix the problem.