Open Lime-Cakes opened 1 year ago
Hi,
the exception you're hitting is thrown when the allocation of a stream buffer fails: in this case, a data stream is using too much memory (a single stream of length 149766584, as per the error), which can happen, for example, when overly large outputs are streamed back from the IPU to the host.
It would be great if you could share instructions to reproduce the error and the SDK version you've been using: this would enable us to share specific remedies for the issue. In the meantime, here are some more generic suggestions that might help:
Try to reduce the buffering depth of the infeeds/outfeeds in PopART. Prefetching allows the host to prepare the next buffer for the infeed/outfeed while it waits for the IPU to compute the current buffer. A buffering depth greater than 1 might improve performance, but it comes at the cost of a larger memory footprint. Alternatively, you could try to disable prefetching, which is enabled by default. You can find more details on how to do that in our PopART user guide.
Try to switch to a different poptorch.OutputMode(), e.g. if you're using All, you might want to try Final instead. This mode returns only the result for the last batch instead of a result for each batch, which reduces the amount of output data that accumulates. See the PopTorch user guide for more details.
Try to reduce the batch size and/or the PopTorch deviceIterations (see the PopTorch user guide).
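To get a rough feel for how the batch size and deviceIterations scale the data moved per step, here is a back-of-the-envelope sketch. It assumes that one call to the model consumes batch size × deviceIterations × replication factor × gradient accumulation samples, as the PopTorch user guide describes for the combined batch size. The helper names and the example shape are hypothetical, and the result is only an estimate of input volume, not how Poplar actually sizes its stream buffers.

```python
from math import prod


def samples_per_step(batch_size, device_iterations=1,
                     replication_factor=1, gradient_accumulation=1):
    """Number of samples consumed by one host-side call to the model.

    Assumption: the combined batch size is the product of these four
    settings, as described in the PopTorch user guide.
    """
    return (batch_size * device_iterations
            * replication_factor * gradient_accumulation)


def estimated_bytes_per_step(sample_shape, bytes_per_element, **step_settings):
    """Rough estimate of the input data transferred per step, in bytes.

    This is an illustration only; real stream buffer sizes also depend on
    buffering depth, outputs, and other factors not modelled here.
    """
    elements_per_sample = prod(sample_shape)
    return (samples_per_step(**step_settings)
            * elements_per_sample * bytes_per_element)


if __name__ == "__main__":
    # Hypothetical example: float32 images of shape (3, 224, 224).
    for iterations in (1, 8, 32):
        size = estimated_bytes_per_step(
            (3, 224, 224), 4, batch_size=4, device_iterations=iterations)
        print(f"deviceIterations={iterations}: ~{size / 1e6:.1f} MB per step")
```

Halving either the batch size or deviceIterations halves this estimate, which is why those two settings are usually the first ones worth lowering.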
Hope this helps for now. I encourage you to share a reproducer so that we can investigate this further.
Best, Arianna
This is for training, so I left outputMode at default, which should already be last. What values would be streamed from IPU to host when training? The loss calculation would be done on the IPU since it's part of the model.
Could the stream error be related to data streaming between IPUs?
No, those streams are for IPU/host communication. Have you had a chance to try any of the other suggestions @ariannas-graphcore provided?
Yeah. Those either don't work or can't be used (the values can't be lowered any further).
By any chance, are you trying to profile the Poplar executable by generating a PopVision profile (by setting the environment variable POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true"}')?
I have seen that occasionally cause host stream issues in the past.
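A quick way to check whether profiling is switched on in the environment your job runs in is sketched below. It assumes POPLAR_ENGINE_OPTIONS holds a JSON object and treats any `autoReport.*` key set to "true" as profiling being enabled; the function name `profiling_enabled` is hypothetical and not part of any SDK.

```python
import json
import os


def profiling_enabled(env=None):
    """Return True if POPLAR_ENGINE_OPTIONS turns on an autoReport option.

    Assumption: the variable contains a JSON object such as
    {"autoReport.all": "true"}. Unset or unparsable values return False.
    """
    env = os.environ if env is None else env
    raw = env.get("POPLAR_ENGINE_OPTIONS", "")
    if not raw:
        return False
    try:
        options = json.loads(raw)
    except json.JSONDecodeError:
        # Malformed JSON (for example "=" instead of ":") is not parsed here.
        return False
    if not isinstance(options, dict):
        return False
    return any(
        key.startswith("autoReport") and str(value).lower() == "true"
        for key, value in options.items()
    )


if __name__ == "__main__":
    if profiling_enabled():
        print("Profiling is enabled; try unsetting POPLAR_ENGINE_OPTIONS.")
    else:
        print("No autoReport option is enabled in POPLAR_ENGINE_OPTIONS.")
```

If this reports that profiling is enabled, unsetting the variable and rerunning is a cheap way to rule it out as the cause.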
We can try to provide more specific support, but we would need additional information on the system, software and model in which you are encountering the error.
Ideally if you can send us a code sample which reproduces the error we can provide more specific advice to help fix the problem.
Is there an explanation of this error?