Closed by LPanosTT 15 hours ago
@LPanosTT, this means that we ran out of L1, so we probably need to change the sharding strategy, likely to block sharded. Dropping precision on the weights might work too. Another possibility is fragmented memory, which could also lead to this error, but given that we can repro it in isolation as a single op, the sharding strategy is the most likely candidate.
I don't think this is a bug with the op, unless we think that it should have automatically picked block sharding.
@LPanosTT I think we want to dump the output memory config that the op picks, to double-check that it's something sane. It might have picked something pretty non-optimal by default.
@LPanosTT Please use BLOCK_SHARDING in this case, since the height is small (cannot use more than 7 cores) but the width is quite large. It passes for me with BLOCK_SHARDING:
```python
conv_config = ttnn.Conv2dConfig(
    shard_layout=ttnn.TensorMemoryLayout.BLOCK_SHARDED,
)
```
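The intuition behind this advice can be illustrated with a toy heuristic. Note this is a hypothetical sketch, not ttnn API or its actual logic: height sharding splits only the flattened spatial dimension of the conv output across cores, so a short-but-wide output leaves most cores idle, while block sharding splits both dimensions.

```python
# Toy illustration (hypothetical, not ttnn): choose a shard layout from
# the shape of the conv output matrix. Height sharding splits only the
# (batch * out_h * out_w) dimension across cores; if that dimension is
# small, few cores can be used. Block sharding splits both the height
# and the width (output-channel) dimension.

def pick_shard_layout(matrix_height, matrix_width, num_cores=64, tile=32):
    height_tiles = max(1, matrix_height // tile)
    if height_tiles >= num_cores:
        return "HEIGHT_SHARDED"  # enough rows to keep every core busy
    # Short but wide: split along both dimensions instead.
    return "BLOCK_SHARDED"

# A short, wide conv output (e.g. 7x7 spatial, 2048 channels, as in the
# last ResNet50 stage) cannot occupy many cores under height sharding.
print(pick_shard_layout(matrix_height=7 * 7, matrix_width=2048))
# -> BLOCK_SHARDED
```

The exact core counts and tile sizes here are placeholders; the point is only that the layout choice follows from the output matrix's aspect ratio.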
@nsmithtt on your note above: "unless we think that it should have automatically picked block sharding."
Maybe this is a good place to ask that question: can the op adapt and choose what is required to be functional?
I believe the compiler should ideally play with specific op config overrides only in order to boost perf (and the expectation from the op would be that it can run with a default, potentially low-perf implementation)?
@tt-mpantic: ... I believe ideally compiler should play with specific op config overrides just to boost perf (and expectation from op would be that it can run with default/potentially low perf implementation). ...
I agree with this statement! From a generality perspective it would be good to have an op path (configuration) that just works, without needing to special-case the sharding depending on weight or activation shapes.
@nsmithtt: ...this means that we ran out of L1 so we need to change sharding strategy probably to Block sharded. Dropping precision on the weights might work too. ...
Please remind me, @LPanosTT, did you hit any bigger issues when pushing the conv op to work on DRAM? That way we'd escape the memory management and sharding issues for sure.
So far.... no.
@mywoodstock Using block sharding for this conv worked. Thanks!
Shall we close this then, @LPanosTT?
@mywoodstock Yes.
@nsmithtt on your note above: "unless we think that it should have automatically picked block sharding." Maybe this is a good place to ask that question: can the op adapt and choose what is required to be functional?
Unfortunately I don't think this is possible, for the exact case that this issue is covering. There are situations where either your memory is fragmented or L1 is very full, and you don't know that you'll run out of memory until you actually invoke the op, at which point it's too late.
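The fragmentation failure mode described here can be sketched with a toy first-fit allocator (purely illustrative, not the actual L1 allocator): the total free space may be sufficient, yet no single contiguous region is large enough, and this is only discovered when the allocation is attempted.

```python
# Toy first-fit allocator over a list of free (offset, size) regions,
# purely illustrative of the fragmentation failure mode.

def first_fit(free_regions, request):
    """Return the offset of the first free region that fits, or None."""
    for offset, size in free_regions:
        if size >= request:
            return offset
    return None

# 96 KB free in total, but split into three non-contiguous 32 KB holes:
free_regions = [(0, 32 * 1024), (64 * 1024, 32 * 1024), (128 * 1024, 32 * 1024)]
total_free = sum(size for _, size in free_regions)

print(total_free >= 48 * 1024)             # True: enough free memory overall
print(first_fit(free_regions, 48 * 1024))  # None: no contiguous 48 KB hole
```

An up-front planner could only avoid this if it saw the whole allocation history; a single op picking its own config at invoke time cannot.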
@nsmithtt can we not block shard in DRAM?
@mywoodstock, is DRAM sharded supported now?
@LPanosTT, conv cannot stream activations from DRAM, but it can stream the weights.
@nsmithtt @mywoodstock Just to get a bit more clarity on my side.
Is it possible to use the conv2d op without a specific sharding? E.g.
My question here is: can we run convs without any sharding requirement in L1? If not, is this by design, i.e. to not use DRAM for activations? If that isn't the case, should we treat this as a bug?
Thanks for providing more context! :))
**Describe the bug**
The convolution is:

**To Reproduce**
Run the following ttnn pytest:

**Additional context**
Tensorflow resnet50 contains this conv. This is blocking the bringup through forge-fe --> MLIR --> ttnn runtime.