If (C * window_w) % 32 != 0, add support for offsetting by the padding needed to reach the next multiple of 32 when writing to the activation block. We already support padding the weights' inner dim with 0s for this scenario, so the activations can have garbage in the padded part of the inner dim.
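
A minimal sketch of the padding arithmetic above, not the actual kernel code. The helper names (`padded_inner_dim`, `pad_row`, `dot`) and the garbage fill value are hypothetical; the tile width of 32 is taken from the `% 32` condition. It shows that zero-padded weights cancel whatever sits in the padded activation columns. Integer values are used so the comparison is exact.

```python
TILE_WIDTH = 32  # assumed from the "% 32" condition in the note above


def padded_inner_dim(c, window_w, tile=TILE_WIDTH):
    """Round the inner dim (c * window_w) up to the next multiple of `tile`."""
    inner = c * window_w
    return ((inner + tile - 1) // tile) * tile


def pad_row(row, padded_len, fill):
    """Extend `row` to `padded_len` entries using `fill` (hypothetical helper)."""
    return list(row) + [fill] * (padded_len - len(row))


def dot(a, b):
    """Plain dot product over two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


c, window_w = 3, 3
inner = c * window_w                    # 9, not a multiple of 32
padded = padded_inner_dim(c, window_w)  # rounded up to 32

act = list(range(1, inner + 1))
wgt = [2] * inner

# Activations carry arbitrary garbage in the padded columns,
# weights carry zeros in the matching padded rows.
act_padded = pad_row(act, padded, fill=12345)
wgt_padded = pad_row(wgt, padded, fill=0)

# The zero weights cancel the garbage, so the result is unchanged.
assert dot(act_padded, wgt_padded) == dot(act, wgt)
```

Each activation row then starts at a multiple of 32 in the block, which is the offset the reader would need to apply when writing.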
Re-work the split reader implementation to read activations in parallel into the same CB. This will remove the hard constraint (act_block_h / out_subblock_h) % 2 == 0. Currently, activations are split at subblock granularity across 2 CBs, so the compute kernel waits on and pops from 2 CBs, which is unnecessary.
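
The constraint can be illustrated with a small sketch. This is only a model of the work split, not the reader kernels themselves: both function names are hypothetical, `act_block_h` is assumed to be divisible by `out_subblock_h`, and the uneven split in the second function is one possible way a shared CB could be divided, not a confirmed design.

```python
def split_across_two_cbs(act_block_h, out_subblock_h):
    """Current scheme (as described above): each CB gets half of the subblocks,
    so the subblock count must be even."""
    num_subblocks = act_block_h // out_subblock_h
    if num_subblocks % 2 != 0:
        raise ValueError(
            "(act_block_h / out_subblock_h) % 2 must be 0 when splitting across 2 CBs"
        )
    half = num_subblocks // 2
    return half, half


def split_into_shared_cb(act_block_h, out_subblock_h):
    """Assumed re-work: two readers fill disjoint ranges of one CB.
    Returns (start_subblock, count) per reader; the halves may be uneven."""
    num_subblocks = act_block_h // out_subblock_h
    first = (num_subblocks + 1) // 2
    second = num_subblocks - first
    return (0, first), (first, second)


# 8 rows with subblocks of 2 rows -> 4 subblocks, fine in both schemes.
assert split_across_two_cbs(8, 2) == (2, 2)

# 6 rows with subblocks of 2 rows -> 3 subblocks, only the shared CB scheme works.
assert split_into_shared_cb(6, 2) == ((0, 2), (2, 1))
```

With a single CB the compute side would wait on and pop from one buffer, so an odd subblock count no longer needs special handling.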