Bring-up of Mistral and Mixtral on Wormhole for single-chip and multi-chip respectively.
Mixtral8x7b
Status
On main, inside models/demos/t3000/mixtral8x7b.
Perf target -> 33 tok/s/u
Perf estimated at around 1.78 ms per layer -> 17.5 tok/s end-to-end (seqlen = 1) and around 3.1 ms per layer -> 10 tok/s (seqlen = 2048).
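The arrow conversion follows from Mixtral8x7b's 32 decoder layers; a minimal sketch of the arithmetic, assuming per-token latency is simply layers × per-layer time and ignoring host overhead:

```python
# Convert per-layer decode latency to end-to-end tokens/s,
# assuming 32 decoder layers (Mixtral8x7b) and no host overhead.
n_layers = 32
for t_layer_ms in (1.78, 3.1):
    t_token_s = n_layers * t_layer_ms / 1000.0   # seconds per token
    print(f"{t_layer_ms} ms/layer -> {1.0 / t_token_s:.1f} tok/s")
# 1.78 ms/layer -> 17.6 tok/s
# 3.1 ms/layer -> 10.1 tok/s
```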
Tasks
[6 Jun 2024]
[x] Correctness validated for hundreds of iterations. Users stop generating at the EoS token.
[ ] Perf optimizations
[x] Device perf above target: 24.5 tok/s/u
[x] e2e perf at 1% of device perf: the main culprit is ttnn multi-device overhead, e.g. ttnn.linear takes 7 ms vs 70 us in ttlib.
[x] Fixed issue with program cache not being properly used.
[ ] Largest bottlenecks are the MLP matmuls, followed by AllGather
[x] Update-cache ops follow right after, partially due to pad and transpose ops.
[ ] Implement update cache with parallelization over batch (see the sketch below)
[x] Implement fused ops for pad + transpose and unpad + transpose.
[ ] Add embedding + argmax to device
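For the open update-cache item above, a minimal PyTorch sketch of the batch-parallel idea (the shapes and names here are illustrative assumptions, not the actual ttnn op): each user's current-token K/V is scattered into its own cache position in one vectorized write instead of a per-user loop.

```python
import torch

def update_kv_cache(cache: torch.Tensor,    # [batch, n_heads, max_seq_len, head_dim]
                    new_kv: torch.Tensor,   # [batch, n_heads, 1, head_dim], current token
                    cur_pos: torch.Tensor): # [batch], write position per user
    # Advanced indexing writes every user's slot in one vectorized op,
    # i.e. the update is parallelized over the batch dimension.
    batch = cache.shape[0]
    cache[torch.arange(batch), :, cur_pos, :] = new_kv.squeeze(2)
    return cache
```

On device, the same idea maps to splitting the update-cache kernel's work across the batch dimension rather than iterating over users serially.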
[25 May 2024]
[x] Port the code to use the ttnn multi-device API
[x] Redo sharded KV cache
[x] Due to a softmax correctness issue (different users would get different softmax values for the same input), the keys and values were reverted to interleaved
[x] Explore sharded softmax (optimal) and report the issue (or convert to interleaved right before and after the op)
[x] Improve the test-model PCC comparison
[x] Instead of relying on a baseline output, do teacher forcing and compare PCC across many iterations; it should grow higher after a short while (see the sketch after this list).
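A minimal sketch of the teacher-forced PCC check above (pure PyTorch; `ref_model`, `tt_model`, and `ref_tokens` are hypothetical stand-ins for the reference model, the device model, and a ground-truth token sequence):

```python
import torch

def pcc(a: torch.Tensor, b: torch.Tensor) -> float:
    """Pearson correlation coefficient between two flattened tensors."""
    a, b = a.flatten().float(), b.flatten().float()
    a, b = a - a.mean(), b - b.mean()
    return (a @ b / (a.norm() * b.norm())).item()

def teacher_forced_pcc(ref_model, tt_model, ref_tokens, n_iters=100):
    """Feed both models the same ground-truth token each step (teacher
    forcing) and track per-iteration logit PCC, instead of comparing a
    single free-running baseline output."""
    pccs = []
    for i in range(n_iters):
        tok = ref_tokens[:, i : i + 1]   # ground-truth token, not argmax
        ref_logits = ref_model(tok)
        tt_logits = tt_model(tok)
        pccs.append(pcc(ref_logits, tt_logits))
    return pccs
```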
Mistral
Status
On main, inside models/demos/mistral.
Device perf: 13.3 tok/s/u
e2e perf: 10.9 tok/s/u
Tasks
[x] Improved weight loading to use a flag system.
[x] Stress test (repeat iteration 100 of the single-decoder test for 1 to 2 hours to test stability; see the sketch after this list)
[x] Runs out of L1 memory after iteration 567
[x] An initial look at the L1 visualizer after 70 iterations didn't show any memory leaks
[x] Batched attention. [MERGED]
[x] Initial e2e perf measured at around 3.2 tok/s/u
[x] Hanging after iteration 31
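A sketch of the time-boxed stress loop behind the stability task above (`run_decoder_iteration` is a hypothetical stand-in for re-running the single-decoder test's iteration):

```python
import time

def stress_test(run_decoder_iteration, hours: float = 2.0):
    """Repeat one decoder iteration for a fixed wall-clock budget and
    report where it fails, to surface issues like the L1 out-of-memory
    at iteration 567 or the hang at iteration 31."""
    deadline = time.time() + hours * 3600
    i = 0
    while time.time() < deadline:
        i += 1
        try:
            run_decoder_iteration()
        except Exception as exc:   # e.g. L1 allocation failure
            print(f"failed at iteration {i}: {exc}")
            raise
    print(f"completed {i} iterations without failure")
```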
Issues
[x] CI decoder test hanging. [RESOLVED]
[x] Stress test again on the latest version and debug the memory-allocation issue if still required