Porting Swin model to n300

saichandax commented 1 month ago

[ ] Measure and record current performance.
[ ] Rebase the model to main, ensure the PCC = 0.99
[ ] #13403
[ ] Provide Op Report
[ ] Check Model into CI
[ ] Implement Demo
[ ] #13402

Executive summary (as of Nov 5):

Single-device implementation on n300 is complete, but merging is BLOCKED (pending CIs and approvals):
- Torch ops:
  - adaptive_average_pool (#13543)
    - Tried ttnn.global_average_pool, but it did not serve the purpose.
  - conv2d (#14034)
  - roll (#12157)
  - permute (#12154)
  - reshape (#12153)
- PCC = 0.95
- Demo results are approved and recorded here
- BS = 8 (try higher batch size)
- perf numbers are:
  - E2E perf: 4 samples/sec
  - Device perf: 4.82 samples/sec
Data parallel implementation on n300 is complete:
- Torch ops (as in single device implementation)
- PCC = 0.95 (Debugging in progress)
- BS = 8 (Need to test with BS>8)
- perf numbers are:
  - E2E perf: 1.45 samples/sec
  - Device perf: 139.77 samples/sec
Trace_2cqs implementation is not done yet (blocked by the torch ops remaining in the model).
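For reference, the samples/sec figures above can be read as batch size divided by time per batch. The following is only a minimal sketch of that arithmetic, not the perf harness used in this port; `samples_per_second`, `measure_e2e`, and the `run_batch` callable are hypothetical names introduced for illustration.

```python
import time


def samples_per_second(num_samples, elapsed_seconds):
    """Throughput as samples processed divided by wall-clock seconds."""
    if num_samples <= 0 or elapsed_seconds <= 0:
        raise ValueError("num_samples and elapsed_seconds must be positive")
    return num_samples / elapsed_seconds


def measure_e2e(run_batch, batch_size, iterations=10, warmup=1):
    """Time a hypothetical run_batch() callable and report end-to-end samples/sec.

    run_batch is assumed to process exactly one batch of batch_size samples,
    including any host-side pre- and post-processing.
    """
    for _ in range(warmup):
        run_batch()  # warm-up iterations are excluded from the timing
    start = time.perf_counter()
    for _ in range(iterations):
        run_batch()
    elapsed = time.perf_counter() - start
    return samples_per_second(batch_size * iterations, elapsed)
```

Under this reading, a batch of 8 that takes 2.0 s end to end corresponds to 4 samples/sec. Device perf would use device kernel time only in place of wall-clock time.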

ToDo:

Merge the single-device implementation after CIs pass and the PR is approved.
Resolve the low-PCC issue in the data-parallel implementation.
Implement perf measurement with trace_2cqs.

Sudharsan-V commented 1 month ago

Porting of the Swin model to n300 is in progress. The PCC values of the Swin model sub-modules are:

swin_embedding >0.99
swin_self_attention ~0.94
swin_attention ~0.94
swin_layer ~0.98

Corresponding draft PR #13475
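As a reminder of what the PCC figures above measure: PCC is the Pearson correlation coefficient between the flattened reference (PyTorch) output and the flattened TTNN output. Below is a minimal pure-Python sketch, assuming both outputs have already been flattened to equal-length lists of floats; `pcc` and `passes_pcc` are hypothetical helper names, not the comparison utilities used in the repo.

```python
import math


def pcc(expected, actual):
    """Pearson correlation coefficient between two equal-length flat sequences."""
    if len(expected) != len(actual) or len(expected) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    mean_e = sum(expected) / len(expected)
    mean_a = sum(actual) / len(actual)
    # Covariance and variances, unnormalised (the 1/n factors cancel out).
    cov = sum((e - mean_e) * (a - mean_a) for e, a in zip(expected, actual))
    var_e = sum((e - mean_e) ** 2 for e in expected)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_e == 0 or var_a == 0:
        raise ValueError("PCC is undefined for constant input")
    return cov / math.sqrt(var_e * var_a)


def passes_pcc(expected, actual, threshold=0.99):
    """True when the PCC meets the target threshold (0.99 in the checklist)."""
    return pcc(expected, actual) >= threshold
```

Running this check per sub-module, as in the list above, helps localise where the correlation drops.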

Sudharsan-V commented 1 month ago

Enabled the pipeline for the Swin model with PCC ~0.94.
Torch ops: roll, adaptive_avgpool1d, conv2d, reshape and permute.
The demo pipeline is enabled with an accuracy of ~0.85 (24 samples).

Sudharsan-V commented 1 month ago

@mbahnasTT, the pipeline for Swin image classification is enabled. The PCC of all sub-modules is >0.98, but the PCC of the entire model has dropped to ~0.91. Although this PCC value is slightly lower, the model's accuracy on the ImageNet dataset is as follows (for 40 samples):

Accuracy between TTNN model and ImageNet labels: 0.825
Accuracy between PyTorch model and ImageNet labels: 0.85
Accuracy between TTNN and PyTorch model: 0.90

Given these results, can we go ahead with this model? Corresponding draft PR: https://github.com/tenstorrent/tt-metal/pull/13475
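For clarity on how the three accuracy figures relate: each is the fraction of samples on which two sets of class predictions agree. Below is a minimal sketch, assuming top-1 class indices are already available as lists; `accuracy` and `compare_models` are hypothetical names, not functions from the demo script.

```python
def accuracy(predictions, references):
    """Fraction of positions where the two label sequences agree."""
    if len(predictions) != len(references) or not predictions:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(predictions)


def compare_models(ttnn_preds, torch_preds, labels):
    """Return the three agreement figures reported in this thread."""
    return {
        "ttnn_vs_labels": accuracy(ttnn_preds, labels),
        "torch_vs_labels": accuracy(torch_preds, labels),
        "ttnn_vs_torch": accuracy(ttnn_preds, torch_preds),
    }
```

The TTNN-vs-PyTorch figure isolates the effect of the port itself, independent of how well the original model fits the labels.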

mbahnasTT commented 1 month ago

@Sudharsan-V OK, please go ahead. Please keep a record of the current status and open a P2 issue. Please run ImageNet on a larger set (500-1K images); you can look at the ViT or RN50 script.

Sudharsan-V commented 1 month ago

> @Sudharsan-V OK, please go ahead. Please keep a record of the current status and open a P2 issue. Please run ImageNet on a larger set (500-1K images); you can look at the ViT or RN50 script.

Sure, will run and update the results here.

Sudharsan-V commented 1 month ago

@mbahnasTT, the demo was run for the Swin pipeline, similar to the ViT model, on the ImageNet-1k validation dataset. The results are as follows (1000 images):

Accuracy between TTNN model and ImageNet labels: 0.775
Accuracy between PyTorch model and ImageNet labels: 0.787
Accuracy between TTNN and PyTorch model: 0.89

tenstorrent/tt-metal #13333