glistering96 / AlphaRouter


Experiments on using torch 2.0 scaled dot product attention #2

Closed glistering96 closed 1 year ago

glistering96 commented 1 year ago

On TSP N=100, for 250 epochs, qkv_dim = 64, 4 encoder layers.

Run on an RTX 3060 (12 GB).

Trained on 256 episodes.

torch: 3.95 min, 6.9 GB
mine: 3.88 min, 6.9 GB
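
For reference, a minimal sketch of how runtime and peak GPU memory numbers like the ones above could be collected; this is not the repo's actual benchmarking code, and `model`, `optimizer`, `batch`, and `num_epochs` are hypothetical placeholders.

```python
import time
import torch

# Assumed/hypothetical objects, not from the repo: model, optimizer, batch, num_epochs.
torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()

for epoch in range(num_epochs):
    optimizer.zero_grad()
    loss = model(batch)   # forward pass through the encoder/decoder
    loss.backward()       # backward pass
    optimizer.step()

elapsed_min = (time.perf_counter() - start) / 60
peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
print(f"{elapsed_min:.2f} min, {peak_gb:.2f} GB peak")
```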

glistering96 commented 1 year ago

On TSP N=20, for 5000 epochs, qkv_dim = 64, 4 encoder layers. Trained on 256 episodes.

Run on an RTX 3060 (12 GB).

torch: 12.48 min
mine: 13.88 min

glistering96 commented 1 year ago

Found that PyTorch's integrated attention implementation uses a lot of memory. We are reverting all changes related to torch 2.0 attention.
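
For context, a minimal sketch of the two variants being compared, assuming a plain multi-head attention setup; this is illustrative only and not the repo's actual encoder code. The "torch" variant refers to `torch.nn.functional.scaled_dot_product_attention` introduced in PyTorch 2.0, and the shapes below are placeholders.

```python
import math
import torch
import torch.nn.functional as F

def manual_attention(q, k, v):
    # Hand-written scaled dot product attention (the "mine" variant above).
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, v)

def torch2_attention(q, k, v):
    # PyTorch 2.0 fused kernel (the "torch" variant); may dispatch to
    # FlashAttention or a memory-efficient backend depending on hardware.
    return F.scaled_dot_product_attention(q, k, v)

# Illustrative shapes only: (batch, heads, sequence, qkv_dim=64).
device = "cuda" if torch.cuda.is_available() else "cpu"
q = k = v = torch.randn(256, 4, 100, 64, device=device)
print(torch.allclose(manual_attention(q, k, v), torch2_attention(q, k, v), atol=1e-5))
```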

glistering96 commented 1 year ago

Closing this issue as it contains some incorrect experiment settings and results obtained with a flawed implementation.