Open Blank141 opened 2 months ago
Great job! When running the code, I observed that evaluation takes much longer than training. Is there any solution?
Thank you!
Batching, etc. should help reduce evaluation time.
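A rough sketch of what batched evaluation could look like, in case it helps. This is not the repo's actual evaluation code: `score_batch`, `hit_rate_at_k`, and the `targets` mapping are hypothetical names, and the assumption is that the model can score a whole list of users in one call instead of one call per user.

```python
from typing import Callable, Dict, Iterator, List, Sequence


def batches(items: Sequence[int], batch_size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def hit_rate_at_k(
    users: Sequence[int],
    targets: Dict[int, int],
    score_batch: Callable[[Sequence[int]], List[List[float]]],
    k: int = 10,
    batch_size: int = 128,
) -> float:
    """Fraction of users whose target item index is among their top-k scores.

    `score_batch` is a hypothetical stand-in for the model: it takes a batch
    of users and returns one list of item scores per user.
    """
    if not users:
        return 0.0
    hits = 0
    for user_batch in batches(users, batch_size):
        score_rows = score_batch(user_batch)  # one model call per batch
        for user, scores in zip(user_batch, score_rows):
            ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
            hits += targets[user] in ranked[:k]
    return hits / len(users)
```

With `batch_size=1` this degenerates to per-user evaluation, so the same function can be used to compare the two settings.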
Also, in the last MHA layer, most of the QKV computation is wasted and may be removed temporally.
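One possible reading of this, assuming evaluation only uses the output at the last sequence position: in the final attention layer, only the last query needs to be computed, while keys and values are still needed for every position. The single-head, pure-Python sketch below is only an illustration of that idea with hypothetical function names; it is not the model's actual attention code.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention output for a single query vector."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(logits)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def full_causal_attention(queries, keys, values) -> List[Vector]:
    """Outputs for every position; position t attends to positions 0..t."""
    return [attend(q, keys[: t + 1], values[: t + 1]) for t, q in enumerate(queries)]


def last_position_attention(queries, keys, values) -> Vector:
    """Output for the last position only, skipping all other queries."""
    return attend(queries[-1], keys, values)
```

For a sequence of length T, the second function does one query's worth of attention work instead of T, and its result matches the last row of the full computation.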