Closed: eatpk closed this 3 months ago
Thanks! I will check the performance improvement on my GPT benchmark. If I see an improvement of more than 10-20%, I will merge this PR.
This PR gave me a 5-15% improvement in my experiments. Still an improvement! You can merge this PR.
Summary
Flatten the tensor before unflattening it, to minimize data cloning when converting from a numpy mmap to a torch.Tensor.
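For readers without the diff at hand, here is a minimal sketch of what a flatten-then-unflatten load could look like. It is not the code from this PR: the function name `load_rows_flat`, the file name `tokens.bin`, and the array shapes are all hypothetical, and it assumes the goal is a single bulk copy out of the mmap followed by a zero-copy wrap and reshape on the torch side.

```python
import os
import tempfile

import numpy as np
import torch


def load_rows_flat(mmap_arr: np.ndarray, start: int, stop: int) -> torch.Tensor:
    """Hypothetical helper: copy rows [start, stop) out of a memmap once, as a
    flat buffer, then restore the original shape without another copy."""
    rows = mmap_arr[start:stop]          # slicing a memmap gives a view, no copy yet
    shape = rows.shape
    flat = np.array(rows.reshape(-1))    # the single copy: mmap pages -> flat ndarray
    tensor = torch.from_numpy(flat)      # shares memory with `flat`, no copy
    return tensor.view(shape)            # "unflatten": metadata change only


# Small demonstration on a temporary file (all values are made up).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tokens.bin")
    np.arange(24, dtype=np.int64).reshape(6, 4).tofile(path)
    mm = np.memmap(path, dtype=np.int64, mode="r", shape=(6, 4))
    batch = load_rows_flat(mm, 1, 4)
    del mm                               # release the mapping before cleanup
```

The design choice in this sketch is that the only data movement is the one `np.array(...)` call; `torch.from_numpy` and `view` reuse that buffer. Whether this matches the PR's actual change would need to be checked against the diff.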
Related Issues
N/A
Test Plan
Performance tested; observed a 50% improvement.