tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

[Perf] tt::tt_metal::allocator::FreeList::deallocate is very slow for nanogpt training #14374

Open dmakoviichuk-tt opened 2 weeks ago

dmakoviichuk-tt commented 2 weeks ago

Describe the bug: tt::tt_metal::allocator::FreeList::deallocate takes ~5% of the total host time during nanogpt training.

To Reproduce: Run nanogpt training.

Expected behavior: It should be faster.

abhullar-tt commented 2 weeks ago

Hey Denys, this is something that has come up before. I haven't had a chance to look into optimizing it, but there is definitely room for optimization in how we work out where the freed block should be inserted back into the free list.
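To make the insertion-point idea concrete, here is a minimal sketch, assuming the cost comes from a linear search for the freed block's neighbors. It keeps free blocks in a `std::map` keyed by start address, so the neighbors are found in O(log n) and merged if adjacent. The class name `AddressOrderedFreeList` and its methods are hypothetical and are not the existing `FreeList` API in tt-metal.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>

// Hypothetical illustration only; not the tt-metal FreeList class.
class AddressOrderedFreeList {
public:
    // Return the block [addr, addr + size) to the free set,
    // merging it with adjacent free blocks on either side.
    void deallocate(std::uint64_t addr, std::uint64_t size) {
        assert(size > 0);
        // First free block starting at or after addr: O(log n) lookup.
        auto next = free_.lower_bound(addr);

        // Merge with the predecessor if it ends exactly where we start.
        if (next != free_.begin()) {
            auto prev = std::prev(next);
            if (prev->first + prev->second == addr) {
                addr = prev->first;
                size += prev->second;
                free_.erase(prev);  // does not invalidate `next`
            }
        }
        // Merge with the successor if it starts exactly where we end.
        if (next != free_.end() && addr + size == next->first) {
            size += next->second;
            free_.erase(next);
        }
        free_[addr] = size;
    }

    std::size_t num_blocks() const { return free_.size(); }

    // Size of the free block starting at addr, or 0 if there is none.
    std::uint64_t size_at(std::uint64_t addr) const {
        auto it = free_.find(addr);
        return it == free_.end() ? 0 : it->second;
    }

private:
    std::map<std::uint64_t, std::uint64_t> free_;  // start address -> size
};
```

Freeing `[0, 64)`, then `[128, 192)`, then `[64, 128)` leaves a single free block of 192 bytes at address 0. Whether this beats the current code depends on where the time in `deallocate` actually goes, which the profile above does not break down.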

dmakoviichuk-tt commented 1 week ago

Hi @abhullar-tt, I think the current free-list allocator could be slightly optimized overall, but the algorithm will remain the same. I'm not sure it will help a lot.

abhullar-tt commented 1 week ago

We previously talked about exploring something like: http://www.gii.upv.es/tlsf/
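For context, TLSF (Two-Level Segregated Fit) gets constant-time allocate and free by mapping each block size to a pair of indices: a first level chosen by the position of the size's highest set bit, and a second level that splits that power-of-two range into equal subranges. A rough sketch of that mapping follows. The names `TlsfIndex`, `tlsf_mapping`, and `kSlLog2` are hypothetical, and the choice of 16 subranges is an assumption for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumed for illustration: 2^4 = 16 second-level subranges per first-level range.
constexpr unsigned kSlLog2 = 4;

struct TlsfIndex {
    unsigned fl;  // first level: index of the highest set bit of the size
    unsigned sl;  // second level: linear subrange within [2^fl, 2^(fl+1))
};

// Map a block size to its (fl, sl) free-list bucket.
// Sizes below 2^kSlLog2 would need separate handling and are rejected here.
inline TlsfIndex tlsf_mapping(std::uint64_t size) {
    assert(size >= (1ull << kSlLog2));
    unsigned fl = 63;
    while (((size >> fl) & 1u) == 0) {
        --fl;  // scan down to the most significant set bit
    }
    // Clear the top bit, then keep the next kSlLog2 bits as the subrange index.
    unsigned sl = static_cast<unsigned>((size ^ (1ull << fl)) >> (fl - kSlLog2));
    return {fl, sl};
}
```

A real implementation would replace the loop with a count-leading-zeros instruction and keep bitmaps of non-empty buckets, so that a suitable free block can be found without any searching.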

dmakoviichuk-tt commented 1 week ago

@abhullar-tt Looks interesting. But as a first step, I think it is better to optimize the existing one so that we don't introduce new issues.

abhullar-tt commented 1 week ago

> @abhullar-tt Looks interesting. But as a first step, I think it is better to optimize the existing one so that we don't introduce new issues.

Yes, I definitely agree, and I know there is room to optimize the existing implementation.