jafioti / luminal

Deep learning at the speed of light.
https://luminalai.com
Apache License 2.0

Use new_buffer_with_bytes_no_copy when creating Metal buffer. #36

Closed by jcsoo 6 months ago

jcsoo commented 6 months ago

I'm not a Metal expert, but my understanding is that new_buffer_with_data() copies the data from the source buffer into a newly allocated GPU buffer, whereas new_buffer_with_bytes_no_copy() wraps a GPU buffer object around an existing chunk of memory, such as the mmap() buffer we already have. This avoids the copy entirely, which is especially helpful when the file is still cached in memory.
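As a rough illustration of when the no-copy path applies: Metal's no-copy buffer creation expects a page-aligned pointer and a length that covers whole pages, so a loader may want to check both before choosing it and fall back to the copying call otherwise. The sketch below uses only the standard library and does not call Metal. The names `UploadPath`, `choose_upload_path`, and `ASSUMED_PAGE_SIZE` are hypothetical and not part of luminal, and the 16 KiB page size is an assumption for Apple Silicon that real code should query from the OS.

```rust
/// Assumed page size in bytes. This is an assumption for illustration;
/// real code should ask the OS for the page size instead of hard-coding it.
const ASSUMED_PAGE_SIZE: usize = 16 * 1024;

/// Hypothetical description of the two ways to hand weights to the GPU.
#[derive(Debug, PartialEq)]
enum UploadPath {
    /// Wrap the existing memory (e.g. an mmap'd file) without copying.
    NoCopy,
    /// Allocate a new GPU buffer and copy the bytes into it.
    Copy,
}

/// Hypothetical helper: choose the no-copy path only when both the start
/// address and the length are multiples of the page size.
fn choose_upload_path(addr: usize, len: usize, page_size: usize) -> UploadPath {
    let aligned = addr % page_size == 0 && len % page_size == 0;
    if aligned && len > 0 {
        UploadPath::NoCopy
    } else {
        UploadPath::Copy
    }
}

fn main() {
    // A mapping that starts on a page boundary and spans two whole pages.
    let mapped = choose_upload_path(0x1_0000_0000, 2 * ASSUMED_PAGE_SIZE, ASSUMED_PAGE_SIZE);
    // A tensor that begins 32 bytes into the mapping is not page-aligned.
    let offset = choose_upload_path(0x1_0000_0000 + 32, 4096, ASSUMED_PAGE_SIZE);
    println!("mapped region: {:?}, offset tensor: {:?}", mapped, offset);
}
```

One consequence of this check is that wrapping the whole mapping once and addressing individual tensors by offset may be easier than wrapping each tensor separately, since individual tensors rarely start on a page boundary.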

I believe that one design constraint for the GGUF format was laying out weights aligned so that this technique could be used.

For me, running on a Mac Mini M2 Pro with 16GB, this change brings model load time from around 30 seconds down to about 22 seconds on a first load, and down to about 175 ms on subsequent loads. Output is identical.

jafioti commented 6 months ago

Yes, you're correct. I thought I had that in there. Good catch!