I'm not a Metal expert, but my understanding is that new_buffer_with_data() copies data from the source buffer into a newly allocated GPU buffer, while new_buffer_with_bytes_no_copy() wraps a new GPU buffer object around an existing chunk of memory, such as the mmap() buffer we already have. Skipping the copy is especially helpful when the file is still cached in memory.
I believe that one design constraint for the GGUF format was laying out weights aligned so that this technique could be used.
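To illustrate why alignment matters here: as far as I understand it, Metal expects both the pointer and the length passed to a no-copy buffer to be page-aligned, so a tensor that starts in the middle of a page has to be wrapped as a slightly larger page-aligned span plus an inner offset. The sketch below shows that arithmetic using only the standard library. The names page_aligned_span and AlignedSpan are hypothetical (they are not part of this change or of any Metal binding), and the page size is a parameter the caller must supply; 16384 is used in the usage note only as an assumed Apple Silicon page size.

```rust
/// Page-aligned byte span, relative to the start of the mapping.
/// Hypothetical helper type for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub struct AlignedSpan {
    /// Offset of the span start from the mapping base (multiple of the page size).
    pub start: usize,
    /// Span length in bytes (multiple of the page size).
    pub len: usize,
    /// Where the requested data begins inside the span.
    pub inner_offset: usize,
}

/// Expand the byte range `[offset, offset + len)` outward to page boundaries.
///
/// Returns `None` if `page_size` is not a power of two or if the
/// arithmetic would overflow. The caller is still responsible for
/// checking that the resulting span lies inside the mapping.
pub fn page_aligned_span(offset: usize, len: usize, page_size: usize) -> Option<AlignedSpan> {
    if !page_size.is_power_of_two() {
        return None;
    }
    let mask = page_size - 1;
    // Round the start down to a page boundary.
    let start = offset & !mask;
    // Round the end up to a page boundary.
    let end = offset.checked_add(len)?.checked_add(mask)? & !mask;
    Some(AlignedSpan {
        start,
        len: end - start,
        inner_offset: offset - start,
    })
}
```

For example, with an assumed 16384-byte page, page_aligned_span(20000, 1000, 16384) yields a span starting at 16384 with length 16384 and an inner offset of 3616. Because mmap() returns a page-aligned base address, adding the span start to that base should keep the pointer page-aligned.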
On my Mac Mini M2 Pro with 16GB, this change brings model load time from around 30 seconds down to about 22 seconds on a first load, and down to about 175ms on subsequent loads. Output is identical.