Closed: abstractdescutcheon closed this 7 months ago
This is very strange. I'd like to investigate a bit more and either break up with the safetensors library entirely or figure out why it's acting up and fix it. It's definitely not normal for it to take that long. I've only ever seen it load at roughly the raw read speed of the drive, but I'm thinking this could have to do with seeking?
Is this on a HDD by any chance?
You could try running with something like `python3 -m trace --timing --trace` to narrow down where the time is going.
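Since `-m trace` logs every executed line and can be painfully slow on a big model load, `cProfile` is a cheaper function-level alternative. A minimal sketch, where `load_model` is just a placeholder for whatever call is actually slow:

```python
import cProfile
import io
import pstats

def load_model():
    # Stand-in for the real loader call; swap in your actual load path.
    return sum(i * i for i in range(200_000))

prof = cProfile.Profile()
prof.enable()
load_model()
prof.disable()

# Sort by cumulative time so the slow call chain floats to the top.
out = io.StringIO()
pstats.Stats(prof, stream=out).sort_stats("cumulative").print_stats(10)
print(out.getvalue())
```

Sorting by cumulative time makes it obvious whether the time is in the safetensors deserialization itself or somewhere further down (e.g. in the raw file reads).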
I too have occasionally patched around huggingface's libraries to remove data copying and gotten incredible performance gains. You can share device-loaded models across processes and do other freaky stuff.
Ok, good to know that this isn't normal. The model files are on an NVMe drive, so it's not seek latency. These numbers are also with a warm cache, so I'd bet much of the file is already paged in. My guess is it's something to do with WSL's mmap on NTFS. I'll look into it more; this might just be an issue for me. If anyone else notices slow load times, give a 👍
I implemented something like this now, though probably in a needlessly complicated way. I'm not sure whether it's Windows- or WSL-compatible yet.
Caveat: My system is WSL2, so I don't know if this improves performance for other setups
Model load times seemed slower than they should be, taking minutes even for small models (~7GB). Moving the model files to a solid-state drive did not improve things. But I could run
cat model.safetensors > /dev/null
in about 10 seconds, so the loader must be the bottleneck. It turns out it's the safetensors library; I didn't dig into why it's so slow. But the safetensors format is simple, and it's easy to implement a reader in Python. After making that change, model load times dropped to about 20 seconds, which is much closer to the expected time.
After: [timing screenshot]
Before: [timing screenshot]
Note: TheBloke_Llama-2-13B-chat-GPTQ is 6.8GB
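For reference, a minimal sketch of such a reader. The safetensors format is an 8-byte little-endian header length, a JSON header mapping tensor names to `dtype`/`shape`/`data_offsets` (offsets relative to the start of the data section), then the raw tensor bytes. `read_safetensors` and the `DTYPES` table are names I made up here, and the dtype map only covers a subset:

```python
import json
import struct

import numpy as np

# Subset of safetensors dtype strings -> numpy dtypes; extend as needed.
DTYPES = {
    "F32": np.float32,
    "F16": np.float16,
    "I64": np.int64,
    "I32": np.int32,
    "U8": np.uint8,
}

def read_safetensors(path):
    """Load all tensors from a .safetensors file into numpy arrays."""
    tensors = {}
    with open(path, "rb") as f:
        # First 8 bytes: little-endian uint64 length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
        data_start = 8 + header_len
        for name, info in header.items():
            if name == "__metadata__":  # optional free-form metadata entry
                continue
            begin, end = info["data_offsets"]
            f.seek(data_start + begin)
            buf = f.read(end - begin)
            tensors[name] = np.frombuffer(
                buf, dtype=DTYPES[info["dtype"]]
            ).reshape(info["shape"])
    return tensors
```

This copies each tensor with plain buffered reads instead of mmap, which is presumably why it sidesteps whatever the WSL/NTFS mmap issue is.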