guillaume-be / rust-bert

Rust native ready-to-use NLP pipelines and transformer-based models (BERT, DistilBERT, GPT2,...)
https://docs.rs/crate/rust-bert
Apache License 2.0

Loading multi-file pytorch_model.bin files (eg. Mistral)? #454

Open cicero-ai opened 5 months ago

cicero-ai commented 5 months ago

Love the crate. Curious, any way to get these larger models like Mistral 7X8B into it?

Their pytorch_model.bin weights are split into multiple files, and I'm uncertain how to convert them. I've used the convert_model.py utility a lot on smaller models and it works great, but I'm not sure how to handle a larger model.

I tried concatenating all three "pytorch_model-0000X-of-00003.bin" files, but that just errored out with an "invalid zip file" message. Is there anything that can be done to get these larger models loaded via rust-bert? I'm quite familiar with the repo code now, and am more than happy to put in whatever hours are needed to develop a solution. If you could point me in the right direction, I'm sure I'd be able to figure out the rest.

Thanks in advance.

guillaume-be commented 5 months ago

Hello @cicero-ai ,

I would look in the direction of https://github.com/huggingface/transformers/blob/d1d94d798f1ed5c0b5de9a794381aeb7dc319c12/src/transformers/modeling_utils.py#L4082 to see how the Python libraries do it. It seems the files are indeed not combined into a single file, but rather opened sequentially and loaded into the model one shard at a time (the example given uses safetensors; I am unsure whether you can open the individual shard files you are considering and get a valid weights dictionary).

One approach could be to create a new Resource pointing to a folder plus a regex/file pattern for the archive files. You could then extend the load_weights method to handle this resource, looping through the matching shards and loading each one into the model.
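A rough sketch of the shard-discovery half of that idea, in plain std Rust. Note this is illustrative only: `collect_shards` is a hypothetical helper, not existing rust-bert API, and the loading step in the comments assumes something like tch's `VarStore::load_partial` is available to accumulate tensors across shards.

```rust
/// Hypothetical helper: pick sharded weight files such as
/// "pytorch_model-00001-of-00003.bin" out of a directory listing
/// and return them in shard order.
fn collect_shards(file_names: &[&str]) -> Vec<String> {
    let mut shards: Vec<String> = file_names
        .iter()
        .filter(|n| {
            n.starts_with("pytorch_model-") && n.contains("-of-") && n.ends_with(".bin")
        })
        .map(|n| n.to_string())
        .collect();
    // Zero-padded shard indices sort lexicographically in shard order.
    shards.sort();
    shards
}

fn main() {
    let listing = [
        "config.json",
        "pytorch_model-00002-of-00003.bin",
        "pytorch_model-00001-of-00003.bin",
        "pytorch_model-00003-of-00003.bin",
        "tokenizer.model",
    ];
    let shards = collect_shards(&listing);
    assert_eq!(shards.len(), 3);
    assert_eq!(shards[0], "pytorch_model-00001-of-00003.bin");
    // An extended load_weights could then iterate over `shards`,
    // loading each file's tensors into the model (e.g. via a partial
    // load such as tch's VarStore::load_partial) so that weights
    // accumulate across shards instead of coming from one big file.
    println!("{:?}", shards);
}
```

The sorted-name approach only works because the shard indices are zero-padded; a more robust version would parse the index out of the file name (or read the accompanying index JSON that Hugging Face checkpoints ship).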

cicero-ai commented 5 months ago

Great, that should be all the direction I need and thank you very much for your time. One way or another it's imperative I get this working, so you'll have a PR shortly. Whether or not you want to merge it is up to you.

Actually, while I'm here, one more thing -- word embeddings. Sentence embeddings work great, but I can't find a single word-embedding model that ships a vocab resource, for obvious reasons. I'm sure I can figure this one out myself and will include a PR for it as well, but since I'm here, if you have any directional advice on implementing word embeddings, I'm all ears.

Cheers, Matt