intel / neural-speed

An innovative library for efficient LLM inference via low-bit quantization
https://github.com/intel/neural-speed
Apache License 2.0

developer_document.md need elaboration on determining buffer sizes? #287

Open hpcpony opened 3 months ago

hpcpony commented 3 months ago

In the example for adding to gptneox_mem_req I see that n_layers comes from num_hidden_layers in the config.json file, but where do the 512, 512, and 1024 come from? Maybe a comment in the document would help.

I was looking to extend the existing bloom capability to handle https://huggingface.co/bigscience/bloom but it's not obvious to me how to choose the right scratch sizes from the config.json.

zhentaoyu commented 3 months ago

hi, @hpcpony, sorry for the confusion. model_scratch is a kind of kernel workspace that is used during the model eval process, and the values are rough estimates. As for how to set them when adding a new model, my experience is to look for a reference model with a similar parameter count and start from its values. If you use our Python API, it will automatically enlarge these buffers when it encounters a larger batch size or ctx_size. cc @Zhenzhong1 and @a32543254.