About the number of model load times.

intel / intel-extension-for-pytorch

A Python package for extending the official PyTorch that can easily obtain performance on Intel platform

Apache License 2.0


About the number of model load times. #460

Open gukejun1 opened 8 months ago

gukejun1 commented 8 months ago

Describe the issue

Hi. Currently, each rank loads the complete model data and then shards the tensors. For example, with eight ranks, eight full copies of the model are loaded, so the same data is duplicated in memory. This wastes a large amount of memory and can even cause out-of-memory failures. Is there a plan to change this so that only one copy of the model data is loaded for all ranks?
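To make the concern concrete, here is a rough back-of-the-envelope sketch, not code from intel-extension-for-pytorch. The function names, the 16 GiB model size, and the row-wise split are all hypothetical, chosen only to illustrate how the peak memory grows with the number of ranks when every rank loads a full copy, compared with a scheme in which the ranks together hold only one copy.

```python
def peak_load_memory_gib(model_gib: float, world_size: int,
                         full_copy_per_rank: bool) -> float:
    """Approximate weight memory at the peak of loading, summed over all ranks.

    Simplified on purpose: ignores the transient extra copy of each shard,
    activations, and framework overhead.
    """
    if full_copy_per_rank:
        # Every rank holds the whole model before slicing out its shard.
        return model_gib * world_size
    # Ranks together hold exactly one copy (each reads only its own shard).
    return model_gib


def shard_rows(num_rows: int, rank: int, world_size: int) -> tuple[int, int]:
    """Half-open row range [start, stop) that a rank would own in an even split."""
    base, extra = divmod(num_rows, world_size)
    # The first `extra` ranks take one additional row each.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


if __name__ == "__main__":
    model_gib = 16.0  # hypothetical model size
    world_size = 8    # eight ranks, as in the example above

    print(peak_load_memory_gib(model_gib, world_size, full_copy_per_rank=True))
    print(peak_load_memory_gib(model_gib, world_size, full_copy_per_rank=False))

    # Rows of a hypothetical 4096-row weight matrix that each rank would read.
    for rank in range(world_size):
        print(rank, shard_rows(4096, rank, world_size))
```

Under these assumptions the duplicated load peaks at `world_size` times the model size, while a shard-only load stays at roughly one model size in total, which is the gap the question is about.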

jingxu10 commented 7 months ago

Is this distributed inference with DeepSpeed? Also, are you running on CPU or on GPU?