PurvangL opened 5 days ago
You can try lazy init as done here, and file a PR if it works: https://github.com/hpcaitech/ColossalAI/blob/8e718a1421203e0f5607f477e1a998567c70d123/examples/language/llama/benchmark.py#L245
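For context, the point of lazy init is to record only the metadata of each parameter at construction time and defer the actual allocation until the parallel plugin has decided where each shard lives, so the full model never has to be materialized in CPU RAM on every rank. A minimal pure-Python sketch of the idea (this is an illustration only, not the ColossalAI API; `LazyTensor` is a hypothetical stand-in):

```python
class LazyTensor:
    """Record how to build a tensor instead of allocating it eagerly."""

    def __init__(self, shape, fill=0.0):
        self.shape = shape  # metadata only: no large allocation yet
        self.fill = fill
        self.data = None    # real storage, created on demand

    def materialize(self, shard=None):
        """Allocate real storage, optionally only the local 1/shard slice."""
        rows = self.shape[0] if shard is None else self.shape[0] // shard
        self.data = [[self.fill] * self.shape[1] for _ in range(rows)]
        return self.data


# Building the "model" is cheap: nothing is allocated yet.
params = [LazyTensor((1024, 1024)) for _ in range(4)]
assert all(p.data is None for p in params)

# Each rank then materializes only its own shard (here 1/8 of the rows),
# so peak host memory scales with the shard, not the full model.
local = params[0].materialize(shard=8)
print(len(local))  # 128 rows instead of 1024
```

In ColossalAI the same effect is obtained by constructing the model inside the lazy-init context used in the linked benchmark script, so that `booster.boost` can shard parameters before they are materialized.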
Thanks @Edenzzzz for the suggestion; I will try it. I also have one more question: during evaluation of OPT, the eval loss for a model trained with the hybrid_parallel plugin is 5x larger than with the gemini plugin, and this holds for most of the OPT variants. Do you know why?
Is there an existing issue for this bug?
🐛 Describe the bug
I am trying to reproduce OPT-66B on 16x H100 (2 servers); each server has 1000 GiB of CPU memory. When I run the OPT benchmark, the program crashes with the following error, and observed CPU memory usage reaches 924 GiB. How can I run the OPT-66B benchmark with these resources?
error
Environment
Docker image: nvcr.io/nvidia/pytorch:23.02-py3
transformers: 4.33
colossalai: 0.3.6