Open bnuzhanyu opened 1 year ago
Hi @bnuzhanyu @bimalm Vanilla LLaMA is inference-only. We have reimplemented it to make it suitable for training. We are working on stabilising the distributed training and will keep you updated when a new stable release is available.
I downloaded the llama-7B model, which has MP=1. I modified the config:

actor_config:
  device: "cuda:1,2,5,7"
  model: "llama-7B"
I tried:
torchrun --nproc_per_node=4 artifacts/main.py artifacts/config/config.yaml --type ACTOR
and got: `AssertionError: Loading a checkpoint for MP=1 but world size is 4`. I also tried:
python artifacts/main.py artifacts/config/config.yaml
It seems it uses cuda:0 and runs out of CUDA memory. So, I have two questions:
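For context, the assertion comes from the way LLaMA-style checkpoints are matched to processes: the model is saved as one shard file per model-parallel rank, so the number of launched processes must equal the shard count. A minimal sketch of that check (the `check_mp` helper is illustrative, not the actual library code):

```python
def check_mp(checkpoint_shards: int, world_size: int) -> None:
    # LLaMA-style checkpoints store one file per model-parallel rank,
    # so the number of launched processes must equal the shard count.
    assert checkpoint_shards == world_size, (
        f"Loading a checkpoint for MP={checkpoint_shards} "
        f"but world size is {world_size}"
    )

# llama-7B ships as a single shard (MP=1), so one process loads it fine:
check_mp(checkpoint_shards=1, world_size=1)

# torchrun --nproc_per_node=4 makes world_size=4 and trips the assertion:
try:
    check_mp(checkpoint_shards=1, world_size=4)
except AssertionError as e:
    print(e)  # Loading a checkpoint for MP=1 but world size is 4
```

Assuming the training script does not reshard the checkpoint, launching with `torchrun --nproc_per_node=1` (and pointing it at a single free GPU, e.g. via `CUDA_VISIBLE_DEVICES=1`) would satisfy this check and also avoid defaulting to the busy cuda:0.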