ccozad closed this issue 4 months ago
Maybe supporting configurable backends for torch.distributed is an option? https://pytorch.org/docs/stable/distributed.html
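For illustration, a minimal sketch of what a configurable backend might look like (the `TORCH_DIST_BACKEND` env-var name is an assumption for this sketch, not an existing PyTorch or repo convention):

```python
import os
import torch.distributed as dist

# Hypothetical sketch: choose the backend from an env var instead of
# hard-coding nccl. gloo runs on CPU and Windows; nccl needs NVIDIA
# GPUs on Linux. Assumes RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT
# are already set by a launcher such as torchrun.
backend = os.environ.get("TORCH_DIST_BACKEND", "nccl")
dist.init_process_group(backend=backend)
```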
Hi! The example scripts in this repo are for running inference on single-GPU (for 8B) and multi-GPU (for 70B) setups using CUDA; Windows is not currently supported.
You might want to check out these examples for running Llama locally, without torch.distributed, via Hugging Face or Ollama: https://github.com/meta-llama/llama-recipes/tree/main/recipes/quickstart/Running_Llama2_Anywhere
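For reference, a minimal sketch of local inference through Hugging Face transformers (the model ID and prompt are illustrative assumptions; the gated checkpoint requires accepting the license on the Hub first):

```python
from transformers import pipeline

# Illustrative example: run a Llama 3 instruct checkpoint locally
# without torch.distributed. device_map="auto" places the model on
# a GPU if one is available, otherwise on the CPU.
pipe = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    device_map="auto",
)

result = pipe("Why does nccl not work on Windows?", max_new_tokens=100)
print(result[0]["generated_text"])
```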
@subramen Thank you for the confirmation.
I set up a Linux machine on AWS and got things to run. I put together a guide here: https://github.com/ccozad/ml-reference-designs/blob/master/llm/llama-3/hello-world/README.md
Perhaps in the future, Microsoft, Nvidia, and other vendors will open up more options for putting gaming computers to good use.
@subramen See my comment on #127; I was able to get the model to build on Windows by initializing the gloo backend before calling Llama.build().
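For anyone landing here, a minimal sketch of what that workaround might look like (the rendezvous values and checkpoint paths are placeholders, not taken from #127):

```python
import os
import torch.distributed as dist
from llama import Llama  # the Llama class from this repo

# Llama.build() initializes torch.distributed with nccl if no process
# group exists yet; nccl is unavailable on Windows, so initialize gloo
# first. These single-process rendezvous values are placeholders.
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"
os.environ["RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"

if not dist.is_initialized():
    dist.init_process_group(backend="gloo")

generator = Llama.build(
    ckpt_dir="Meta-Llama-3-8B-Instruct/",  # placeholder path
    tokenizer_path="Meta-Llama-3-8B-Instruct/tokenizer.model",
    max_seq_len=512,
    max_batch_size=4,
)
```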
I'm running from a local Jupyter Notebook on Windows. I'm attempting to port the chat example and I get errors about initializing torch.distributed (RANK not defined, MASTER_ADDR not defined, etc.). I tried following the manual nccl steps outlined here: https://stackoverflow.com/questions/56805951/valueerror-error-initializing-torch-distributed-using-env-rendezvous-enviro, but I just get a loop of failed connection attempts. It looks like the start of Llama.build() is where things are erroring out.
I'll be going through the nccl debug process and will eventually switch to Linux if needed, but first, my questions.
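As a first debugging step, a quick sanity check of what the local PyTorch build actually supports (a sketch; Windows wheels of PyTorch ship without nccl):

```python
import torch
import torch.distributed as dist

# If nccl reports False here, no amount of rendezvous configuration
# will bring up an nccl process group; fall back to gloo or to Linux.
print("CUDA available:", torch.cuda.is_available())
print("nccl available:", dist.is_nccl_available())
print("gloo available:", dist.is_gloo_available())
```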