meta-llama / codellama

Inference code for CodeLlama models

Can't run examples on Windows 10 #55

Open mhamra opened 1 year ago

mhamra commented 1 year ago

Hi, I've tried to run the examples, but I received this error.

(CodeLlama) PS C:\Users\marce\OneDrive\mah-docs\CodeLlama\codellama> python -m torch.distributed.run --nproc_per_node 1 example_infilling.py --ckpt_dir CodeLlama-7b-Python --tokenizer_path ./CodeLlama-7b-Python/tokenizer.model
NOTE: Redirects are currently not supported in Windows or MacOs.
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - unknown error).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - unknown error).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - unknown error).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - unknown error).
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
  File "C:\Users\marce\OneDrive\mah-docs\CodeLlama\codellama\example_infilling.py", line 79, in <module>
    fire.Fire(main)
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\marce\OneDrive\mah-docs\CodeLlama\codellama\example_infilling.py", line 18, in main
    generator = Llama.build(
                ^^^^^^^^^^^^
  File "C:\Users\marce\OneDrive\mah-docs\CodeLlama\codellama\llama\generation.py", line 90, in build
    checkpoint = torch.load(ckpt_path, map_location="cpu")
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\torch\serialization.py", line 815, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\torch\serialization.py", line 1033, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_pickle.UnpicklingError: invalid load key, '<'.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 18284) of binary: C:\ProgramData\anaconda3\envs\CodeLlama\python.exe
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\torch\distributed\run.py", line 798, in <module>
    main()
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\torch\distributed\run.py", line 794, in main
    run(args)
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_infilling.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-28_12:39:51
  host      : DESKTOP-THP4I5R
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 18284)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs
mhamra commented 1 year ago

UPDATE

I made a mistake running the download.sh script: I passed my email instead of the download URL received from Meta (FB). As a result, the file that got saved isn't a real checkpoint, which is why torch.load fails with invalid load key, '<'.
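For anyone who hits the same UnpicklingError: "invalid load key, '<'" usually means the file on disk starts with an HTML tag, i.e. an error page was saved in place of the real weights. A quick way to confirm, sketched with an illustrative checkpoint path:

from pathlib import Path

# Adjust to wherever download.sh put the weights.
ckpt = Path("CodeLlama-7b-Python/consolidated.00.pth")
with open(ckpt, "rb") as f:
    head = f.read(64)
print(head)

# A real checkpoint begins with binary data (e.g. a zip header b'PK...'),
# while a failed download begins with b'<' from an HTML tag.
if head.lstrip().startswith(b"<"):
    print("Not a checkpoint: delete it and re-run download.sh with the correct URL.")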

manoj21192 commented 1 year ago

Did your issue get resolved? I am unable to run it on Windows 10 as well; I am getting a "Distributed package doesn't have NCCL built-in" error.

realhaik commented 1 year ago

@manoj21192 This will work on Windows. The trick is to initialize the process group yourself with the gloo backend before calling Llama.build; Llama.build only calls init_process_group("nccl") when no process group has been initialized yet, so this keeps NCCL out of the picture:

import torch
from llama import Llama

temperature = 0
top_p = 0
max_seq_len = 4096
max_batch_size = 1
max_gen_len = None
num_of_worlds = 1  # world size: a single local process

# gloo works on Windows; the Windows PyTorch builds ship without NCCL.
torch.distributed.init_process_group(
    backend='gloo',
    init_method='tcp://localhost:23455',
    world_size=num_of_worlds,
    rank=0,
)

generator = Llama.build(
    ckpt_dir="C:/AI/LLaMA2_Docker_FileSystem/codellama/CodeLlama-7b-Instruct",
    tokenizer_path="C:/AI/LLaMA2_Docker_FileSystem/codellama/CodeLlama-7b-Instruct/tokenizer.model",
    max_seq_len=max_seq_len,
    max_batch_size=max_batch_size,
    model_parallel_size=num_of_worlds,
)
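With the process group initialized up front like this, the script runs with plain python (no torchrun needed). To actually generate something, here is a minimal continuation, assuming the text_completion API used in example_completion.py:

# Greedy completion of a single prompt (temperature=0 disables sampling).
results = generator.text_completion(
    ["def fibonacci(n):"],
    max_gen_len=max_gen_len,
    temperature=temperature,
    top_p=top_p,
)
print(results[0]["generation"])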
99991 commented 1 year ago

UPDATE

I made a mistake running the download.sh script: I passed my email instead of the download URL received from Meta (FB).

Thank you! I can reproduce this. At first I entered my email, then noticed my error and entered the correct URL when running download.sh, but loading was still not possible.

I cloned the repository again, entered the correct URL on the first try, and then it worked.
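If you are not sure whether a re-download actually replaced the broken files, you can also verify the weights against the checklist.chk that download.sh places next to them (assuming your model folder contains one; the directory name below is illustrative):

import hashlib
from pathlib import Path

model_dir = Path("CodeLlama-7b-Python")

def md5_of(path, chunk_size=1 << 20):
    # Hash the file in chunks so multi-GB checkpoints don't need to fit in memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            h.update(block)
    return h.hexdigest()

# checklist.chk uses the md5sum format: "<md5>  <filename>" per line.
for line in (model_dir / "checklist.chk").read_text().splitlines():
    expected, name = line.split()
    status = "OK" if md5_of(model_dir / name) == expected else "MISMATCH"
    print(name, status)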

bronzwikgk commented 1 year ago

What mistake am I making here?

from typing import Optional

import fire

from llama import Llama


def main(
    ckpt_dir: "D:\pathto\codellama\CodeLlama-7b",
    tokenizer_path: "D:\pathto\codellama\CodeLlama-7b\tokenizer.model",
    temperature: float = 0.2,
    top_p: float = 0.9,
    max_seq_len: int = 256,
    max_batch_size: int = 4,
    max_gen_len: Optional[int] = None,
):
    generator = Llama.build(
        ckpt_dir=ckpt_dir,
        tokenizer_path=tokenizer_path,
        max_seq_len=max_seq_len,
        max_batch_size=max_batch_size,
    )

I am getting this error:

D:\path2\codellama>python example_completion.py
ERROR: The function received no value for the required argument: ckpt_dir
Usage: example_completion.py CKPT_DIR TOKENIZER_PATH
optional flags: --temperature | --top_p | --max_seq_len | --max_batch_size | --max_gen_len

For detailed information on this command, run: example_completion.py --help

realhaik commented 1 year ago

@bronzwikgk

Based on the code and error message you've provided, here are some issues I've identified:

  1. The Windows paths are written as type annotations (string literals), not as default values, so ckpt_dir and tokenizer_path have no defaults and Fire still treats them as required arguments. That is exactly the "received no value for the required argument: ckpt_dir" error you are seeing.
  2. The paths should be properly escaped or written as raw strings, since backslash sequences such as \t are otherwise interpreted as escape characters.

Here's a revised version of the code:

from typing import Optional
import fire
from llama import Llama

def main(
    ckpt_dir: str = r"D:\pathto\codellama\CodeLlama-7b",
    tokenizer_path: str = r"D:\pathto\codellama\CodeLlama-7b\tokenizer.model",
    temperature: float = 0.2,
    top_p: float = 0.9,
    max_seq_len: int = 256,
    max_batch_size: int = 4,
    max_gen_len: Optional[int] = None,
):
    generator = Llama.build(
        ckpt_dir=ckpt_dir,
        tokenizer_path=tokenizer_path,
        max_seq_len=max_seq_len,
        max_batch_size=max_batch_size,
    )

if __name__ == "__main__":
    fire.Fire(main)
What changed:

  1. Fixed the type hints for ckpt_dir and tokenizer_path to be str and moved the paths into the default values.
  2. Used raw string literals for the Windows paths (prefixing the string with an r) so the backslashes are interpreted literally.
  3. Added if __name__ == "__main__": fire.Fire(main) to run the function when the script is executed.

Try running the updated code and see if the error persists.
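With fire.Fire(main) in place, the defaults can also be overridden from the command line; the flags correspond to the keyword arguments (the paths are the same placeholders as above):

python example_completion.py --ckpt_dir "D:\pathto\codellama\CodeLlama-7b" --tokenizer_path "D:\pathto\codellama\CodeLlama-7b\tokenizer.model" --max_seq_len 256 --max_batch_size 4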

bronzwikgk commented 1 year ago

Thanks, moved one step ahead. Getting this error now:

Traceback (most recent call last):
  File "D:\shunyadotek\codellama\example_completion.py", line 55, in <module>
    fire.Fire(main)
  File "C:\Users\shunya-desk-01\AppData\Roaming\Python\Python311\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "C:\Users\shunya-desk-01\AppData\Roaming\Python\Python311\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "C:\Users\shunya-desk-01\AppData\Roaming\Python\Python311\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "D:\shunyadotek\codellama\example_completion.py", line 20, in main
    generator = Llama.build(
  File "D:\shunyadotek\codellama\llama\generation.py", line 68, in build
    torch.distributed.init_process_group("nccl")
  File "C:\Users\shunya-desk-01\AppData\Roaming\Python\Python311\site-packages\torch\distributed\distributed_c10d.py", line 900, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "C:\Users\shunya-desk-01\AppData\Roaming\Python\Python311\site-packages\torch\distributed\rendezvous.py", line 235, in _env_rendezvous_handler
    rank = int(_get_env_or_raise("RANK"))
  File "C:\Users\shunya-desk-01\AppData\Roaming\Python\Python311\site-packages\torch\distributed\rendezvous.py", line 220, in _get_env_or_raise
    raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set

realhaik commented 1 year ago

torch.distributed.init_process_group(backend='gloo', init_method='tcp://localhost:23455', world_size=num_of_worlds, rank=0)

@bronzwikgk I don't see this line anywhere in your code.

Are you sure you have it? See my answer a few comments above for the full code that includes this line.

realhaik commented 1 year ago

@bronzwikgk Right, I see that you are using torch.distributed.init_process_group("nccl"). NCCL is Linux-only (the Windows builds of PyTorch ship without it), so use the gloo backend as in my example above.
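If you would rather keep the torchrun workflow from the README, another option is a small local patch to the init call that your traceback points at in llama/generation.py, falling back to gloo when NCCL is unavailable. This is only a sketch of an unofficial edit, not something the repo provides:

import torch

# Inside Llama.build in llama/generation.py, instead of the hard-coded
# torch.distributed.init_process_group("nccl"):
if not torch.distributed.is_initialized():
    backend = "nccl" if torch.distributed.is_nccl_available() else "gloo"
    torch.distributed.init_process_group(backend)

With that change you still launch via python -m torch.distributed.run --nproc_per_node 1 ... as in the first post, so RANK and WORLD_SIZE are set for the env:// rendezvous. The backend only affects process-group communication; the model itself still loads onto the GPU as usual.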