meta-llama / codellama

Inference code for CodeLlama models

Can't run examples on Windows 10 #55

Open mhamra opened 1 year ago

mhamra commented 1 year ago

Hi, I've tried to run the examples, but I received this error.

(CodeLlama) PS C:\Users\marce\OneDrive\mah-docs\CodeLlama\codellama> python -m torch.distributed.run --nproc_per_node 1 example_infilling.py --ckpt_dir CodeLlama-7b-Python --tokenizer_path ./CodeLlama-7b-Python/tokenizer.model
NOTE: Redirects are currently not supported in Windows or MacOs.
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - unknown error).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - unknown error).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - unknown error).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - unknown error).
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
  File "C:\Users\marce\OneDrive\mah-docs\CodeLlama\codellama\example_infilling.py", line 79, in <module>
    fire.Fire(main)
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\marce\OneDrive\mah-docs\CodeLlama\codellama\example_infilling.py", line 18, in main
    generator = Llama.build(
                ^^^^^^^^^^^^
  File "C:\Users\marce\OneDrive\mah-docs\CodeLlama\codellama\llama\generation.py", line 90, in build
    checkpoint = torch.load(ckpt_path, map_location="cpu")
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\torch\serialization.py", line 815, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\torch\serialization.py", line 1033, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_pickle.UnpicklingError: invalid load key, '<'.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 18284) of binary: C:\ProgramData\anaconda3\envs\CodeLlama\python.exe
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\torch\distributed\run.py", line 798, in <module>
    main()
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\torch\distributed\run.py", line 794, in main
    run(args)
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_infilling.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-28_12:39:51
  host      : DESKTOP-THP4I5R
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 18284)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs
mhamra commented 1 year ago

UPDATE

I made a mistake running the download.sh script: I passed my email instead of the download URL received from Meta (FB). As a result, the file that got saved isn't a real checkpoint, which is why torch.load fails with invalid load key, '<'.
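For anyone who hits the same UnpicklingError: "invalid load key, '<'" usually means the file on disk starts with an HTML tag, i.e. an error page was saved in place of the real weights. A quick way to confirm, sketched with an illustrative checkpoint path:

from pathlib import Path

# Adjust to wherever download.sh put the weights.
ckpt = Path("CodeLlama-7b-Python/consolidated.00.pth")
with open(ckpt, "rb") as f:
    head = f.read(64)
print(head)

# A real checkpoint begins with binary data (e.g. a zip header b'PK...'),
# while a failed download begins with b'<' from an HTML tag.
if head.lstrip().startswith(b"<"):
    print("Not a checkpoint: delete it and re-run download.sh with the correct URL.")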

manoj21192 commented 1 year ago

Did your issue get resolved? I am unable to run it on Windows 10 as well; I am getting a "Distributed package doesn't have NCCL built-in" error.

realhaik commented 1 year ago

@manoj21192 This will work on Windows. The trick is to initialize the process group yourself with the gloo backend before calling Llama.build; Llama.build only calls init_process_group("nccl") when no process group has been initialized yet, so this keeps NCCL out of the picture:

import torch
from llama import Llama

temperature = 0
top_p = 0
max_seq_len = 4096
max_batch_size = 1
max_gen_len = None
num_of_worlds = 1  # world size: a single local process

# gloo works on Windows; the Windows PyTorch builds ship without NCCL.
torch.distributed.init_process_group(
    backend='gloo',
    init_method='tcp://localhost:23455',
    world_size=num_of_worlds,
    rank=0,
)

generator = Llama.build(
    ckpt_dir="C:/AI/LLaMA2_Docker_FileSystem/codellama/CodeLlama-7b-Instruct",
    tokenizer_path="C:/AI/LLaMA2_Docker_FileSystem/codellama/CodeLlama-7b-Instruct/tokenizer.model",
    max_seq_len=max_seq_len,
    max_batch_size=max_batch_size,
    model_parallel_size=num_of_worlds,
)
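With the process group initialized up front like this, the script runs with plain python (no torchrun needed). To actually generate something, here is a minimal continuation, assuming the text_completion API used in example_completion.py:

# Greedy completion of a single prompt (temperature=0 disables sampling).
results = generator.text_completion(
    ["def fibonacci(n):"],
    max_gen_len=max_gen_len,
    temperature=temperature,
    top_p=top_p,
)
print(results[0]["generation"])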
99991 commented 1 year ago

UPDATE

I made a mistake running the download.sh script: I passed my email instead of the download URL received from Meta (FB).

Thank you! I can reproduce this. At first I entered my email, then noticed my error and entered the correct URL when running download.sh, but loading was still not possible.

I cloned the repository again, entered the correct URL on the first try, and then it worked.
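If you are not sure whether a re-download actually replaced the broken files, you can also verify the weights against the checklist.chk that download.sh places next to them (assuming your model folder contains one; the directory name below is illustrative):

import hashlib
from pathlib import Path

model_dir = Path("CodeLlama-7b-Python")

def md5_of(path, chunk_size=1 << 20):
    # Hash the file in chunks so multi-GB checkpoints don't need to fit in memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            h.update(block)
    return h.hexdigest()

# checklist.chk uses the md5sum format: "<md5>  <filename>" per line.
for line in (model_dir / "checklist.chk").read_text().splitlines():
    expected, name = line.split()
    status = "OK" if md5_of(model_dir / name) == expected else "MISMATCH"
    print(name, status)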

bronzwikgk commented 1 year ago

What mistake am I making here?

from typing import Optional

import fire

from llama import Llama


def main(
    ckpt_dir: "D:\pathto\codellama\CodeLlama-7b",
    tokenizer_path: "D:\pathto\codellama\CodeLlama-7b\tokenizer.model",
    temperature: float = 0.2,
    top_p: float = 0.9,
    max_seq_len: int = 256,
    max_batch_size: int = 4,
    max_gen_len: Optional[int] = None,
):
    generator = Llama.build(
        ckpt_dir=ckpt_dir,
        tokenizer_path=tokenizer_path,
        max_seq_len=max_seq_len,
        max_batch_size=max_batch_size,
    )

I am getting this error:

D:\path2\codellama>python example_completion.py
ERROR: The function received no value for the required argument: ckpt_dir
Usage: example_completion.py CKPT_DIR TOKENIZER_PATH
optional flags: --temperature | --top_p | --max_seq_len | --max_batch_size | --max_gen_len

For detailed information on this command, run: example_completion.py --help

realhaik commented 1 year ago

@bronzwikgk

Based on the code and error message you've provided, here are some issues I've identified:

  1. The Windows paths are written as type annotations (string literals), not as default values, so ckpt_dir and tokenizer_path have no defaults and Fire still treats them as required arguments. That is exactly the "received no value for the required argument: ckpt_dir" error you are seeing.
  2. The paths should be properly escaped or written as raw strings, since backslash sequences such as \t are otherwise interpreted as escape characters.

Here's a revised version of the code:

from typing import Optional
import fire
from llama import Llama

def main(
    ckpt_dir: str = r"D:\pathto\codellama\CodeLlama-7b",
    tokenizer_path: str = r"D:\pathto\codellama\CodeLlama-7b\tokenizer.model",
    temperature: float = 0.2,
    top_p: float = 0.9,
    max_seq_len: int = 256,
    max_batch_size: int = 4,
    max_gen_len: Optional[int] = None,
):
    generator = Llama.build(
        ckpt_dir=ckpt_dir,
        tokenizer_path=tokenizer_path,
        max_seq_len=max_seq_len,
        max_batch_size=max_batch_size,
    )

if __name__ == "__main__":
    fire.Fire(main)
What changed:

  1. Fixed the type hints for ckpt_dir and tokenizer_path to be str and moved the paths into the default values.
  2. Used raw string literals for the Windows paths (prefixing the string with an r) so the backslashes are interpreted literally.
  3. Added if __name__ == "__main__": fire.Fire(main) to run the function when the script is executed.

Try running the updated code and see if the error persists.
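With fire.Fire(main) in place, the defaults can also be overridden from the command line; the flags correspond to the keyword arguments (the paths are the same placeholders as above):

python example_completion.py --ckpt_dir "D:\pathto\codellama\CodeLlama-7b" --tokenizer_path "D:\pathto\codellama\CodeLlama-7b\tokenizer.model" --max_seq_len 256 --max_batch_size 4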

bronzwikgk commented 1 year ago

Thanks, moved one step ahead. Getting this error now:

Traceback (most recent call last):
  File "D:\shunyadotek\codellama\example_completion.py", line 55, in <module>
    fire.Fire(main)
  File "C:\Users\shunya-desk-01\AppData\Roaming\Python\Python311\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "C:\Users\shunya-desk-01\AppData\Roaming\Python\Python311\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "C:\Users\shunya-desk-01\AppData\Roaming\Python\Python311\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "D:\shunyadotek\codellama\example_completion.py", line 20, in main
    generator = Llama.build(
  File "D:\shunyadotek\codellama\llama\generation.py", line 68, in build
    torch.distributed.init_process_group("nccl")
  File "C:\Users\shunya-desk-01\AppData\Roaming\Python\Python311\site-packages\torch\distributed\distributed_c10d.py", line 900, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "C:\Users\shunya-desk-01\AppData\Roaming\Python\Python311\site-packages\torch\distributed\rendezvous.py", line 235, in _env_rendezvous_handler
    rank = int(_get_env_or_raise("RANK"))
  File "C:\Users\shunya-desk-01\AppData\Roaming\Python\Python311\site-packages\torch\distributed\rendezvous.py", line 220, in _get_env_or_raise
    raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set

realhaik commented 1 year ago

torch.distributed.init_process_group(backend='gloo', init_method='tcp://localhost:23455', world_size=num_of_worlds, rank=0)

@bronzwikgk I don't see this line anywhere in your code.

Are you sure you have it? See my answer a few comments above for the full code that includes this line.

realhaik commented 1 year ago

@bronzwikgk Right, I see that you are using torch.distributed.init_process_group("nccl"). NCCL is Linux-only (the Windows builds of PyTorch ship without it), so use the gloo backend as in my example above.
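If you would rather keep the torchrun workflow from the README, another option is a small local patch to the init call that your traceback points at in llama/generation.py, falling back to gloo when NCCL is unavailable. This is only a sketch of an unofficial edit, not something the repo provides:

import torch

# Inside Llama.build in llama/generation.py, instead of the hard-coded
# torch.distributed.init_process_group("nccl"):
if not torch.distributed.is_initialized():
    backend = "nccl" if torch.distributed.is_nccl_available() else "gloo"
    torch.distributed.init_process_group(backend)

With that change you still launch via python -m torch.distributed.run --nproc_per_node 1 ... as in the first post, so RANK and WORLD_SIZE are set for the env:// rendezvous. The backend only affects process-group communication; the model itself still loads onto the GPU as usual.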