meta-llama / codellama

Inference code for CodeLlama models

Address family not supported by protocol Error #215

Open mehulparmariitr opened 8 months ago

mehulparmariitr commented 8 months ago

On running the samples I am getting the error below. I want to generate code context/documentation in plain language for a given piece of Java code. For that purpose, is CodeLlama or Llama the better choice?

(myenv) [10:52]:[mehparmar@py029:codellama-main]$ torchrun --nproc_per_node 1 example_infilling.py \
>     --ckpt_dir CodeLlama-7b/ \
>     --tokenizer_path CodeLlama-7b/tokenizer.model \
>     --max_seq_len 192 --max_batch_size 4
[W socket.cpp:464] [c10d] The server socket cannot be initialized on [::]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
  File "example_infilling.py", line 79, in <module>
    fire.Fire(main)
  File "/home/mehparmar/.conda/envs/myenv/lib/python3.8/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/mehparmar/.conda/envs/myenv/lib/python3.8/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/mehparmar/.conda/envs/myenv/lib/python3.8/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "example_infilling.py", line 18, in main
    generator = Llama.build(
  File "/vol/etl_jupyterdata1/home/github/public/Sreeramm/codellama-main/llama/generation.py", line 97, in build
    assert len(checkpoints) > 0, f"no checkpoint files found in {ckpt_dir}"
AssertionError: no checkpoint files found in CodeLlama-7b/
[2024-03-16 10:54:20,433] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 75378) of binary: /home/mehparmar/.conda/envs/myenv/bin/python
Traceback (most recent call last):
  File "/home/mehparmar/.conda/envs/myenv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/mehparmar/.conda/envs/myenv/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/mehparmar/.conda/envs/myenv/lib/python3.8/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/mehparmar/.conda/envs/myenv/lib/python3.8/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/mehparmar/.conda/envs/myenv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/mehparmar/.conda/envs/myenv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
example_infilling.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-16_10:54:20
  host      : py029.lvs.abc.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 75378)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
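
A note on what this log shows: the errno 97 warnings appear because torchrun first tries an IPv6 socket on [::]:29500; on an IPv4-only host they are harmless, and the log confirms rendezvous still succeeded ("initializing model parallel with size 1"). The actual failure is the AssertionError: Llama.build() in llama/generation.py globs the checkpoint directory for *.pth shards and asserts that at least one exists, which fails because CodeLlama-7b/ contains no downloaded weights. Below is a minimal pre-flight check (a sketch, assuming the paths from the command above; download.sh refers to the script shipped with this repo):

# Sketch: verify the checkpoint directory before launching torchrun.
# Mirrors the check that fails in llama/generation.py (Llama.build
# globs ckpt_dir for *.pth and asserts at least one shard exists).
from pathlib import Path

ckpt_dir = Path("CodeLlama-7b")  # same value as --ckpt_dir above (assumed)

checkpoints = sorted(ckpt_dir.glob("*.pth"))
print("checkpoint shards:", [p.name for p in checkpoints] or "none")

# Llama.build also reads params.json, and the tokenizer is loaded
# from --tokenizer_path, so check both alongside the shards.
for name in ("params.json", "tokenizer.model"):
    status = "ok" if (ckpt_dir / name).exists() else "MISSING"
    print(f"{ckpt_dir / name}: {status}")

if not checkpoints:
    raise SystemExit(
        f"no checkpoint files found in {ckpt_dir}/ -- run the repo's "
        "download.sh first, or re-check the path passed to --ckpt_dir"
    )

If this script reports no shards, the fix is to download the weights (or point --ckpt_dir at the directory where they were actually saved), not to change anything about the socket warnings.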
KC888-cpu commented 2 days ago

Hi, I'm facing this error too. Did you manage to solve it? Could you please share some ideas? Thank you so much!