khanrc / honeybee

Official implementation of project Honeybee (CVPR 2024)

error executing Evaluation #7

Closed caramel678 closed 10 months ago

caramel678 commented 10 months ago

(bee) D:\honeybee-main>torchrun --nproc_per_node=auto --standalone eval_tasks.py --ckpt_path checkpoints/13B-C-Abs-M576/last --config configs/tasks/sqa.yaml
NOTE: Redirects are currently not supported in Windows or MacOs.
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
Namespace(ckpt_path='checkpoints/13B-C-Abs-M576/last', result_dir='eval_results/', config=['configs/tasks/sqa.yaml'], load_results=False, dump_submission_file=False, batch_size=None)
INFO 01/04 15:19:14 | Init (load model, tokenizer, processor) ...
Traceback (most recent call last):
  File "D:\honeybee-main\eval_tasks.py", line 152, in <module>
    model, tokenizer, processor = init(args.ckpt_path, args.load_results)
  File "D:\honeybee-main\eval_tasks.py", line 60, in init
    model, tokenizer, processor = get_model(ckpt_path)
  File "D:\honeybee-main\pipeline\interface.py", line 74, in get_model
    model = load_model(pretrained_ckpt, use_bf16, load_in_8bit)
  File "D:\honeybee-main\pipeline\interface.py", line 53, in load_model
    model = HoneybeeForConditionalGeneration.from_pretrained(
  File "D:\anaconda\envs\bee\lib\site-packages\transformers\modeling_utils.py", line 2305, in from_pretrained
    config, model_kwargs = cls.config_class.from_pretrained(
  File "D:\anaconda\envs\bee\lib\site-packages\transformers\configuration_utils.py", line 547, in from_pretrained
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "D:\anaconda\envs\bee\lib\site-packages\transformers\configuration_utils.py", line 574, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "D:\anaconda\envs\bee\lib\site-packages\transformers\configuration_utils.py", line 629, in _get_config_dict
    resolved_config_file = cached_file(
  File "D:\anaconda\envs\bee\lib\site-packages\transformers\utils\hub.py", line 388, in cached_file
    raise EnvironmentError(
OSError: checkpoints/13B-C-Abs-M576/last does not appear to have a file named config.json. Checkout 'https://huggingface.co/checkpoints/13B-C-Abs-M576/last/None' for available files.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 6364) of binary: D:\anaconda\envs\bee\python.exe
Traceback (most recent call last):
  File "D:\anaconda\envs\bee\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\anaconda\envs\bee\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\anaconda\envs\bee\Scripts\torchrun.exe\__main__.py", line 7, in <module>
  File "D:\anaconda\envs\bee\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "D:\anaconda\envs\bee\lib\site-packages\torch\distributed\run.py", line 794, in main
    run(args)
  File "D:\anaconda\envs\bee\lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "D:\anaconda\envs\bee\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "D:\anaconda\envs\bee\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

eval_tasks.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-01-04_15:19:16
  host      : SK-20230830XPTL
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 6364)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

I downloaded the checkpoint, but the only file inside it is `pytorch_model.bin`, and I get the error above when running the evaluation. How can I resolve this issue?

khanrc commented 10 months ago

We found that the 13B-C-Abs-M576 checkpoint was broken, and we fixed it on 1/5. Please download it again.
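
After re-downloading, it may help to confirm that the checkpoint directory is complete before launching `torchrun`. The following is a minimal sketch, not part of the Honeybee codebase: the helper name `missing_checkpoint_files` is hypothetical, and it assumes only that `config.json` (the file named in the `OSError` above) must be present in the directory passed to `--ckpt_path`.

```python
from pathlib import Path


def missing_checkpoint_files(ckpt_dir, required=("config.json",)):
    """Return the names in `required` that are absent from `ckpt_dir`.

    Hypothetical helper. The default list holds only config.json, the file
    named in the error message; extend it if your checkpoint needs more files.
    """
    root = Path(ckpt_dir)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    # Path taken from the command in the report above.
    missing = missing_checkpoint_files("checkpoints/13B-C-Abs-M576/last")
    if missing:
        print("Checkpoint incomplete, missing:", ", ".join(missing))
    else:
        print("Checkpoint directory contains the required files.")
```

If the check reports `config.json` as missing, the download is probably still the old, incomplete archive and should be fetched again.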