tatsu-lab / stanford_alpaca

Code and documentation to train Stanford's Alpaca models, and generate the data.
https://crfm.stanford.edu/2023/03/13/alpaca.html
Apache License 2.0

weight_diff AssertionError: Naive integrity check failed. This could imply that some of the checkpoint files are corrupted. #256

Open · abdoelsayed2016 opened this issue 1 year ago

abdoelsayed2016 commented 1 year ago
Traceback (most recent call last):
  File "/gpfs/gpfs1/scratch/c7031420/stanford_alpaca/weight_diff.py", line 158, in <module>
    fire.Fire(main)
  File "/.local/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/.local/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/.local/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/stanford_alpaca/weight_diff.py", line 154, in main
    globals()[task](**kwargs)
  File "/.conda/envs/llama_2/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/stanford_alpaca/weight_diff.py", line 130, in recover
    assert torch.allclose(
AssertionError: Naive integrity check failed. This could imply that some of the checkpoint files are corrupted.

python weight_diff.py recover --path_raw './PR_7B' --path_diff './output' --path_tuned './recover'

shiyanlou-015555 commented 1 year ago

I have the same problem.

shiyanlou-015555 commented 1 year ago

Is this check really necessary? I feel that as long as we make sure the weights we download are the Hugging Face ones, there should be no problem.

omamaatautolabs commented 8 months ago

I am facing the same issue. Is there any solution to this? I downloaded the llama-2-7b-hf weights from Hugging Face and the Alpaca weight diff (wdiff-7b-alpaca) from Hugging Face as well, and the code exited with the aforementioned error...
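
Note: the published Alpaca weight diff was computed against the original LLaMA-7B checkpoint, not Llama-2-7B, so applying it on top of llama-2-7b-hf can fail this checksum even when none of the downloaded files are corrupted. A quick way to see which base model a local directory actually contains is to inspect its saved config. A minimal sketch, assuming ./PR_7B is the base-model path from the report above (the LLaMA-1 7B Hugging Face config reports max_position_embeddings = 2048, while Llama-2 7B reports 4096):

# Minimal sketch: distinguish a LLaMA-1 base checkpoint from a Llama-2 one
# by reading the saved config; no weights are loaded.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("./PR_7B")  # base-model path from the report above
print(config.model_type, config.max_position_embeddings)  # 2048 -> LLaMA-1, 4096 -> Llama-2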

woody8657 commented 8 months ago

Same issue as above...

omamaatautolabs commented 8 months ago

@woody8657, one possible workaround I came across while skimming through weight_diff.py in github.com/tatsu-lab/stanford_alpaca is to set the boolean default of check_integrity_naively (line 77) to False. That way, the check below, which starts at line 127,

if check_integrity_naively:
    # This is not a rigorous, cryptographically strong integrity check :)
    allsum = sum(state_dict_recovered[key].sum() for key in state_dict_recovered)
    assert torch.allclose(
        allsum, torch.full_like(allsum, fill_value=50637.1836), atol=1e-2, rtol=0
    ), "Naive integrity check failed. This could imply that some of the checkpoint files are corrupted."

does not execute, and the weights are recovered successfully.
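
If you do skip the assertion, you can still recompute the same "naive" checksum afterwards to see how far the recovered weights are from the expected value. A minimal sketch, assuming the recovered model was saved in Hugging Face format at ./recover (the --path_tuned used in the original report) and fits in CPU memory:

import torch
import transformers

# Load the recovered checkpoint; adjust the path and dtype to your setup.
model = transformers.AutoModelForCausalLM.from_pretrained(
    "./recover", torch_dtype=torch.float32
)
state_dict = model.state_dict()

# Same quantity the script asserts on: the sum of all parameter values.
allsum = sum(state_dict[key].sum() for key in state_dict)
print(f"parameter sum: {allsum.item():.4f} (script expects ~50637.1836)")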

Ki-Seki commented 5 months ago

Same problem

Ki-Seki commented 5 months ago

I bypassed the integrity check without modifying the source code by using the CLI flag --nocheck_integrity_naively. Simply run the command as follows:

python weight_diff.py recover --nocheck_integrity_naively --path_raw <path_to_step_1_dir> --path_diff <path_to_step_2_dir> --path_tuned <path_to_store_recovered_weights>
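
The same thing can be done from Python by calling recover() with check_integrity_naively=False. A minimal sketch, assuming it is run from the stanford_alpaca repository root so weight_diff.py is importable, reusing the paths from the original report, and leaving recover()'s other arguments at their defaults:

from weight_diff import recover

recover(
    path_raw="./PR_7B",             # base LLaMA weights in Hugging Face format
    path_diff="./output",           # downloaded Alpaca weight-diff files
    path_tuned="./recover",         # where the reconstructed weights are written
    check_integrity_naively=False,  # skip the naive checksum assertion
)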

baotruyenthach commented 1 month ago

Thank you so much, @Ki-Seki. Your solution works!