huggingface / huggingface_hub

The official Python client for the Hugging Face Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0

StableDiffusionImg2ImgPipeline OSError: Consistency check failed #1549

Open · emreaniloguz opened this issue 1 year ago

emreaniloguz commented 1 year ago

Describe the bug

I'm trying to run the DreamPose repository. When I finished fine-tuning the UNet, the code saved the fine-tuned network with this code snippet:

            if accelerator.is_main_process and global_step % 500 == 0:
                pipeline = StableDiffusionImg2ImgPipeline.from_pretrained(
                    args.pretrained_model_name_or_path,
                    #adapter=accelerator.unwrap_model(adapter),
                    unet=accelerator.unwrap_model(unet),
                    tokenizer=tokenizer,
                    image_encoder=accelerator.unwrap_model(clip_encoder),
                    clip_processor=accelerator.unwrap_model(clip_processor),
                    revision=args.revision,
                )
                pipeline.save_pretrained(os.path.join(args.output_dir, f'checkpoint-{epoch}'))
                model_path = args.output_dir+f'/unet_epoch_{epoch}.pth'
                torch.save(unet.state_dict(), model_path)
                adapter_path = args.output_dir+f'/adapter_{epoch}.pth'
                torch.save(adapter.state_dict(), adapter_path)

It failed due to: OSError: Consistency check failed: file should be of size 1215981833 but has size 492265879 (model.safetensors). (You can find the full output in the Logs section.)
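
For completeness, the retry that the error message suggests corresponds roughly to the following at the huggingface_hub level (a sketch; in the DreamPose script this download happens indirectly through diffusers' `from_pretrained`):

```python
from huggingface_hub import snapshot_download

# Re-fetch the whole repo from scratch, ignoring any partially downloaded files.
snapshot_download(
    "CompVis/stable-diffusion-v1-4",
    force_download=True,
    resume_download=False,
)
```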

Logs

Fetching 14 files:   0%|                                 | 0/14 [00:00<?, ?it/s]Force download:  True
Force download:  True
Fetching 14 files:  21%|█████▎                   | 3/14 [00:06<00:23,  2.11s/it]
Traceback (most recent call last):
  File "finetune-unet.py", line 458, in <module>92M/492M [00:05<00:00, 85.9MB/s]
    main(args)
  File "finetune-unet.py", line 438, in main
    pipeline = StableDiffusionImg2ImgPipeline.from_pretrained(
  File "***/anaconda3/envs/***/lib/python3.8/site-packages/diffusers/pipelines/pipeline_utils.py", line 908, in from_pretrained
    cached_folder = cls.download(
  File "***/anaconda3/envs/***/lib/python3.8/site-packages/diffusers/pipelines/pipeline_utils.py", line 1349, in download
    cached_folder = snapshot_download(
  File "***/anaconda3/envs/***/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "***/anaconda3/envs/***/lib/python3.8/site-packages/huggingface_hub/_snapshot_download.py", line 235, in snapshot_download
    thread_map(
  File "***/anaconda3/envs/***/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 69, in thread_map
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
  File "***/anaconda3/envs/***/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
  File "***/anaconda3/envs/***/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "***/anaconda3/envs/***/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
    yield fs.pop().result()
  File "***/anaconda3/envs/***/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "***/anaconda3/envs/***/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "***/anaconda3/envs/***/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "***/anaconda3/envs/***/lib/python3.8/site-packages/huggingface_hub/_snapshot_download.py", line 211, in _inner_hf_hub_download
    return hf_hub_download(
  File "***/anaconda3/envs/***/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "***/anaconda3/envs/***/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 1365, in hf_hub_download
    http_get(
  File "***/anaconda3/envs/***/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 547, in http_get
    raise EnvironmentError(
OSError: Consistency check failed: file should be of size 1215981833 but has size 492265879 (model.safetensors).
We are sorry for the inconvenience. Please retry download and pass `force_download=True, resume_download=False` as argument.
If the issue persists, please let us know by opening an issue on https://github.com/huggingface/huggingface_hub.
Downloading model.safetensors: 100%|█████████| 492M/492M [00:05<00:00, 83.3MB/s]
Steps: 100%|██████████████| 500/500 [06:10<00:00,  1.35it/s, loss=0.95, lr=1e-5]
Traceback (most recent call last):
  File "***/anaconda3/envs/***/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "***/anaconda3/envs/***/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "***/anaconda3/envs/***/lib/python3.8/site-packages/accelerate/commands/launch.py", line 941, in launch_command
    simple_launcher(args)
  File "***/anaconda3/envs/***/lib/python3.8/site-packages/accelerate/commands/launch.py", line 603, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['***/anaconda3/envs/***/bin/python', 'finetune-unet.py', '--pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4', '--instance_data_dir=demo/sample_emre/train', '--output_dir=demo/custom-chkpts_default', '--resolution=512', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--learning_rate=1e-5', '--num_train_epochs=500', '--dropout_rate=0.0', '--custom_chkpt=checkpoints/unet_epoch_20.pth', '--revision', 'ebb811dd71cdc38a204ecbdd6ac5d580f529fd8c', '--use_8bit_adam']' returned non-zero exit status 1.

System info

- huggingface_hub version: 0.15.1
- Platform: Linux-5.4.0-150-generic-x86_64-with-glibc2.17
- Python version: 3.8.16
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: ***.cache/huggingface/token
- Has saved token ?: False
- Configured git credential helpers: 
- FastAI: N/A
- Tensorflow: N/A
- Torch: 1.13.1+cu116
- Jinja2: N/A
- Graphviz: N/A
- Pydot: N/A
- Pillow: 10.0.0
- hf_transfer: N/A
- gradio: N/A
- numpy: 1.24.4
- ENDPOINT: https://huggingface.co
- HUGGINGFACE_HUB_CACHE: ***.cache/huggingface/hub
- HUGGINGFACE_ASSETS_CACHE: ***.cache/huggingface/assets
- HF_TOKEN_PATH: ***.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
Wauplin commented 1 year ago

Hi @emreaniloguz, thanks for reporting the issue. Can you provide the URL of the repo of the fine-tuned model on the Hub, please? I would like to investigate it myself. If you can't make the model public for privacy reasons, would it be possible to create an org, add the model to it (as private), and add my account to the org so that I can have access to it? Also, for completeness, can you paste the full code you use to instantiate the model? Thank you in advance.

emreaniloguz commented 1 year ago

Hi @Wauplin, I didn't exactly understand what you meant by "Can you provide the url of the repo of the finetuned model on the Hub please". If I understand correctly, you want me to share my final fine-tuned model, but there isn't one because of the error. You can access the pre-trained model's Hub URL from here. Please correct me if I'm missing something.

Wauplin commented 1 year ago

Oh ok, I misunderstood the original issue then. So basically you try to download weights from https://huggingface.co/CompVis/stable-diffusion-v1-4 and you get this error? Just to be sure, could you:

  1. Delete the cached repo: run `huggingface-cli delete-cache` and select `"Model CompVis/stable-diffusion-v1-4"`. For a better CLI UI, it's best to install `huggingface_hub[cli]` first. (A programmatic alternative is sketched below.)
  2. Upgrade deps with `pip install huggingface_hub==0.16.4`. We released a fix last week in the HTTP session we use. I doubt it will fix your issue but it's worth trying.
  3. Retry the download.

I'm sorry in advance if you have a limited connection, but this should rule out some possible causes of your bug and I'd like to try it before investigating further.
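
(For reference, the cache deletion in step 1 can also be done programmatically; a minimal sketch using the `scan_cache_dir` API, targeting the CompVis/stable-diffusion-v1-4 repo:)

```python
from huggingface_hub import scan_cache_dir

cache_info = scan_cache_dir()

# Collect every cached revision of the CompVis/stable-diffusion-v1-4 repo.
revisions = [
    rev.commit_hash
    for repo in cache_info.repos
    if repo.repo_id == "CompVis/stable-diffusion-v1-4"
    for rev in repo.revisions
]

# Preview how much space will be freed, then delete.
delete_strategy = cache_info.delete_revisions(*revisions)
print(f"Will free {delete_strategy.expected_freed_size_str}.")
delete_strategy.execute()
```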

Wauplin commented 1 year ago

Wow, actually the issue is very intriguing :exploding_head: It seems that for some reason the safety_checker/model.safetensors and the text_encoder/model.safetensors files have been mixed up.

Here are the actual sizes of the files on S3:

➜  ~ curl --head https://huggingface.co/CompVis/stable-diffusion-v1-4/resolve/main/safety_checker/model.safetensors | grep size
x-linked-size: 1215981830
➜  ~ curl --head https://huggingface.co/CompVis/stable-diffusion-v1-4/resolve/main/text_encoder/model.safetensors | grep size
x-linked-size: 492265879

Given the error message you got (OSError: Consistency check failed: file should be of size 1215981833 but has size 492265879 (model.safetensors).), this cannot be a coincidence.
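
(If you want to double-check on your side without curl, the same size check can be done from Python with `get_hf_file_metadata`, roughly:)

```python
from huggingface_hub import get_hf_file_metadata, hf_hub_url

for subfolder in ("safety_checker", "text_encoder"):
    url = hf_hub_url(
        "CompVis/stable-diffusion-v1-4",
        "model.safetensors",
        subfolder=subfolder,
    )
    # Prints the file size advertised by the Hub for each model.safetensors.
    print(subfolder, get_hf_file_metadata(url).size)
```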

emreaniloguz commented 1 year ago

> Oh ok, I misunderstood the original issue then. So basically you try to download weights from https://huggingface.co/CompVis/stable-diffusion-v1-4 and you get this error? Just to be sure, could you:
>
> 1. Delete the cached repo: run `huggingface-cli delete-cache` and select `"Model CompVis/stable-diffusion-v1-4"`. For a better CLI UI, it's best to install `huggingface_hub[cli]` first.
>
> 2. Upgrade deps with `pip install huggingface_hub==0.16.4`. We released a fix last week in the HTTP session we use. I doubt it will fix your issue but it's worth trying.
>
> 3. Retry the download.
>
> I'm sorry in advance if you have a limited connection, but this should rule out some possible causes of your bug and I'd like to try it before investigating further.

I've done everything that you mentioned and started fine-tuning, but the result is the same OSError.

emreaniloguz commented 1 year ago

> Wow, actually the issue is very intriguing :exploding_head: It seems that for some reason the safety_checker/model.safetensors and the text_encoder/model.safetensors files have been mixed up.
>
> Here are the actual sizes of the files on S3:
>
> ➜  ~ curl --head https://huggingface.co/CompVis/stable-diffusion-v1-4/resolve/main/safety_checker/model.safetensors | grep size
> x-linked-size: 1215981830
> ➜  ~ curl --head https://huggingface.co/CompVis/stable-diffusion-v1-4/resolve/main/text_encoder/model.safetensors | grep size
> x-linked-size: 492265879
>
> Given the error message you got (OSError: Consistency check failed: file should be of size 1215981833 but has size 492265879 (model.safetensors).), this cannot be a coincidence.

This is interesting :)

Wauplin commented 1 year ago

> I've done everything that you mentioned and started fine-tuning, but the result is the same OSError.

Ok, thanks for confirming. That's so weird :grimacing: I'll try to reproduce it myself and let you know.

Wauplin commented 1 year ago

Just to be sure, what happens if you delete your cache and run

from diffusers import StableDiffusionImg2ImgPipeline

model = StableDiffusionImg2ImgPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

?

emreaniloguz commented 1 year ago

> Just to be sure, what happens if you delete your cache and run
>
> from diffusers import StableDiffusionImg2ImgPipeline
>
> model = StableDiffusionImg2ImgPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
>
> ?

Here is my output:

[2023-07-10 14:01:43,910] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Downloading (…)ain/model_index.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 541/541 [00:00<00:00, 38.6kB/s]
Downloading (…)69ce/vae/config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 551/551 [00:00<00:00, 127kB/s]
Downloading model.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.22G/1.22G [00:13<00:00, 88.1MB/s]
Fetching 16 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:14<00:00,  1.09it/s]
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["id2label"]` will be overridden.

I think it's the correct model.safetensors, right?

Wauplin commented 1 year ago

Hmmm, so no errors at all when using the one from diffusers... But I wouldn't say it is because of the DreamPose implementation either, since the failing part is really an internal consistency check within huggingface_hub :thinking:

(Though now that you successfully cached the repo locally, you should be able to continue with your training. It's not fixing the actual issue but it at least unblocks you, right?)
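
For instance, once the repo is fully cached, something along these lines should load it without hitting the network at all, which is a quick way to confirm the cache is usable:

```python
from diffusers import StableDiffusionImg2ImgPipeline

# local_files_only=True makes diffusers resolve everything from the local cache.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    local_files_only=True,
)
```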

emreaniloguz commented 1 year ago

> Hmmm, so no errors at all when using the one from diffusers... But I wouldn't say it is because of the DreamPose implementation either, since the failing part is really an internal consistency check within huggingface_hub :thinking:
>
> (Though now that you successfully cached the repo locally, you should be able to continue with your training. It's not fixing the actual issue but it at least unblocks you, right?)

I'll share the result in 5 min.

emreaniloguz commented 1 year ago

The error is the same, but I think it's related to the force_download parameter that I've hardcoded into the huggingface_hub library. The code tries to download the text_encoder safetensors file. I'll revert the library to its default version and give it a try. I'll write here whether it works.

emreaniloguz commented 1 year ago

I first ran the from_pretrained snippet above, where the safetensors were okay. Then I reverted huggingface_hub to its default version, with the force_download parameter unchanged. Alas, the error remains.

Wauplin commented 1 year ago

@emreaniloguz Just to be sure, the error now is 'text_config_dict' is provided which will be used to initialize 'CLIPTextConfig'. The value 'text_config["id2label"]' will be overridden., right? So it's not related to the initial consistency check failure? If that's the case, it'd be best to open an issue on the diffusers or DreamPose repository to get some more help.


Btw, our conversation made me realize that force_download was not correctly taken into account in diffusers, hence the hardcoded value that you needed to set. I've made a PR (https://github.com/huggingface/diffusers/pull/4036), so it should be fixed in the next release or if you install from the git source.
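
Once that fix is in, something along these lines should be enough, without having to patch huggingface_hub by hand:

```python
from diffusers import StableDiffusionImg2ImgPipeline

# With the fix, force_download is forwarded down to huggingface_hub.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    force_download=True,
)
```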

emreaniloguz commented 1 year ago

To update the issue: I deleted the "revision" argument everywhere and was able to get past the problem, but the results were not what I expected. Anyone else hitting this could also give it a try.