not able to use `push_to_hub` during tpu training

yiyixuxu commented 1 year ago

Describe the bug

Not able to use the --push_to_hub option for TPU training.

Getting this error:

Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:15:59 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].

This is not unique to the train_text_to_image_flax.py script; I'm just using it as an example. Basically, this line will always fail when called during training on a TPU: https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_flax.py#L584

Reproduction

Run the train_text_to_image_flax.py script from here with the command below:

https://github.com/huggingface/diffusers/tree/main/examples/text_to_image#training-with-flaxjax

export MODEL_NAME="duongna/stable-diffusion-v1-4-flax"
export dataset_name="lambdalabs/pokemon-blip-captions"
export OUTPUT_DIR="/pokemon"
export HUB_MODEL_ID="pokemon-lora"

python3 train_text_to_image_flax.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --mixed_precision="fp16" \
  --max_train_steps=150 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --output_dir="sd-pokemon-model" \
  --push_to_hub \
  --hub_model_id=${HUB_MODEL_ID}

Logs

Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:13:38 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:13:48 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:13:58 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:14:08 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:14:18 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:14:28 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:14:38 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:14:48 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:14:58 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:15:09 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:15:19 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:15:29 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:15:39 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:15:49 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:15:59 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].

System Info

tpu-v4-8

yiyixuxu commented 1 year ago

ohh, sometimes it works (despite the error message) https://huggingface.co/YiYiXu/fill-circle-controlnet

sayakpaul commented 1 year ago

@Wauplin cc.

Wauplin commented 1 year ago

Hi @yiyixuxu, I took a look at the code (both the script and huggingface_hub internals).

About the error message, here's why it's happening:

The script calls repo.push_to_hub(..., blocking=False) at the end of training. blocking=False means the push runs in the background. However, since it is the last line executed, the script exits right after.
huggingface_hub prevents the script from exiting until all commands have completed. This is done using atexit.register(self.wait_for_commands).
In wait_for_commands, a while loop checks every second whether all commands have completed. If not, it logs an error message "waiting for the following commands to complete (...)" and waits for 10 seconds.
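The three steps above can be illustrated with a small, self-contained sketch. This is not the huggingface_hub implementation: `BackgroundPusher`, its `push` method, and the `poll_interval` / `log_every` parameters are hypothetical names made up for illustration, and the "push" is simulated by a child process that just sleeps. The polling and logging cadence is an assumption based on the description above, and the message is logged as a warning here rather than an error.

```python
import atexit
import logging
import subprocess
import sys
import time

logger = logging.getLogger("push_sketch")


class BackgroundPusher:
    """Hypothetical stand-in for a repo object that can push in the background."""

    def __init__(self):
        self._processes = []
        # Step 2: register a hook so the interpreter cannot exit
        # while a background command is still running.
        atexit.register(self.wait_for_commands)

    def push(self, duration, blocking=False):
        """Simulate a push with a child process that sleeps for `duration` seconds."""
        process = subprocess.Popen(
            [sys.executable, "-c", "import sys, time; time.sleep(float(sys.argv[1]))", str(duration)]
        )
        self._processes.append(process)
        if blocking:
            # Blocking mode: return only once the simulated push is done.
            process.wait()

    @property
    def commands_in_progress(self):
        # poll() returns None for as long as the child process is still running.
        return [p for p in self._processes if p.poll() is None]

    def wait_for_commands(self, poll_interval=1.0, log_every=10):
        """Step 3: poll until all commands are done; return the number of polls."""
        polls = 0
        while self.commands_in_progress:
            if polls % log_every == 0:
                logger.warning(
                    "Waiting for %d command(s) to finish before shutting down.",
                    len(self.commands_in_progress),
                )
            polls += 1
            time.sleep(poll_interval)
        return polls
```

If a script ends right after `BackgroundPusher().push(duration=30)`, the process does not terminate immediately: the atexit hook keeps it alive and repeats the "Waiting for..." message until the child process finishes, which matches the pattern seen in the logs above.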

=> So actually, this is not an error and the script works exactly as expected. If you wait long enough, the push_to_hub command will eventually be completed and your script will gracefully exit.

=> I think the only problem is that we log the "waiting for..." message as an ERROR, which is misleading. Since it was implemented 18 months ago (https://github.com/huggingface/huggingface_hub/pull/315) and is still widely used, I'm a bit reluctant to change it without a second opinion. @LysandreJik @sgugger is that still used a lot in transformers as well? Would it be ok to log only a warning to make it less scary for users?

Wauplin commented 1 year ago

Another short-term solution for diffusers is to set repo.push_to_hub(..., blocking=True), which will block the script at the end of training until the push completes instead of running it in the background.
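A minimal sketch of that workaround, assuming `repo` is the repository object the training script already creates. The helper name `push_final_model` and the commit message are made up for illustration; only `push_to_hub` and its `blocking` argument come from the discussion above.

```python
def push_final_model(repo, commit_message="End of training"):
    """Push the final weights and return only once the push has completed.

    `repo` is assumed to expose push_to_hub(commit_message=..., blocking=...),
    as the repository object used by the training script does.
    """
    # blocking=True makes the call wait for the push, so nothing is still
    # running in the background when the script reaches its end.
    return repo.push_to_hub(commit_message=commit_message, blocking=True)
```

With this change the script simply takes longer on its last line, and the exit hook has nothing left to wait for.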

Wauplin commented 1 year ago

I also opened a related issue (https://github.com/huggingface/diffusers/issues/2860) to update the training scripts. It's not about solving an issue but more about improving the UX.

sgugger commented 1 year ago

We don't rely on the level of the log in Transformers, so it's completely fine for me if it's downgraded to warning.

Wauplin commented 1 year ago

Ok, thanks for the quick feedback @sgugger. I think I'll update the log level then. I created an issue for it: https://github.com/huggingface/huggingface_hub/issues/1412.

yiyixuxu commented 1 year ago

thanks @Wauplin for the clarification! and yeah, downgrading it to a warning will be really helpful :)
