Closed yiyixuxu closed 1 year ago
ohh, sometimes it works (despite the error message) https://huggingface.co/YiYiXu/fill-circle-controlnet
@Wauplin cc.
Hi @yiyixuxu, I took a look at the code (both the script and huggingface_hub internals).
About the error message, here's why it's happening:
The training script calls repo.push_to_hub(..., blocking=False) at the end of the training. blocking=False means the push is run in the background. However, since it is the last line that is executed, the script exits just after.
huggingface_hub prevents the script from exiting if all commands are not completed. This is done using atexit.register(self.wait_for_commands).
In wait_for_commands, a while loop checks every second whether the commands are all completed. If not, it logs an error message "waiting for the following commands to complete (...)" and waits for 10 seconds.
=> So actually, this is not an error and the script works exactly as expected. If you wait long enough, the push_to_hub command will eventually complete and your script will exit gracefully.
=> I think the only problem is that we log the "waiting for..." message as an ERROR, which is misleading. Since it was implemented 18 months ago (https://github.com/huggingface/huggingface_hub/pull/315) and is still widely used, I'm a bit reluctant to change it without a second opinion. @LysandreJik @sgugger is that still used a lot in transformers as well? Would it be ok to log only a warning to make it less scary for users?
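The exit-time wait described above can be sketched with the standard library alone. This is only an illustration of the pattern, not the huggingface_hub implementation: BackgroundRepo, its attributes, the timings, and the use of a thread with time.sleep in place of a real upload are all assumptions made for the example.

```python
import atexit
import logging
import threading
import time

logger = logging.getLogger("push_sketch")


class BackgroundRepo:
    """Hypothetical stand-in for a repo object that can push in the background."""

    def __init__(self, level=logging.ERROR):
        # The thread describes the message as logged at ERROR level;
        # pass logging.WARNING to try the proposed alternative.
        self.level = level
        self.threads_in_progress = []
        # Hold the interpreter open until every background push has finished.
        atexit.register(self.wait_for_commands)

    def push_to_hub(self, duration=0.2, blocking=True):
        # time.sleep stands in for the real upload (assumption).
        thread = threading.Thread(target=time.sleep, args=(duration,), daemon=True)
        thread.start()
        if blocking:
            thread.join()
        else:
            self.threads_in_progress.append(thread)

    def wait_for_commands(self, poll_interval=0.05, log_every=10):
        polls = 0
        while any(t.is_alive() for t in self.threads_in_progress):
            if polls % log_every == 0:
                # Nothing has failed here: the message only says we are waiting.
                logger.log(self.level, "Waiting for background commands to complete")
            polls += 1
            time.sleep(poll_interval)
        self.threads_in_progress.clear()


repo = BackgroundRepo()
repo.push_to_hub(blocking=False)  # returns immediately, push keeps running
pending_after_call = len(repo.threads_in_progress)
repo.wait_for_commands()  # the same call atexit makes when the script ends
pending_after_wait = len(repo.threads_in_progress)
```

Because the handler is registered with atexit, a script whose last statement is the non-blocking push does not lose the upload: it just appears to hang, printing the "waiting" message, until the push is done.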
Another short-term solution for diffusers is to set repo.push_to_hub(..., blocking=True), which will block the script at the end of the training instead of running the push in the background.
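To make the difference between the two settings concrete, here is a small sketch using only the standard library. The push helper and fake_upload are hypothetical names for this example and do not reflect how huggingface_hub runs the push internally.

```python
import threading


def push(upload, blocking=True):
    """Hypothetical helper: run `upload` in a thread, optionally waiting for it."""
    thread = threading.Thread(target=upload)
    thread.start()
    if blocking:
        thread.join()  # the caller only continues once the upload is done
    return thread


done = []
release = threading.Event()


def fake_upload():
    release.wait()  # pretend the upload lasts until `release` is set
    done.append("uploaded")


# Non-blocking: the call returns while the upload is still pending.
background = push(fake_upload, blocking=False)
finished_on_return_nonblocking = len(done) == 1
release.set()
background.join()

# Blocking: by the time the call returns, the upload has completed.
push(fake_upload, blocking=True)
finished_on_return_blocking = len(done) == 2
```

With blocking=True there is nothing left to wait for at exit, so the "waiting for..." message never has a reason to appear.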
I also opened a related issue (https://github.com/huggingface/diffusers/issues/2860) to update the training scripts. It's not about solving an issue but more about improving the UX.
We don't rely on the level of the log in Transformers, so it's completely fine for me if it's downgraded to warning.
Ok, thanks for the quick feedback @sgugger. I think I'll update the log level then. I created an issue for it: https://github.com/huggingface/huggingface_hub/issues/1412.
Thanks @Wauplin for the clarification! And yeah, downgrading it to a warning will be really helpful :)
Describe the bug
Not able to use the --push_to_hub option for TPU training; I get an error.
This is not unique to the train_text_to_image_flax.py script; I'm just using it as an example. Basically, this line will always fail when called during training on a TPU: https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_flax.py#L584
Reproduction
Run the train_text_to_image_flax script with the command given here: https://github.com/huggingface/diffusers/tree/main/examples/text_to_image#training-with-flaxjax
Logs
System Info
tpu-v4-8