In the meantime, I ran the experiment again. After backpropagation, prompt_embeds, add_time_ids, and pooled_prompt_embeds are still normal.
The first NaN occurs in the ControlNet forward pass after the first backpropagation:

    down_block_res_samples, mid_block_res_sample = controlnet(
        noisy_latents,
        timesteps,
        encoder_hidden_states=prompt_embeds,
        added_cond_kwargs={"text_embeds": pooled_prompt_embeds, "time_ids": add_time_ids},
        controlnet_cond=controlnet_image,
        return_dict=False,
    )
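For what it's worth, a minimal way to confirm where things first go bad (reusing the variable names from the snippet above; the parameter check assumes the optimizer step on the ControlNet has already run) would be something along these lines:

```python
import torch

# Check every ControlNet output tensor for NaNs right after the forward pass.
for i, sample in enumerate(down_block_res_samples):
    if torch.isnan(sample).any():
        print(f"NaN in down_block_res_samples[{i}]")
if torch.isnan(mid_block_res_sample).any():
    print("NaN in mid_block_res_sample")

# Also inspect the ControlNet weights: if a single bad optimizer step has
# already written NaNs into the parameters, every later forward pass is NaN too.
for name, param in controlnet.named_parameters():
    if not torch.isfinite(param).all():
        print(f"non-finite values in parameter {name}")
```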
cc: @sayakpaul
This should likely be a discussion rather than an issue, since folks were able to train successfully on the example dataset.
There could be many reasons for this kind of behaviour, but the first thing I would try is overfitting a single batch of data.
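A minimal sketch of that idea (the dataloader, optimizer, and loss helper names are placeholders, not from the actual training script): if the loss on one repeated batch does not steadily drop toward zero, or NaNs show up even here, the problem is in the training code rather than in the data.

```python
# Overfit a single batch: fetch one batch once and reuse it at every step.
batch = next(iter(train_dataloader))   # placeholder dataloader

for step in range(1000):
    optimizer.zero_grad()
    loss = compute_loss(batch)         # placeholder loss helper
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        print(step, loss.item())
```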
I'm sorry, I submitted this in the wrong place; it's also the first time I've had a problem that stayed unsolved for two days. I just tried again and found that everything works when I specify float32 for all of my models and variables. Before that I had been using float16, or mixed precision training, and both ran into the behaviour I described (well, the bug 🤦).
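For reference, the usual fp16 recipe that avoids this keeps the trainable weights in float32 and only runs the forward pass under autocast, with a gradient scaler so small fp16 gradients do not underflow. A rough sketch (the dataloader, optimizer, and loss helper names are placeholders):

```python
import torch

scaler = torch.cuda.amp.GradScaler()
controlnet.to(dtype=torch.float32)      # master weights stay in fp32

for batch in train_dataloader:          # placeholder dataloader
    optimizer.zero_grad()
    # Only the forward pass / loss computation runs in fp16.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = compute_loss(batch)      # placeholder loss helper
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)          # unscale so clipping sees true gradient values
    torch.nn.utils.clip_grad_norm_(controlnet.parameters(), 1.0)
    scaler.step(optimizer)              # skips the step if gradients are inf/NaN
    scaler.update()
```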
Describe the bug
A strange thing happened when I wrote my own code to train ControlNet: as soon as I did the first backpropagation, noise_pred became NaN. I did a lot of debugging (gradient decay, mixed precision training, removing the EMA and other parts), but the result was always NaN once backpropagation was applied.
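One way to pin down the operation that first produces the NaN is PyTorch's built-in anomaly detection, combined with a gradient check before the optimizer step. A minimal sketch (it slows training considerably, so it is only meant for debugging):

```python
import torch

# Make autograd raise an error, with a traceback, at the backward op that
# first produces NaN gradients instead of silently propagating them.
torch.autograd.set_detect_anomaly(True)

loss.backward()

# Optionally inspect gradients before stepping the optimizer.
for name, param in controlnet.named_parameters():
    if param.grad is not None and not torch.isfinite(param.grad).all():
        print(f"non-finite gradient in {name}")
```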
Reproduction
My model and dataset settings:
Logs
No response
System Info
An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.
Who can help?
No response