lyc0929 / OOTDiffusion-train


train result (100epoch) is not good. where is it wrong? #22

Closed failbetter77 closed 5 months ago

failbetter77 commented 5 months ago

I'm training from scratch.

The results are not good at all, and I don't know where I went wrong.

Training scheduler: { "_class_name": "PNDMScheduler", "_diffusers_version": "0.6.0", "beta_end": 0.012, "beta_schedule": "scaled_linear", "beta_start": 0.00085, "num_train_timesteps": 1000, "set_alpha_to_one": false, "skip_prk_steps": true, "steps_offset": 1, "trained_betas": null, "clip_sample": false }

clip-vit-large-patch14, preprocessor_config.json { "crop_size": 224, "do_center_crop": true, "do_normalize": true, "do_resize": true, "feature_extractor_type": "CLIPFeatureExtractor", "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "resample": 3, "size": 224 }
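For reference, a minimal sketch of how these two configs can be loaded with diffusers/transformers; the local checkpoint path below is an assumption, not something stated in this thread:

```python
# Sketch only: load the scheduler and CLIP image processor configs posted above.
from diffusers import PNDMScheduler
from transformers import CLIPImageProcessor

# The PNDMScheduler block above, assumed to be saved as
# checkpoints/ootd/scheduler/scheduler_config.json (hypothetical path).
noise_scheduler = PNDMScheduler.from_pretrained(
    "checkpoints/ootd", subfolder="scheduler"
)

# The preprocessor_config.json block above ships with openai/clip-vit-large-patch14;
# CLIPImageProcessor is the current name for CLIPFeatureExtractor.
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

print(noise_scheduler.config.beta_schedule)  # "scaled_linear"
print(image_processor.size)                  # 224 (or {"shortest_edge": 224} in newer versions)
```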

[image: training results after 100 epochs]

GeneralWhite commented 5 months ago

Hello, what does the loss look like during your training process? I have only trained for a few dozen epochs so far, but found that the model did not converge. 😕 😕

failbetter77 commented 5 months ago

> Hello, what does the loss look like during your training process? I have only trained for a few dozen epochs so far, but found that the model did not converge. 😕 😕

It starts around 0.1 at epoch 0 and drops to about 0.02 by epoch 100.

GeneralWhite commented 5 months ago

[image: loss curve] I have only trained for a dozen epochs so far. My loss has been oscillating around 0.02 but has not converged. I suspect that even if I continue training for 100 epochs, the loss will still not converge and will keep oscillating around 0.02. 🤔 🤔

failbetter77 commented 5 months ago

> [image: loss curve] I have only trained for a dozen epochs so far. My loss has been oscillating around 0.02 but has not converged. I suspect that even if I continue training for 100 epochs, the loss will still not converge and will keep oscillating around 0.02. 🤔 🤔

My loss curve looks the same as yours. How about your results? Can you share them?

GeneralWhite commented 5 months ago

>> [image: loss curve] I have only trained for a dozen epochs so far. My loss has been oscillating around 0.02 but has not converged. I suspect that even if I continue training for 100 epochs, the loss will still not converge and will keep oscillating around 0.02. 🤔 🤔
>
> My loss curve looks the same as yours. How about your results? Can you share them?

Sorry, I set it up to save the model every 50 epochs, but I haven't reached 50 epochs yet. :disappointed::disappointed: I will send you the results when I finish training.

Thank you for your help.

sonnv174 commented 5 months ago

I'm trying to train on 2x A5000 (2x 24GB) with batch size = 1, but I still get CUDA out of memory. Could you share your setup (GPU, accelerate config, ...) for training from scratch?

I set it up to train with LoRA, and the loss over 30 epochs was similar to yours.
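(For context, a rough sketch of one common way to set up LoRA training on a diffusers UNet with peft; sonnv174's actual script is not shown, and the base model path, rank, and target modules below are assumptions.)

```python
# Sketch only: attach LoRA adapters to a diffusers UNet so that only the
# low-rank adapter weights are trained.
from diffusers import UNet2DConditionModel
from peft import LoraConfig

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"  # base model is an assumption
)
unet.requires_grad_(False)  # freeze the original UNet weights

lora_config = LoraConfig(
    r=4,                     # rank of the update matrices (assumed value)
    lora_alpha=4,
    init_lora_weights="gaussian",
    target_modules=["to_k", "to_q", "to_v", "to_out.0"],  # attention projections
)
unet.add_adapter(lora_config)  # only the LoRA parameters have requires_grad=True

trainable = sum(p.numel() for p in unet.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")  # far fewer than full fine-tuning
```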

GeneralWhite commented 5 months ago

> I'm trying to train on 2x A5000 (2x 24GB) with batch size = 1, but I still get CUDA out of memory. Could you share your setup (GPU, accelerate config, ...) for training from scratch?
>
> I set it up to train with LoRA, and the loss over 30 epochs was similar to yours.

I attempted training on 4x 4090 GPUs (24GB each), but regardless of how I adjusted the parameters, I ran into OOM (out of memory) errors. Now I'm using an A100 GPU to train this model.

Sheldongg commented 5 months ago

These results are really poor.

sonnv174 commented 5 months ago

> I attempted training on 4x 4090 GPUs (24GB each), but regardless of how I adjusted the parameters, I ran into OOM (out of memory) errors. Now I'm using an A100 GPU to train this model.

Oh, thanks for your reply. It's probably not possible to train this pipeline on a 24GB GPU.
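(For anyone who still wants to try on 24GB cards, below is a rough sketch of memory-saving options commonly combined in accelerate/diffusers training loops. It is not this repo's configuration, and the model path and hyperparameters are assumptions.)

```python
# Sketch only: common memory reducers for UNet training on limited VRAM.
import bitsandbytes as bnb
from accelerate import Accelerator
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"  # base model is an assumption
)

unet.enable_gradient_checkpointing()               # trade extra compute for activation memory
unet.enable_xformers_memory_efficient_attention()  # requires xformers to be installed

# fp16 mixed precision plus gradient accumulation instead of a larger batch size.
accelerator = Accelerator(mixed_precision="fp16", gradient_accumulation_steps=4)

# 8-bit AdamW (bitsandbytes) greatly shrinks optimizer-state memory vs. torch.optim.AdamW.
optimizer = bnb.optim.AdamW8bit(unet.parameters(), lr=1e-5)
```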