GLIGEN: Open-Set Grounded Text-to-Image Generation

Questions about the implementation: #20

Estrellama opened this issue 1 year ago

Estrellama commented 1 year ago

Hi, thanks for your good work. A few small questions about the implementation:

  1. How long did your grounded-with-text experiment take to converge? How do I convert the 100K iterations you mentioned into a number of epochs? I don't know how to estimate how long convergence will take on my custom dataset.

  2. Why did you set max_boxes_per_data=30? Does it work poorly with a larger value? Any experience here?

  3. Why do the box coordinates first go through the "scale to 0-1" operation? If my image's aspect ratio is not 1:1, will this normalization cause the coordinates to no longer correspond to the original image?

I would appreciate your answers!

Yuheng-Li commented 1 year ago

1) 100k iterations generally take 1-2 days. For COCO we train our model for 100k iterations (around 80 epochs), but we train for 500k on the GoldG+SBU+CC3M+O365 datasets. We never actually observed the model fully converging, so we recommend trying at least 500k iterations if your custom dataset is large.
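As for converting iterations into epochs, the arithmetic is just the following (a sketch with made-up numbers; substitute the batch size and dataset size from your own config):

```python
def iters_to_epochs(iterations, batch_size, dataset_size):
    # One epoch = one full pass over the dataset.
    return iterations * batch_size / dataset_size

# Illustrative numbers only, not GLIGEN's actual config:
print(iters_to_epochs(100_000, 32, 40_000))  # -> 80.0 epochs
```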

2) We empirically found that most training images contain fewer than 30 bounding boxes, and in practice users rarely draw that many. Our model should still work if you set a larger value, say 100, but this will consume more memory during training.
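For intuition on the memory cost, here is a minimal, hypothetical sketch of the usual padding scheme (the `pad_boxes` helper and shapes are illustrative, not the repo's exact code): every sample is padded to a fixed number of box slots with a validity mask, so per-sample tensor size grows linearly with max_boxes_per_data.

```python
import torch

def pad_boxes(boxes, max_boxes=30):
    """boxes: (N, 4) tensor of normalized box coordinates."""
    n = min(boxes.shape[0], max_boxes)
    padded = torch.zeros(max_boxes, 4)   # fixed-size tensor regardless of N
    mask = torch.zeros(max_boxes)
    padded[:n] = boxes[:n]
    mask[:n] = 1.0                       # 1 = real box, 0 = padding to ignore
    return padded, mask
```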

3) We normalize to 0-1 simply because of the sin and cos operations (using raw coordinates should also be fine; some NeRF papers do that). In our dataloader, we crop the image to 512×512 and recompute the box coordinates if the training image is not square.
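A minimal sketch of those two steps (not the repo's exact dataloader, just the idea): center-crop a non-square image to 512×512, recompute the boxes against the crop, then scale them to [0, 1] before the sin/cos (Fourier-style) embedding.

```python
import numpy as np

def crop_and_normalize_boxes(boxes, img_w, img_h, crop=512):
    """boxes: (N, 4) array of (x1, y1, x2, y2) in pixels of the original image."""
    scale = crop / min(img_w, img_h)          # resize so the short side == crop
    boxes = boxes.astype(np.float64) * scale
    off_x = (img_w * scale - crop) / 2        # center-crop offsets on the long side
    off_y = (img_h * scale - crop) / 2
    boxes[:, [0, 2]] -= off_x
    boxes[:, [1, 3]] -= off_y
    return np.clip(boxes, 0, crop) / crop     # coordinates now in [0, 1]

def fourier_embed(coords, num_freqs=8):
    """Sinusoidal embedding; keeping coords in [0, 1] keeps the frequencies well-behaved."""
    freqs = 2.0 ** np.arange(num_freqs)              # 1, 2, 4, ...
    angles = 2 * np.pi * coords[..., None] * freqs   # (..., num_freqs)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
```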

Estrellama commented 1 year ago

Thank you for your sincere reply. Regarding answer 1: I see the same phenomenon, the loss keeps decreasing with no sign of convergence while I am training.

There is another evaluation question that has been bothering me for a long time. During training, how can I see in real time whether the model has adapted to the grounding condition we introduced? At the moment I just look at a few intermediate sampling results saved during training, and judging a handful of images by eye is very unintuitive.

Any good suggestions, thanks!
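(Not answered in this thread, but one common way to make this quantitative, similar in spirit to the detector-based score the GLIGEN paper reports on COCO, is to run an off-the-shelf detector on intermediate samples and count how many of the input grounding boxes get matched. A hedged sketch using torchvision's pretrained Faster R-CNN; the `grounding_score` helper and thresholds are illustrative:)

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.ops import box_iou

# Off-the-shelf detector (torchvision >= 0.13 for the `weights` argument).
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def grounding_score(images, gt_boxes_per_image, iou_thresh=0.5, score_thresh=0.5):
    """images: list of (3, H, W) float tensors in [0, 1];
    gt_boxes_per_image: list of (N, 4) tensors of input grounding boxes in pixels."""
    hits, total = 0, 0
    for img, gt in zip(images, gt_boxes_per_image):
        pred = detector([img])[0]
        det = pred["boxes"][pred["scores"] > score_thresh]
        total += len(gt)
        if len(det) and len(gt):
            iou = box_iou(gt, det)  # (N_gt, N_det) pairwise IoU
            hits += (iou.max(dim=1).values > iou_thresh).sum().item()
    return hits / max(total, 1)  # fraction of input boxes matched by a detection
```

Tracking this score over checkpoints gives a single curve that rises as the model learns to respect the grounding, instead of eyeballing sample grids.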

Bailey-24 commented 7 months ago

Why does GLIGEN always output 512×512? How can I get 640×480 output?