Great work! However, I noticed that some details about the model training seem to be missing from the paper (or perhaps I overlooked them; were they disclosed elsewhere?). I would greatly appreciate it if you could clarify the following:
Which version of the data was used to train the MiraDiT model (the results in Table 3)? Was it the 330K, 93K, 42K, or 9K version?
Which version of the captions was used during training? Was it the short, dense, or structural captions?
Thank you in advance for your response!