hustvl / GaussianDreamer

GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models (CVPR 2024)
https://taoranyi.com/gaussiandreamer/
Apache License 2.0

Assistance Requested for Replicating CLIP Score Calculations from Your Paper #35

Open · Zhang-Jiahui opened this issue 6 months ago

Zhang-Jiahui commented 6 months ago

Dear Author,

I hope this message finds you well.

Firstly, I would like to extend my sincere compliments on your remarkable work. It has greatly assisted us in our research endeavors. However, I have encountered some challenges regarding the computational method used for the values in Table 1, specifically titled "Quantitative comparisons on CLIP [55] similarity with other methods."

In my attempt to replicate the results for GaussianDreamer using CLIP, I was unable to reproduce the reported scores of 27.23 ± 0.06 and 41.88 ± 0.04 as presented in your paper. My approach was to generate 10 random images based on the camera angles described in your paper, after which I used the ViT-L/14 and ViT-bigG-14 models to compute the CLIP scores. I successfully generated results for 411 out of the 415 prompts provided in the Dreamfusion project for this computation.
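For reference, my computation roughly follows the sketch below. The helper function here is only illustrative of my own setup (the model name corresponds to the ViT-L/14 checkpoint I used, with the analogous open-CLIP checkpoint for ViT-bigG-14); it is not taken from your evaluation code.

```python
# Illustrative sketch of my CLIP-similarity computation (not the authors' code).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-large-patch14"  # ViT-L/14; I use the LAION ViT-bigG-14 checkpoint analogously
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

@torch.no_grad()
def clip_similarity(image_paths, prompt):
    """Mean cosine similarity between one text prompt and a set of rendered views."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[prompt], images=images, return_tensors="pt", padding=True)
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    # normalize and take the cosine similarity, averaged over the rendered views
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).mean().item()
```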

The outcomes of my calculations are illustrated in the attached image.

Could you kindly offer any guidance or share the specific code used for computing the CLIP scores as per your study? It would be incredibly helpful in understanding how to replicate the results you have achieved in your paper.

Thank you very much for your time and consideration. I am looking forward to your valuable response.

Best regards.

taoranyi commented 5 months ago

Dear Jiahui,

Regarding the issue of large variance in the evaluation results, I have checked my code against your description, and I suspect it is a matter of the evaluation viewpoints. When choosing the evaluation method, we found that randomly selecting both azimuth and elevation can lead to a significant variance in the evaluation. To reduce this randomness, we adopted a compromise: during evaluation, we fix the elevation and randomly select only the azimuth values, which avoids a large variance. In practice, we randomly select images saved at save-it1200. The specific code can be found in the attachment; a rough sketch of the view sampling is given below.

Best wishes.

clip_sim.txt
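Schematically, the view sampling works as follows. This is an illustrative sketch with assumed camera conventions and default values, not the exact code in the attachment:

```python
# Illustrative view sampling: the elevation is fixed, only the azimuth is random.
# Camera conventions and default values here are assumptions for illustration.
import numpy as np

def sample_eval_views(n_views=10, elevation_deg=15.0, radius=3.5, seed=0):
    rng = np.random.default_rng(seed)
    azimuth_deg = rng.uniform(0.0, 360.0, size=n_views)  # random azimuth per view
    elevation = np.deg2rad(elevation_deg)                # fixed elevation for every view
    azimuth = np.deg2rad(azimuth_deg)
    # camera centers on a sphere of the given radius, looking at the origin
    positions = radius * np.stack([
        np.cos(elevation) * np.cos(azimuth),
        np.cos(elevation) * np.sin(azimuth),
        np.full(n_views, np.sin(elevation)),
    ], axis=-1)
    return azimuth_deg, positions
```

With the elevation fixed, only the azimuth varies between evaluation runs, which is what keeps the reported spread small.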

Zhang-Jiahui commented 5 months ago

Dear Author,

I hope this message finds you in good health and spirits.

Following my previous communication regarding the replication of the CLIP scores presented in your paper on GaussianDreamer, I have attempted to use the code you provided to reproduce the results. Despite my efforts, I still find discrepancies between the scores reported in your paper and those I am able to achieve.

Specifically, using the provided code and the same methodology described in your paper, I am unable to replicate the CLIP scores of 27.23 ± 0.06 and 41.88 ± 0.04. I have carefully followed the instructions, generating 10 random images based on the specified camera angles and computing the CLIP scores using both the ViT-L/14 and ViT-bigG-14 models, and I have successfully processed 411 out of the 415 prompts from the Dreamfusion project.

The results of my latest attempts are included in the attached files for your review. The scores are essentially the same whether I use your code or my own CLIP code.

GaussianDreamer_0_laion-CLIP-ViT-bigG-14-laion2B-39B-b160k.json GaussianDreamer_0_openai-clip-vit-large-patch14.json

Could you please provide additional insights or possibly a more detailed version of the code used in your study? Any additional parameters, configurations, or preprocessing steps that might be crucial for achieving the reported scores would be immensely helpful.

In addition, upon executing the provided code, I observed that the CLIP scores I obtained were consistently within the range of 0 to 1, for example a mean of 0.40. This differs from the scores reported in your paper, which appear to have been multiplied by 100, resulting in values like 41.88. It seems that while the mean scores were scaled up by a factor of 100, the standard deviations were not adjusted accordingly and remain at the original values produced by the code.

This discrepancy could lead to confusion among readers and researchers attempting to replicate your results. It would be helpful to clarify whether the scaling of the mean scores was intentional and, if so, why the standard deviations were not similarly adjusted; it looks like an error to me. To make the question concrete, I illustrate it below with hypothetical numbers.
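The values below are made up purely to illustrate the question; they are not my actual results:

```python
# Hypothetical per-run mean similarities in [0, 1], purely to illustrate the scaling question.
import numpy as np

run_means = np.array([0.40, 0.43, 0.41, 0.42, 0.44])
mean, std = run_means.mean(), run_means.std()

print(f"raw values:          {mean:.4f} +/- {std:.4f}")
print(f"only mean scaled:    {100 * mean:.2f} +/- {std:.4f}")     # what I suspect the table shows
print(f"both scaled by 100:  {100 * mean:.2f} +/- {100 * std:.2f}")  # consistent scaling
```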

Could you please provide an explanation for this scaling method and confirm whether the values in the paper should be interpreted as scaled or unscaled? Your clarification on this matter would be invaluable for ensuring the accuracy and integrity of research based on your work.

Thank you very much for your attention to this matter. I look forward to your response and any additional guidance you may provide.

Best regards.

taoranyi commented 4 months ago

Hello Jiahui,

Regarding the CLIP similarity issue, we present the JSON outputs of our calculations below. Unlike other approaches and the results you have shared, we used all 415 prompts, as noted in Section A.1 of our paper. For the two ViT models, ViT-L/14 and ViT-bigG-14, we compute the similarity ten times, each time with different camera poses for the 3D assets, but all based on the same single generation result. Hence, our ±0.06 and ±0.04 denote the variance across sampling different camera poses within that same generation, i.e., the variation across these ten computations, rather than differences between separate generation results. We do not assess the discrepancies that arise from different seeds and PyTorch versions affecting the generation itself. Since PyTorch and Lightning versions have a significant impact on the generation results, we specify the versions used in our evaluation, PyTorch 2.0.1 and Lightning 2.0.0, to ensure reproducibility.

As for the table, following Instant3D we scale both the variance and the score by a factor of 100, so there is no issue of the variance not being amplified accordingly.

ViT-bigG-14_415_10_random_0.json ViT-bigG-14_415_10_random_1.json ViT-bigG-14_415_10_random_2.json ViT-bigG-14_415_10_random_3.json ViT-bigG-14_415_10_random_4.json ViT-bigG-14_415_10_random_5.json ViT-bigG-14_415_10_random_6.json ViT-bigG-14_415_10_random_7.json ViT-bigG-14_415_10_random_8.json ViT-bigG-14_415_10_random_9.json ViT-L14_415_10_random_0.json ViT-L14_415_10_random_1.json ViT-L14_415_10_random_2.json ViT-L14_415_10_random_3.json ViT-L14_415_10_random_4.json ViT-L14_415_10_random_5.json ViT-L14_415_10_random_6.json ViT-L14_415_10_random_7.json ViT-L14_415_10_random_8.json ViT-L14_415_10_random_9.json
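Schematically, the aggregation over the ten runs looks like the following. This is only an illustrative sketch; the exact structure of the attached JSON files may differ from what is assumed here.

```python
# Schematic aggregation of the ten per-run results into a "mean +/- spread" table entry.
# The JSON structure (prompt -> raw similarity in [0, 1]) is an assumption for illustration.
import json
import numpy as np

run_means = []
for i in range(10):
    with open(f"ViT-bigG-14_415_10_random_{i}.json") as f:
        scores = json.load(f)                      # assumed: mapping prompt -> raw similarity
    run_means.append(np.mean(list(scores.values())))  # average over all 415 prompts

run_means = np.array(run_means)
# Both the mean and the spread across the ten camera-pose samplings are scaled by 100,
# which is how table entries such as 41.88 +/- 0.04 are obtained.
print(f"{100 * run_means.mean():.2f} +/- {100 * run_means.std():.2f}")
```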