salesforce / ULIP


There may be a bug in data/dataset_3d.py #16

Closed auniquesun closed 1 year ago

auniquesun commented 1 year ago

I think there is a bug in data/dataset_3d.py when tokenizing the prompts from data/templates.json.

The specific line is shown below, cited from the authors' implementation.

https://github.com/salesforce/ULIP/blob/a7c625ff6c5cc0a979ab01a4a59709f4cd6c2b95/data/dataset_3d.py#L411

I think the right way is to tokenize the sentence formed by inserting the caption (the object description in ShapeNet/ModelNet40) into the prompt template, since here caption refers to a single word naming the object, and we should tokenize a sentence rather than a word. So the right way is to replace the above line with the following one.

tokenized_captions.append(self.tokenizer(template.format(caption)))
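
To make the difference concrete, here is a minimal, self-contained sketch of the two behaviours. The template strings and the tokenize stand-in below are illustrative assumptions, not the actual contents of data/templates.json or the CLIP tokenizer wrapped by the dataset class:

def tokenize(text):
    # stand-in for the real CLIP BPE tokenizer used in the repo
    return text.lower().split()

templates = ["a point cloud model of {}.", "a 3D rendering of a {}."]  # illustrative subset
caption = "airplane"  # a single object/class word, as in ShapeNet/ModelNet40

tokenized_captions = []
for template in templates:
    # buggy behaviour: tokenize only the bare caption word, ignoring the template
    #   tokenized_captions.append(tokenize(caption))
    # proposed fix: insert the caption into the template, then tokenize the full sentence
    tokenized_captions.append(tokenize(template.format(caption)))

With the fix, each element of tokenized_captions encodes a full prompt sentence rather than a repeated single word.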

I would like to discuss this problem with the authors.

Tycho-Xue commented 1 year ago

@auniquesun, thanks for spotting this, just pushed a fix for it.

auniquesun commented 1 year ago

@auniquesun, thanks for spotting this, just pushed a fix for it.

Thanks for your reply.

So I question the reported results of ULIP, since the tokenized_captions loaded by the train_loader are used directly in the ULIP forward pass. The text features extracted from the previous inputs may be quite different from those produced by the fixed implementation.

Tycho-Xue commented 1 year ago

Thanks for the concern. We actually tried both formats of captions before, and they turned out to give similar results, since the template is the same across samples and yields a sort of static feature. The currently released checkpoints are the ones I reproduced using this released repo, so you should at least be able to achieve similar results. I'll push a better and clearer version of this, maybe adding a flag to indicate which caption format to use, later today. Thanks for your patience.


Tycho-Xue commented 1 year ago

@auniquesun Hi, I just added a flag to indicate which caption format to use in the code. I think when I cleaned and moved the code, the two approaches somehow got mixed; sorry for the confusion. Also, using templates to format the meta information is just one workaround, since we don't have proper captions for the ShapeNet dataset; if the pre-training dataset has good captions of its own, no templates are needed for pre-training. Thanks for your interest in our work!
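
As a rough illustration, such a flag could look something like the following; the name use_prompt_templates and the helper are a hypothetical sketch, not the exact code that was pushed:

import random

def build_text(caption, templates, use_prompt_templates=True):
    # choose between the raw caption/metadata string and a templated prompt sentence
    if use_prompt_templates:
        # insert the object/class word into a randomly chosen prompt template
        return random.choice(templates).format(caption)
    # otherwise fall back to the caption as-is (useful when the dataset already has good captions)
    return caption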

auniquesun commented 1 year ago

@auniquesun Hi, I just added a flag to indicate which caption format to use in the code. I think when I cleaned and moved the code, the two approaches somehow got mixed; sorry for the confusion. Also, using templates to format the meta information is just one workaround, since we don't have proper captions for the ShapeNet dataset; if the pre-training dataset has good captions of its own, no templates are needed for pre-training. Thanks for your interest in our work!

@Tycho-Xue Thanks for improving the implementation. I found this problem while trying to understand the workflow of the ULIP code. The paper says ULIP used 64 pre-defined prompt templates, but the code actually did not incorporate them, so I created this issue.

I am very interested in this work because it relates to my research area, and I am currently trying some new ideas on top of it. I hope the authors can re-run pre-training and the associated experiments if possible, and release the updated checkpoints. I think the results would be better than those reported in the paper, because the templates provide more meaningful context than the bare object/class words. Your efforts will benefit the community and shed light on the power of vision+language for 3D.

Overall, thanks for sharing the paper, code and checkpoints.