vince2003 opened this issue 3 years ago
Hi, I implemented `BeitForMaskedImageModeling`. It returns the exact same logits as the original implementation for the same `pixel_values` and `bool_masked_pos`. However, when testing it out on an image, none of the visual tokens it predicts correspond to the ground-truth visual tokens from DALL-E's tokenizer.
@donglixp it would be great if you could take a look at my notebook: https://colab.research.google.com/drive/1Mjt-3jHw9HYMXECmSdDlbiG59ZAw-Z0T?usp=sharing
@addf400 could you also look at the above notebook to double-check?
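For reference, the check described above (comparing the tokens BEiT predicts against the DALL-E tokenizer's ground truth at the masked positions) can be sketched as a small helper. The array shapes and the function name are assumptions for illustration, not code from the linked notebook:

```python
import numpy as np

def masked_token_accuracy(logits, gt_tokens, bool_masked_pos):
    """Fraction of masked patches where the BEiT prediction matches the
    DALL-E tokenizer's ground-truth visual token.

    logits: (num_patches, vocab_size) array of BEiT output logits
    gt_tokens: (num_patches,) ground-truth token ids from the dVAE tokenizer
    bool_masked_pos: (num_patches,) boolean mask of the patches that were masked
    """
    pred_tokens = logits.argmax(axis=-1)  # predicted visual token per patch
    hits = pred_tokens[bool_masked_pos] == gt_tokens[bool_masked_pos]
    return hits.mean()
```

This only measures exact token matches at the masked positions, which is what the notebook comparison does.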
Is the OpenAI public decoder (https://cdn.openai.com/dall-e/decoder.pkl) perhaps slightly different from the one used in this work? I am having the same issue where the reconstructed outputs from `BeitForMaskedImageModeling` are much lower quality than the input, even when no masking is applied. However, some similarity is there. I'd be interested to know if there has been any progress on this.
@vvvm23 Yes, we used this one.
```shell
# Download the tokenizer weights from OpenAI's DALL-E
TOKENIZER_PATH=/path/to/save/dall_e_tokenizer_weight
mkdir -p $TOKENIZER_PATH
wget -O $TOKENIZER_PATH/encoder.pkl https://cdn.openai.com/dall-e/encoder.pkl
wget -O $TOKENIZER_PATH/decoder.pkl https://cdn.openai.com/dall-e/decoder.pkl
```
@NielsRogge, @addf400 I used the same approach as NielsRogge's notebook for BEiT model inference. I ran his notebook code multiple times, each time with a different `bool_masked_pos`. Among the 75 masked positions, I got either 0, 1, or 2 correct predictions. So NielsRogge's observation of 0 correct predictions is just chance; over multiple runs it can be 1 or 2.
However, 0/1/2 correct is still quite low. Is this expected?
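For context, those counts can be compared against pure chance. Assuming DALL-E's dVAE codebook size of 8192 and the 75 masked patches mentioned above, a uniform random guesser would almost never match even one token:

```python
# Expected number of correct tokens if predictions were uniform random
# guesses over the DALL-E dVAE codebook (8192 visual tokens).
vocab_size = 8192
num_masked = 75

p_correct = 1 / vocab_size
expected_hits = num_masked * p_correct               # tokens per run
p_at_least_one = 1 - (1 - p_correct) ** num_masked   # fraction of runs

print(f"expected correct tokens per run: {expected_hits:.4f}")
print(f"P(at least one correct): {p_at_least_one:.4%}")
```

So consistently getting 1–2 exact matches is well above chance; note also that exact token accuracy is a strict metric, since different codebook entries can decode to visually similar patches.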
@LiweiPeng @NielsRogge @addf400 Have any of you been able to figure this out? I am getting similar results, the predictions for the masked areas are wrong and as a result the reconstructed (decoded) image has wrong and random values in the masked areas.
Based on my tests, my 'bad' results were expected for the BEiT model. I used linear probing in my tests, and linear probing with the BEiT base model doesn't work well. The reviewed BEiT paper at https://openreview.net/pdf?id=p-BhZSz59o4 (Table 9) has some detailed results.
I'm not sure I understand: linear probing relates to the final class prediction. The wrong predictions I am seeing are of the visual tokens; I would expect reconstruction of the correct tokens to work even with the pretrained features, since no class-specific information is needed for that.
Sorry, I wasn't clear in my previous post. In my case, I am using linear probing for the final class prediction, and the linear-probing result is not good compared to some other models like DINO. For your specific problem, I am not sure.
Dear Authors, I want to reconstruct the partially masked image back to the original image. However, the final results differ from the original image. I use:
- `BeitForMaskedImageModeling` as the encoder (from https://huggingface.co/transformers/master/model_doc/beit.html)
- the decoder from dall_e (from https://github.com/openai/DALL-E)

1) Can you tell me why the reconstructed image and the original image differ so much?
2) Can you upload the final checkpoint of the decoder from self-supervised pre-training for reconstructing masked images?

This is my code: https://github.com/vince2003/recontruction/blob/main/beit_dall_simple.ipynb
Thank you!!!
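In case it helps debugging pipelines like the one above: a rough sketch of the token-merging step that feeds the DALL-E decoder. BEiT's predictions replace the ground-truth tokens only at the masked positions, and the merged grid is one-hot encoded into the `(1, vocab, H, W)` layout the dVAE decoder consumes. The 14×14 grid size and all function names are assumptions for illustration, not code from the notebook:

```python
import numpy as np

def tokens_to_decoder_input(gt_tokens, pred_tokens, bool_masked_pos,
                            vocab_size=8192, grid=14):
    """Replace masked-patch tokens with BEiT predictions and one-hot
    encode the result for the DALL-E decoder.

    gt_tokens: (grid*grid,) token ids from the dVAE encoder
    pred_tokens: token ids predicted by BEiT, one per masked position
    bool_masked_pos: (grid*grid,) boolean mask
    returns: (1, vocab_size, grid, grid) float32 one-hot array
    """
    tokens = gt_tokens.copy()
    tokens[bool_masked_pos] = pred_tokens  # splice in BEiT's guesses
    one_hot = np.eye(vocab_size, dtype=np.float32)[tokens]  # (N, vocab)
    return one_hot.reshape(grid, grid, vocab_size).transpose(2, 0, 1)[None]
```

Note that even with all ground-truth tokens (no masking at all), the dVAE reconstruction is lossy, which may account for part of the quality drop reported earlier in this thread.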