microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License
20.08k stars 2.55k forks

reconstructed images are different from the original image in self-supervised learning for BeitForMaskedImageModeling #401

Open vince2003 opened 3 years ago

vince2003 commented 3 years ago

Dear Authors, I want to reconstruct a partially masked image back to the original image. However, the final results are quite different from the original image. I use:

- BeitForMaskedImageModeling as the encoder (from https://huggingface.co/transformers/master/model_doc/beit.html)
- the decoder from DALL-E (from https://github.com/openai/DALL-E)

1) Can you tell me why the reconstructed image differs so much from the original image?
2) Can you upload the final checkpoint of the decoder used in self-supervised pre-training for reconstructing masked images?

This is my code: https://github.com/vince2003/recontruction/blob/main/beit_dall_simple.ipynb

Thank you!!!
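The awkward part of wiring BeitForMaskedImageModeling to the DALL-E decoder is converting predicted visual-token ids into the one-hot grid the decoder expects. A shape-level sketch, with random logits standing in for the model output (a 14x14 token grid is assumed, matching a 224px input with 16px patches; the dVAE codebook size of 8192 is from OpenAI's DALL-E):

```python
import torch
import torch.nn.functional as F

vocab_size = 8192          # DALL-E dVAE codebook size
grid = 14                  # 224 / 16 = 14 patches per side for BEiT-base
num_patches = grid * grid  # 196

# Stand-in for outputs.logits from BeitForMaskedImageModeling
# (real shape: batch x num_patches x vocab_size)
logits = torch.randn(1, num_patches, vocab_size)

# Predicted visual-token id for each patch position
token_ids = logits.argmax(dim=-1)                 # shape (1, 196)

# DALL-E's decoder expects a float one-hot grid of shape (batch, vocab, H, W)
z = F.one_hot(token_ids, num_classes=vocab_size)  # (1, 196, 8192)
z = z.permute(0, 2, 1).reshape(1, vocab_size, grid, grid).float()

print(z.shape)  # torch.Size([1, 8192, 14, 14])
# The pretrained dVAE decoder would then upsample this 8x:
# x_rec = decoder(z)   # (1, 3, 112, 112)
```

Note the decoder upsamples by 8, so a 14x14 token grid decodes to a 112x112 image, not 224x224 — BEiT tokenizes a 112x112 view of the image during pre-training.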

NielsRogge commented 3 years ago

Hi, I implemented BeitForMaskedImageModeling. It returns the exact same logits as the original implementation for the same pixel_values and bool_masked_pos. However, when testing it out on an image, none of the visual tokens it predicts correspond to the ground-truth visual tokens from DALL-E's tokenizer.

@donglixp would be great if you can take a look at my notebook: https://colab.research.google.com/drive/1Mjt-3jHw9HYMXECmSdDlbiG59ZAw-Z0T?usp=sharing

donglixp commented 3 years ago

@addf400 could you take a look at the above notebook to double-check?

vvvm23 commented 3 years ago

Is the OpenAI public decoder (https://cdn.openai.com/dall-e/decoder.pkl) perhaps slightly different from the one used in this work? I am having the same issue: the reconstructed outputs from BeitForMaskedImageModeling are much lower quality than the input, even when no masking is applied. However, some similarity is there. I'd be interested to know if there has been any progress on this.

donglixp commented 2 years ago

> Is the OpenAI public decoder (https://cdn.openai.com/dall-e/decoder.pkl) perhaps slightly different to the one used in this work? I am having the same issue where the reconstructed outputs from BeitForMaskedImageModeling are much lower quality than the input, even when there is no masking applied. However, some similarity is there. I'd be interested to know if there has been any progress with this ~

@vvvm23 Yes, we used this one.

# Download the tokenizer weights from OpenAI's DALL-E
TOKENIZER_PATH=/path/to/save/dall_e_tokenizer_weight
mkdir -p $TOKENIZER_PATH
# Note: wget's lowercase -o writes a log file; uppercase -O saves the download
wget -O $TOKENIZER_PATH/encoder.pkl https://cdn.openai.com/dall-e/encoder.pkl
wget -O $TOKENIZER_PATH/decoder.pkl https://cdn.openai.com/dall-e/decoder.pkl

LiweiPeng commented 2 years ago

@NielsRogge, @addf400 I ran BEiT model inference the same way as in NielsRogge's notebook. I ran his notebook code multiple times, each time with a different bool_masked_pos. Among the 75 masked positions, I got 0, 1, or 2 correct predictions per run. So NielsRogge's observation of 0 correct predictions is just one random draw; across runs it can be 1 or 2.

However, 0/1/2 correct out of 75 is still quite low. Is this expected?

eliahuhorwitz commented 2 years ago

@LiweiPeng @NielsRogge @addf400 Have any of you been able to figure this out? I am getting similar results: the predictions for the masked areas are wrong, and as a result the reconstructed (decoded) image has wrong, random values in the masked areas.

LiweiPeng commented 2 years ago

> @LiweiPeng @NielsRogge @addf400 Have any of you been able to figure this out? I am getting similar results, the predictions for the masked areas are wrong and as a result the reconstructed (decoded) image has wrong and random values in the masked areas.

Based on my tests, my 'bad' results are expected for the BEiT model. I used linear probing in my tests, and linear probing with the BEiT base model does not work well. Table 9 of the reviewed BEiT paper at https://openreview.net/pdf?id=p-BhZSz59o4 has detailed results.

eliahuhorwitz commented 2 years ago

> @LiweiPeng @NielsRogge @addf400 Have any of you been able to figure this out? I am getting similar results, the predictions for the masked areas are wrong and as a result the reconstructed (decoded) image has wrong and random values in the masked areas.

> Based on my tests, my 'bad' results were expected for BeiT model. I used linear probing when I did my tests. Linear probing using BeiT base model doesn't work well. The reviewed BeiT paper on https://openreview.net/pdf?id=p-BhZSz59o4 Table 9 has some detailed results.

I'm not sure I understand: linear probing relates to the final class prediction, while the wrong predictions I am seeing are of the visual tokens. I would expect reconstruction of the correct tokens to work even with the pretrained features, since no class-specific information is needed for this.

LiweiPeng commented 2 years ago

Sorry, I wasn't clear in my previous post. In my case, I am using linear probing for the final class prediction, and the linear-probing result is not good compared to some other models like DINO. For your specific problem, I am not sure.