shabie / docformer

Implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU)
MIT License

Weird output #31

Closed: kmr2017 closed this issue 2 years ago

kmr2017 commented 2 years ago

Hi, I ran the code and the final output is very weird, regardless of which image I use. I am attaching it. Can you explain what it is?

[attached image: plot of the model output]

Thanks

uakarsh commented 2 years ago

Sorry for the delay, but could you let me know from which layer you extracted the output?

Regards,

kmr2017 commented 2 years ago

Hi @uakarsh

Thanks for your response.

I tried the code below:

```python
from transformers import BertTokenizerFast

# dataset and modeling are this repo's modules (src/docformer/); the exact
# import path depends on how the repo is installed
from docformer import dataset, modeling

config = {
    "coordinate_size": 96,
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "image_feature_pool_shape": [7, 7, 256],
    "intermediate_ff_size_factor": 4,
    "max_2d_position_embeddings": 1000,
    "max_position_embeddings": 512,
    "max_relative_positions": 8,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "pad_token_id": 0,
    "shape_size": 96,
    "vocab_size": 30522,
    "layer_norm_eps": 1e-12,
}

fp = "img.jpeg"

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoding = dataset.create_features(fp, tokenizer, add_batch_dim=True)

feature_extractor = modeling.ExtractFeatures(config)
docformer = modeling.DocFormerEncoder(config)
v_bar, t_bar, v_bar_s, t_bar_s = feature_extractor(encoding)
output = docformer(v_bar, t_bar, v_bar_s, t_bar_s)  # shape (1, 512, 768)
```

and then I visualized the output.
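A minimal sketch of one way to plot that tensor as a heatmap (assuming matplotlib is available; `output` is the (1, 512, 768) tensor from the snippet above):

```python
import matplotlib.pyplot as plt

# output: (1, 512, 768) -> drop the batch dim and move to NumPy
heatmap = output.squeeze(0).detach().cpu().numpy()  # (512, 768)

plt.imshow(heatmap, aspect="auto", cmap="viridis")
plt.xlabel("hidden dimension")
plt.ylabel("token position")
plt.colorbar()
plt.show()
```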

uakarsh commented 2 years ago

Hi,

We know the output has shape (512, 768); this output results from attention over three different entities:

  1. Image features, of shape (512, 768)
  2. Language features, of shape (512, 768)
  3. Spatial features, of shape (512, 768)

Now, when we perform a downstream task, we work with an encoded version of these three modalities; the diagram you plotted is that encoding, which the model attends to when performing the downstream task.

The same can be seen on page 15, Figure 11(b) of the DocFormer paper. Hope it helps.
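As an illustrative sketch (not code from this repo), a downstream task would put a small head on top of that fused encoding. The mean pooling and `num_classes=16` (RVL-CDIP's class count) here are assumptions for illustration:

```python
import torch.nn as nn

class DocClassificationHead(nn.Module):
    """Hypothetical head: pool the fused (batch, 512, 768) encoding, then classify."""
    def __init__(self, hidden_size=768, num_classes=16):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, encoder_output):
        pooled = encoder_output.mean(dim=1)  # (batch, 768): average over the 512 tokens
        return self.classifier(pooled)       # (batch, num_classes)

# usage with the encoder output from the snippet above:
# logits = DocClassificationHead()(output)  # (1, num_classes)
```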

kmr2017 commented 2 years ago

Thanks for the info. How can I do entity-level classification, as in the FUNSD dataset?

kmr2017 commented 2 years ago

@uakarsh

uakarsh commented 2 years ago

I have almost finished the training script for RVL-CDIP (document classification) and have started working on FUNSD for token classification.

You can visit my cloned repo (https://github.com/uakarsh/docformer/tree/master/examples/docformer_pl); in examples/docformer_pl you can find:

  1. Data visualization
  2. Dataset creation
  3. MLM with PyTorch Lightning
  4. Document classification with DocFormer (to be uploaded soon)

Next will be NER with FUNSD; a rough sketch of the idea follows below.
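As a hypothetical sketch of what token classification on FUNSD could look like (the label count and loss masking are assumptions, not the final script), reusing `output` of shape (1, 512, 768) from earlier in the thread:

```python
import torch.nn as nn

# FUNSD is commonly tagged with BIO labels over {header, question, answer}
# plus "other", giving 7 labels; adjust NUM_LABELS to your labeling scheme.
NUM_LABELS = 7

token_classifier = nn.Linear(768, NUM_LABELS)

# per-token logits: (1, 512, 768) -> (1, 512, NUM_LABELS)
logits = token_classifier(output)

# with gold labels of shape (1, 512), a standard per-token loss would be:
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # -100 masks padding/special tokens
# loss = loss_fn(logits.view(-1, NUM_LABELS), labels.view(-1))
```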

Will update you soon!!

BakingBrains commented 2 years ago

@uakarsh Hello,

Any update on NER with FUNSD using docformer?