kmr2017 closed this issue 2 years ago.
Sorry for the delay. Could you let me know which layer you extracted the output from?
Regards,
Hi @uakarsh
Thanks for your response.
I tried the code below:

```python
from transformers import BertTokenizerFast

# Import path assumed for the DocFormer repo; adjust it to match your checkout.
from docformer import dataset, modeling

config = {
    "coordinate_size": 96,
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "image_feature_pool_shape": [7, 7, 256],
    "intermediate_ff_size_factor": 4,
    "max_2d_position_embeddings": 1000,
    "max_position_embeddings": 512,
    "max_relative_positions": 8,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "pad_token_id": 0,
    "shape_size": 96,
    "vocab_size": 30522,
    "layer_norm_eps": 1e-12,
}

fp = "img.jpeg"

# Build the DocFormer input features for a single document image
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoding = dataset.create_features(fp, tokenizer, add_batch_dim=True)

# Extract the visual/text features (and their spatial counterparts), then encode
feature_extractor = modeling.ExtractFeatures(config)
docformer = modeling.DocFormerEncoder(config)
v_bar, t_bar, v_bar_s, t_bar_s = feature_extractor(encoding)
output = docformer(v_bar, t_bar, v_bar_s, t_bar_s)  # shape (1, 512, 768)
```

Then I visualized the output.
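The thread does not show how the output was plotted. One common way to inspect an encoder output is the cosine similarity between every pair of token vectors, which makes it easy to spot collapsed (near-identical) tokens. Below is a minimal NumPy sketch, assuming the tensor is first converted with `output.detach().cpu().numpy()`; the function name `token_similarity` is hypothetical, and random data stands in for a real output.

```python
import numpy as np


def token_similarity(output: np.ndarray) -> np.ndarray:
    """Cosine similarity between every pair of token vectors.

    `output` is a (1, seq_len, hidden) or (seq_len, hidden) array, e.g.
    `output.detach().cpu().numpy()` for the DocFormer tensor above.
    """
    if output.ndim == 3:
        output = output[0]  # drop the batch dimension -> (seq_len, hidden)
    norms = np.linalg.norm(output, axis=-1, keepdims=True)
    unit = output / np.clip(norms, 1e-12, None)  # avoid division by zero
    return unit @ unit.T  # (seq_len, seq_len), values in [-1, 1]


# Random data standing in for a real (1, 512, 768) encoder output
rng = np.random.default_rng(0)
fake_output = rng.standard_normal((1, 512, 768)).astype(np.float32)
sim = token_similarity(fake_output)
print(sim.shape)  # (512, 512)
```

The resulting matrix can be passed to any heatmap function (for example `matplotlib.pyplot.imshow`); a matrix that is close to 1 everywhere suggests the tokens carry nearly the same information.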
Hi,
Actually, we know that the output has shape (512, 768) per document (ignoring the batch dimension). This output results from attention over three different entities: the text, visual, and spatial features of the document.
Now, when we perform any downstream task, we have an encoded version of these three modalities, so what you have plotted would help the model know which encoding to attend to when performing the downstream task.
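To illustrate how a single (512, 768) output can come from attention over several modalities, here is a toy NumPy sketch of single-head scaled dot-product attention in which text, visual, and spatial features are simply summed before attention. This fusion-by-summation is our simplification for illustration only, not DocFormer's actual multi-modal self-attention, which is more involved; the function name, the 0.02 scaling, and the random weights are all hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def toy_multimodal_attention(text, visual, spatial, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over fused modality features.

    text, visual, spatial: (seq_len, hidden) arrays, one row per position.
    w_q, w_k, w_v: (hidden, hidden) projection matrices.
    Returns the (seq_len, hidden) output and (seq_len, seq_len) weights.
    """
    x = text + visual + spatial  # toy fusion: sum the three modalities
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
seq_len, hidden = 512, 768
text, visual, spatial = (rng.standard_normal((seq_len, hidden)) * 0.02 for _ in range(3))
w_q, w_k, w_v = (rng.standard_normal((hidden, hidden)) * 0.02 for _ in range(3))
out, weights = toy_multimodal_attention(text, visual, spatial, w_q, w_k, w_v)
print(out.shape)  # (512, 768)
```

Each output row is a weighted average of value vectors computed from all three modalities, so the output shape stays (seq_len, hidden) no matter how many modalities feed into it.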
The same can be seen on page 15, Figure 11(b), of the DocFormer paper. Hope it helps!
Thanks for the info. How can I do entity-level classification, as in the FUNSD dataset?
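For entity-level labeling on FUNSD, a common recipe (used, for example, in Hugging Face token-classification examples) is to give each word a BIO tag and then copy each word's label to its first subword token, masking the remaining subwords. A minimal pure-Python sketch follows, assuming word ids in the format returned by `word_ids()` on a fast tokenizer such as `BertTokenizerFast`; the function name is hypothetical, and whether DocFormer's `dataset.create_features` exposes these word ids is an assumption to check.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # the default ignore_index of PyTorch's CrossEntropyLoss


def align_word_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Map word-level label ids onto subword tokens.

    word_ids: for each token, the index of the word it came from, or None for
        special tokens and padding (as returned by `encoding.word_ids()`).
    word_labels: one label id per word.
    Only the first subword of each word keeps its label; the rest are ignored.
    """
    labels = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            labels.append(IGNORE_INDEX)  # [CLS], [SEP], [PAD]
        elif word_id != previous:
            labels.append(word_labels[word_id])  # first subword of a word
        else:
            labels.append(IGNORE_INDEX)  # continuation subword
        previous = word_id
    return labels


word_ids = [None, 0, 1, 1, 2, None]  # [CLS] w0 w1 w1 w2 [SEP]
print(align_word_labels(word_ids, [3, 4, 0]))  # [-100, 3, 4, -100, 0, -100]
```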
@uakarsh
I have almost finished the training script for RVL-CDIP (Document Classification), and have started working on FUNSD for token classification.
You can visit my cloned repo (https://github.com/uakarsh/docformer/tree/master/examples/docformer_pl); in examples/docformer_pl, you can get the
Will update you soon!
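Once a token-classification head predicts a BIO tag for each word, entity-level results come from grouping consecutive tags into spans. A minimal pure-Python sketch follows; the label list is the 7-tag BIO scheme commonly used for FUNSD (header, question, answer, other), which is an assumption to match against your own label map, and the function name is hypothetical.

```python
from typing import List, Tuple

# BIO label set commonly used for FUNSD (assumed; match your own label map)
FUNSD_LABELS = ["O", "B-HEADER", "I-HEADER", "B-QUESTION", "I-QUESTION", "B-ANSWER", "I-ANSWER"]


def bio_to_entities(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group word-level BIO tags into (entity_type, start, end) spans, end exclusive."""
    entities = []
    current_type, start = None, 0
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes the last entity
        prefix, _, entity_type = tag.partition("-")
        # An open entity ends on "O", on a new "B-", or on an "I-" of another type
        if current_type is not None and (prefix != "I" or entity_type != current_type):
            entities.append((current_type, start, i))
            current_type = None
        # "B-" starts an entity; a stray "I-" with no open entity also starts one
        if prefix == "B" or (prefix == "I" and current_type is None):
            current_type, start = entity_type, i
    return entities


pred_ids = [3, 4, 5, 6, 0]  # e.g. argmax of per-word logits
print(bio_to_entities([FUNSD_LABELS[i] for i in pred_ids]))
# [('QUESTION', 0, 2), ('ANSWER', 2, 4)]
```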
@uakarsh Hello,
Any update on NER with FUNSD using DocFormer?
Hi, I ran the code, and the final output looks very strange regardless of which image I use. I am attaching it. Can you explain what it shows?
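One thing that may be worth checking: in the snippet earlier in the thread, `modeling.ExtractFeatures(config)` and `modeling.DocFormerEncoder(config)` are built from a config dict and no checkpoint is loaded, so unless the library loads weights internally (an assumption to verify), the encoder is randomly initialized and its output is not expected to be meaningful. To measure how much the output actually changes between images, a minimal NumPy sketch compares the mean-pooled outputs of two runs; the function name is hypothetical and random arrays stand in for real outputs.

```python
import numpy as np


def pooled_cosine(output_a: np.ndarray, output_b: np.ndarray) -> float:
    """Cosine similarity between the mean-pooled token vectors of two outputs.

    Each input is a (1, seq_len, hidden) or (seq_len, hidden) array, e.g.
    `docformer(...).detach().cpu().numpy()` for two different images.
    """
    a = output_a.reshape(-1, output_a.shape[-1]).mean(axis=0)
    b = output_b.reshape(-1, output_b.shape[-1]).mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


rng = np.random.default_rng(0)
out_a = rng.standard_normal((1, 512, 768))
out_b = rng.standard_normal((1, 512, 768))
print(round(pooled_cosine(out_a, out_a), 4))  # 1.0
print(pooled_cosine(out_a, out_b))  # close to 0 for unrelated random outputs
```

A value very close to 1.0 for two genuinely different images would suggest the output barely depends on the input.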
Thanks