Saeed11b95 opened this issue 1 month ago
Hey 🤗 thanks for opening an issue! We try to keep the GitHub issues for bugs and feature requests. Could you ask your question on the forum instead? I'm sure the community will be of help!
I think you can ping @NielsRogge as he contributed this model! Thanks!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
https://github.com/huggingface/transformers/blob/0a7af19f4dc868bafc82f35eb7e8d13bac87a594/src/transformers/models/layoutlmv3/modeling_layoutlmv3.py#L1349
In the line linked above, the first token of the transformer output is used to compute the classification logits. I am confused because, in the source code, the learnable classification token is not at the zeroth index. The `cls_token` is concatenated at the start of the image patch tokens, `visual_tokens = torch.cat([cls_token, visual_patch_embeddings])`, and the transformer input is then built by concatenating the text+bbox inputs with the image tokens: `transformer_inp = torch.cat([text_embeddings, visual_tokens])`. This means the classification token ends up at index 512, given the 512-token limit on text inputs. This is just for clarification; using the first token for classification also does a fine job. Thanks
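The token ordering described above can be sketched with dummy tensors. This is not the actual `modeling_layoutlmv3.py` code, just an illustration under assumed sizes (512 text tokens, a 14×14 patch grid) of where the visual `cls_token` lands after the two concatenations:

```python
import torch

batch, hidden = 1, 8
seq_len_text = 512   # assumed limit on text + bbox token embeddings
num_patches = 196    # e.g. a 14x14 grid of image patches (assumption)

# Dummy embeddings: zeros for text and patches, ones for the visual [CLS]
# so we can tell it apart by its sum.
text_embeddings = torch.zeros(batch, seq_len_text, hidden)
visual_patch_embeddings = torch.zeros(batch, num_patches, hidden)
cls_token = torch.ones(batch, 1, hidden)  # learnable visual classification token

# The visual [CLS] is prepended to the patch tokens...
visual_tokens = torch.cat([cls_token, visual_patch_embeddings], dim=1)
# ...and the transformer input puts text tokens first, then visual tokens.
transformer_inp = torch.cat([text_embeddings, visual_tokens], dim=1)

# The visual [CLS] therefore sits at index seq_len_text (512), not index 0.
print(transformer_inp[0, 0].sum().item())             # 0.0 -> a text token
print(transformer_inp[0, seq_len_text].sum().item())  # 8.0 -> visual [CLS]
```

Note that `sequence_output[:, 0, :]` in the linked line picks up the first *text* token instead, which is why the question arises.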