philschmid / document-ai-transformers

MIT License

Issue with Tokenization and Classification of Images and Tables #7

Open Harsss opened 1 year ago

Harsss commented 1 year ago

I am currently fine-tuning the LiLT model on my dataset, which has labels for headings, subheadings, text, tables, table headings, images, and captions. During tokenization I ran into problems with images and tables, since they have no text of their own to tokenize. As a workaround, I assigned a random word to every table and image so that they could be tokenized. However, after training, the model does not classify any region as a table or an image.
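To make the workaround concrete, here is a minimal sketch of how such examples could be assembled before tokenization. It is not the code from this repository: the `PLACEHOLDERS` mapping, the `[TABLE]`/`[IMAGE]`/`[REGION]` strings, the region dictionary layout, and the `build_example` helper are all hypothetical. It assumes one pixel bounding box per region, reuses that box for every word in the region, and scales boxes to the 0-1000 range that LayoutLM-style models, including LiLT, expect.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels

# Hypothetical placeholder words: one fixed string per non-text label,
# instead of a different random word for every table or image.
PLACEHOLDERS: Dict[str, str] = {"table": "[TABLE]", "image": "[IMAGE]"}


def normalize_box(box: Box, width: float, height: float) -> List[int]:
    """Scale a pixel box to the 0-1000 range and clamp it to that range."""
    x0, y0, x1, y1 = box
    scaled = [
        1000 * x0 / width,
        1000 * y0 / height,
        1000 * x1 / width,
        1000 * y1 / height,
    ]
    return [max(0, min(1000, int(v))) for v in scaled]


def build_example(regions: List[dict], width: float, height: float) -> dict:
    """Turn labeled page regions into parallel word/box/label lists."""
    words, boxes, labels = [], [], []
    for region in regions:
        label = region["label"]
        box = normalize_box(region["box"], width, height)
        # Regions with text keep their own words; regions without text
        # (tables, images) fall back to a single placeholder word.
        region_words = region.get("text", "").split()
        if not region_words:
            region_words = [PLACEHOLDERS.get(label, "[REGION]")]
        for word in region_words:
            words.append(word)
            boxes.append(box)  # assumption: region-level box shared by its words
            labels.append(label)
    return {"words": words, "boxes": boxes, "labels": labels}


regions = [
    {"label": "heading", "text": "Quarterly Report", "box": (50, 40, 550, 80)},
    {"label": "table", "box": (50, 120, 550, 400)},
    {"label": "image", "box": (100, 450, 500, 780)},
]
example = build_example(regions, width=600, height=800)
print(example["words"])  # ['Quarterly', 'Report', '[TABLE]', '[IMAGE]']
```

The sketch uses a fixed placeholder per label rather than a random word, on the assumption that a consistent token may give the model a more stable text signal for those classes. Whether that is enough to make tables and images learnable is exactly the open question here.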

I am not sure whether I should switch to a tokenizer other than the LayoutLMv3 one, or whether there are other steps I can take to address this issue. I would also like to know if there are any other tokenizers that would be suitable for my dataset.