microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License
20.16k stars 2.55k forks

[TrOCR] Image aspect ratio #867

Open riteshKumarUMass opened 2 years ago

riteshKumarUMass commented 2 years ago

Hi, I have the following 3 questions and would be really grateful if anyone could provide some insights:

  1. While pretraining the model on the text lines extracted from the PDFs and on synthetic data, do you maintain the aspect ratio of the image when resizing it to 384x384? Using Hugging Face's TrOCR preprocessor, I noticed that it does not maintain the aspect ratio, so I would like to understand whether this affects the model's performance.
  2. Did a "textline" contain multiple words in a single image, or did you split the image further at the word level before feeding it to the model?
  3. Did you try training the model at the word level instead of the line level, and did you notice any difference?
riteshKumarUMass commented 2 years ago

Could someone respond to this?

henryle97 commented 2 years ago

  1. They use the 384x384 setting for both printed (word-level) and handwritten (line-level) text. I think they use square images to fit the DeiT model.
  2. A textline contains multiple words in a single image.
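
For anyone who wants to compare the two preprocessing options from question 1, here is a minimal sketch of the geometry involved. It is not taken from the TrOCR code or from the Hugging Face processor. The function names `letterbox_geometry` and `squash_distortion` are hypothetical, and padding the image to a square is only one possible alternative to a direct resize. Whether it helps accuracy is not answered in this thread.

```python
def letterbox_geometry(width, height, target=384):
    """Fit a width x height image inside a target x target square
    without changing its aspect ratio.

    Returns (new_w, new_h, pad_left, pad_top): the resized dimensions and
    the offsets at which to paste the resized image on a square canvas.
    Hypothetical helper, not part of TrOCR or transformers.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Scale so that the longer side exactly matches the target size.
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    # Center the resized image; the remaining area would be padding.
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    return new_w, new_h, pad_left, pad_top


def squash_distortion(width, height):
    """Factor by which the aspect ratio changes when the image is resized
    directly to a square (1.0 means no distortion)."""
    return width / height


if __name__ == "__main__":
    # A long, thin text line, e.g. 1200x60 pixels.
    print(letterbox_geometry(1200, 60))
    print(squash_distortion(1200, 60))
```

For a 1200x60 line, a direct resize to 384x384 changes the aspect ratio by a factor of 20, while the padded variant keeps the text undistorted but only about 19 pixels tall. That trade-off between distortion and effective resolution is what question 1 is asking about.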