microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

Provide a script to get OCR result for RVL-CDIP in layoutLM #216

Open yaoliUoA opened 4 years ago

yaoliUoA commented 4 years ago

I am using LayoutLM. Would you please provide the script used to prepare the OCR output (HTML format) for RVL-CDIP? The README mentions Tesseract, but it would be much more convenient if the script itself were provided.

Lambert-Shirzad commented 4 years ago

Yes. The trained model is not that useful, and the experiments reported in the paper are not that repeatable, if the pre-processing step for an input image has to be reverse engineered.

hazoth commented 3 years ago

I need the OCR script too, especially the parameters used to run Tesseract. If possible, the processed hOCR files would be even better. I cannot reproduce the results in the paper at the moment.

siatwangmin commented 3 years ago

Yes, it is hard to reproduce the results if we can't get the processed data. Can you release the data?

greeneggsandyaml commented 3 years ago

Yes, this would be very helpful for our research!

Yazooliu commented 3 years ago

Yes, I also need the OCR output (HTML format) or the script, in order to understand the inputs (such as layout info, bbox position info, etc.). Thanks.
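
For anyone trying to reconstruct the inputs in the meantime, here is a minimal sketch of how word text and bounding boxes could be read out of an hOCR file. It is not the authors' script, and the Tesseract parameters used for the paper are still unknown in this thread. It assumes the hOCR comes from something like `tesseract page.png page hocr` (which writes `page.hocr`), that words are marked with the `ocrx_word` class and the page with `ocr_page`, and that boxes should be scaled to a 0-1000 range. The names `HocrWordParser`, `normalize_bbox`, and `hocr_to_words` are made up for this example.

```python
import re
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from hOCR word spans (class 'ocrx_word')."""

    def __init__(self):
        super().__init__()
        self.words = []            # list of (text, (x0, y0, x1, y1)) in pixels
        self.page_bbox = None      # bbox of the first 'ocr_page' element seen
        self._current_bbox = None  # bbox of the word span being read, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        # hOCR stores geometry in the title attribute, e.g. "bbox 10 20 30 40; x_wconf 95"
        match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
        if match is None:
            return
        bbox = tuple(int(v) for v in match.groups())
        if "ocr_page" in classes and self.page_bbox is None:
            self.page_bbox = bbox
        elif "ocrx_word" in classes:
            self._current_bbox = bbox
            self._buffer = []

    def handle_data(self, data):
        if self._current_bbox is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Nested <strong>/<em> tags inside a word are ignored; only </span> closes a word.
        if tag == "span" and self._current_bbox is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.words.append((text, self._current_bbox))
            self._current_bbox = None


def normalize_bbox(bbox, page_width, page_height):
    """Scale a pixel bbox to a 0-1000 range (an assumed target format)."""
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]


def hocr_to_words(hocr_text):
    """Return a list of (word, normalized_bbox) for a single-page hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    if parser.page_bbox is None:
        raise ValueError("no ocr_page element with a bbox found")
    _, _, width, height = parser.page_bbox
    return [(text, normalize_bbox(bbox, width, height)) for text, bbox in parser.words]
```

Usage would be along the lines of `hocr_to_words(open("page.hocr", encoding="utf-8").read())`. Whether this matches the pre-processing behind the published numbers can only be confirmed by the maintainers.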