piegu / language-models

pre-trained Language Models

How did you create DocLayNet-small? #10

Open mit1280 opened 1 year ago

mit1280 commented 1 year ago

Hi @piegu,

Thank you for creating the DocLayNet datasets (small, base and large). They save a lot of time when fine-tuning a model for a downstream task.

I have a question about the bounding boxes. I checked your notebooks and found that the dataset contains two kinds of bounding boxes: line level and block level (paragraph). I trained a model using "bboxes_block", and it performs well. However, my input data has only line-level bounding boxes, so I am wondering how you created the DocLayNet dataset (the one on Hugging Face). My hunch is an OCR engine (pytesseract), but I would still like to hear it from you.
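To make the gap concrete, here is a minimal sketch of one way line-level boxes could be grouped into block-level boxes. It is only a guess at a heuristic, not how the dataset was actually built. The function name `merge_lines_into_blocks`, the `max_gap` threshold, and the `(x0, y0, x1, y1)` box format are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed box format: (x0, y0, x1, y1) in pixels, origin at top-left.
Box = Tuple[int, int, int, int]


def merge_lines_into_blocks(line_boxes: List[Box], max_gap: int = 5) -> List[Box]:
    """Hypothetical heuristic: merge vertically adjacent lines into one block.

    A line joins the current block when the vertical gap between the
    block's bottom edge and the line's top edge is at most `max_gap`.
    """
    blocks: List[Box] = []
    # Process lines from top to bottom.
    for x0, y0, x1, y1 in sorted(line_boxes, key=lambda b: b[1]):
        if blocks and y0 - blocks[-1][3] <= max_gap:
            bx0, by0, bx1, by1 = blocks[-1]
            # Grow the current block to the union of both rectangles.
            blocks[-1] = (min(bx0, x0), by0, max(bx1, x1), max(by1, y1))
        else:
            # Gap too large: start a new block.
            blocks.append((x0, y0, x1, y1))
    return blocks


lines = [(10, 10, 200, 20), (10, 22, 180, 32), (10, 60, 190, 70)]
blocks = merge_lines_into_blocks(lines)
print(blocks)
```

With these sample lines, the first two are merged into one block and the third stays separate. This ignores columns and horizontal layout, so it would only be a rough approximation on multi-column pages.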

Thanks in advance!