rossumai / docile

DocILE: Document Information Localization and Extraction Benchmark
https://docile.rossum.ai
MIT License

Baselines, faster data preprocessing #58

Closed by simsa-st 1 year ago

simsa-st commented 1 year ago

When tagging which OCR words overlap with which ground-truth fields, the original code iterates over every combination of OCR word and field. This PR makes it faster by considering only pairs whose y-coordinate ranges overlap. To make sure the results are unaffected, the candidate fields for each word are still processed in the original order.
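To illustrate the idea, here is a minimal sketch, not the actual code in this PR. It assumes boxes are `(left, top, right, bottom)` tuples; the function names and the `band_height` parameter are hypothetical. Fields are bucketed into horizontal bands, each word only looks at fields from the bands it touches, and candidates are sorted by their original index so the processing order matches the brute-force loop.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Assumed box layout for this sketch: (left, top, right, bottom).
BBox = Tuple[float, float, float, float]


def x_overlap(a: BBox, b: BBox) -> bool:
    return a[0] < b[2] and b[0] < a[2]


def y_overlap(a: BBox, b: BBox) -> bool:
    return a[1] < b[3] and b[1] < a[3]


def match_brute_force(words: List[BBox], fields: List[BBox]) -> List[Tuple[int, int]]:
    """Reference version: test every (word, field) combination."""
    return [
        (wi, fi)
        for wi, w in enumerate(words)
        for fi, f in enumerate(fields)
        if x_overlap(w, f) and y_overlap(w, f)
    ]


def match_with_y_prefilter(
    words: List[BBox], fields: List[BBox], band_height: float = 0.05
) -> List[Tuple[int, int]]:
    """Only test fields that share a horizontal band with the word."""
    # Register each field index in every band its y-range touches.
    bands: Dict[int, List[int]] = defaultdict(list)
    for fi, f in enumerate(fields):
        for band in range(int(f[1] // band_height), int(f[3] // band_height) + 1):
            bands[band].append(fi)

    pairs: List[Tuple[int, int]] = []
    for wi, w in enumerate(words):
        candidates: Set[int] = set()
        for band in range(int(w[1] // band_height), int(w[3] // band_height) + 1):
            candidates.update(bands.get(band, ()))
        # Sorting restores the original field order, so the output is identical
        # to the brute-force loop, just with fewer pairs examined.
        for fi in sorted(candidates):
            if y_overlap(w, fields[fi]) and x_overlap(w, fields[fi]):
                pairs.append((wi, fi))
    return pairs
```

The full overlap test is still applied to every candidate, so the band lookup only removes pairs that could never match; it does not change which pairs are reported.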

For the synthetic dataset, this phase sped up from 8h to 2.5h. I checked on the synthetic and val datasets that the stored preprocessed dataset is exactly the same (the JSONs have 0 diff).
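A check of this kind can be sketched as follows. This is an assumption about how one might do it, not the procedure actually used: the function name and the directory layout (two folders of preprocessed `.json` files) are hypothetical, and it compares parsed JSON content rather than raw text.

```python
import json
from pathlib import Path
from typing import List


def differing_json_files(dir_a: Path, dir_b: Path) -> List[str]:
    """Return relative paths of JSON files that are missing on one side or differ."""
    names_a = {p.relative_to(dir_a).as_posix() for p in dir_a.rglob("*.json")}
    names_b = {p.relative_to(dir_b).as_posix() for p in dir_b.rglob("*.json")}

    # Files present in only one of the two directories count as differences.
    different = sorted(names_a ^ names_b)

    for name in sorted(names_a & names_b):
        with open(dir_a / name, encoding="utf-8") as fa, open(dir_b / name, encoding="utf-8") as fb:
            if json.load(fa) != json.load(fb):
                different.append(name)
    return sorted(different)
```

An empty list means the two preprocessing runs produced the same set of files with the same content.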

This PR also turns off evaluation on the training dataset for synthetic pretraining, since it was running into GPU OOM issues.