When tagging which OCR words overlap with which ground truth fields, the original code iterates over every combination of OCR word and field. This PR speeds that up by only considering pairs whose y-coordinate ranges overlap. To guarantee the output is unchanged, the candidate fields for each word are still processed in their original order.
On the synthetic dataset, this phase went from 8h to 2.5h. I verified on the synthetic and val datasets that the stored preprocessed dataset is identical (the JSONs have 0 diff).
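For reviewers, here is a minimal sketch of the y-overlap filtering idea, not the code in this PR. The names `Box`, `boxes_overlap`, `match_all_pairs`, and `match_with_y_filter` are hypothetical, and the sketch assumes axis-aligned boxes in image coordinates (`y0` is the top edge, `y1` the bottom edge, `y0 < y1`). It prunes fields that start below the word with a binary search, drops the remaining non-overlapping ones with a cheap y check, and then visits the survivors in their original index order so the result matches the all-pairs baseline.

```python
from bisect import bisect_left
from typing import NamedTuple


class Box(NamedTuple):
    # Hypothetical box type: (x0, y0) is the top-left corner, (x1, y1) the bottom-right.
    x0: float
    y0: float
    x1: float
    y1: float


def boxes_overlap(a: Box, b: Box) -> bool:
    """True if the two axis-aligned boxes share a region of positive area."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def match_all_pairs(words, fields):
    """Baseline: test every (word, field) combination."""
    return [
        (w, f)
        for w, word in enumerate(words)
        for f, field in enumerate(fields)
        if boxes_overlap(word, field)
    ]


def match_with_y_filter(words, fields):
    """Same result as match_all_pairs, but skips fields with no y-overlap."""
    # Field indices sorted by top edge, so a binary search can discard
    # every field that starts at or below the word's bottom edge.
    by_top = sorted(range(len(fields)), key=lambda f: fields[f].y0)
    tops = [fields[f].y0 for f in by_top]

    matches = []
    for w, word in enumerate(words):
        cutoff = bisect_left(tops, word.y1)  # fields with y0 < word.y1
        # Of those, keep only fields that end below the word's top edge.
        candidates = [f for f in by_top[:cutoff] if fields[f].y1 > word.y0]
        # Restore the original field order before the full overlap test.
        for f in sorted(candidates):
            if boxes_overlap(word, fields[f]):
                matches.append((w, f))
    return matches
```

Comparing the two functions on the same inputs is a quick way to check that filtering changes nothing, in the same spirit as the JSON diff above. Note that this sketch still scans a prefix of the sorted fields for each word; an interval index or row bucketing would prune more aggressively.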
This PR also turns off evaluation on the training dataset for synthetic pretraining, since it was running into GPU OOM issues.