google-research-datasets / hiertext

The HierText dataset contains ~12k images from the Open Images dataset v6 with a large number of text entities. We provide word-, line-, and paragraph-level annotations.
Creative Commons Attribution Share Alike 4.0 International

evaluation code is extremely slow #12

Closed HumanZhong closed 1 year ago

HumanZhong commented 1 year ago

Hi,

I'm using the provided eval.py for evaluation, but it is very slow. Running just the line-level evaluation can take 30+ minutes.

Since I'm not familiar with Apache Beam, I'm not sure whether this is normal. I tried increasing the num_workers param to 10, but it did not help.

Can you give me some advice on how to accelerate the evaluation process?

btw, these three parts take most of the time: 1. pipeline creation 2. 'Read' 3. 'Eval'


Jyouhou commented 1 year ago

It can indeed take a long time, because the dataset is very dense and lines are rendered as high-resolution masks.

As noted in the script, you can use its mask stride flag to downsample the masks and save time. This greatly reduces the time needed, at the cost of slightly less accurate evaluation metrics. You can try different mask strides yourself to see how the metrics are affected, and pick the one that best speeds up your development.
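
To illustrate the trade-off, here is a minimal standalone sketch. It is not the code from eval.py: the helper names (`rasterize_box`, `mask_iou`), the `stride` parameter, and the box coordinates are all made up for illustration, and real HierText lines are polygons rather than axis-aligned boxes. It only shows why sampling a mask every `stride` pixels cuts the work by roughly `stride**2` while shifting the IoU a little.

```python
def rasterize_box(box, height, width, stride=1):
    """Render an axis-aligned box (x0, y0, x1, y1) as a binary mask,
    sampling the image grid every `stride` pixels (hypothetical helper)."""
    x0, y0, x1, y1 = box
    return [
        [x0 <= x < x1 and y0 <= y < y1 for x in range(0, width, stride)]
        for y in range(0, height, stride)
    ]


def mask_iou(mask_a, mask_b):
    """Intersection-over-union of two binary masks of the same shape."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for pa, pb in zip(row_a, row_b):
            inter += pa and pb
            union += pa or pb
    return inter / union if union else 0.0


# Made-up ground-truth and predicted "lines" on a 100 x 300 image.
HEIGHT, WIDTH = 100, 300
gt_box = (10, 20, 210, 60)
pred_box = (14, 23, 205, 62)

for stride in (1, 2, 4):
    gt_mask = rasterize_box(gt_box, HEIGHT, WIDTH, stride)
    pred_mask = rasterize_box(pred_box, HEIGHT, WIDTH, stride)
    cells = len(gt_mask) * len(gt_mask[0])
    print(f"stride={stride}: {cells} cells, IoU={mask_iou(gt_mask, pred_mask):.4f}")
```

The number of mask cells drops quadratically with the stride, while the IoU drifts only slightly from the full-resolution value in this toy case. How much the actual metrics move on your predictions is something to measure with the real script.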