knaw-huc / loghi

MIT License
97 stars · 13 forks

Performance estimates - how much time per scan? #9

Closed Simon-Dirks closed 11 months ago

Simon-Dirks commented 11 months ago

Hi there,

I'm running on old hardware to test and evaluate the pipeline. With my dated system (Windows 10 with WSL2, GTX970, Ryzen 5 3600, 16GB DDR4 RAM, M.2 Samsung 970 EVO Plus), I'm getting the following performance on a single run with 40 scans.

| phase | time_in_ms | time_in_secs | time_in_mins |
|---|---|---|---|
| laypa_baseline_detection | 25469 | 25.47 | 0.42 |
| loghi_htr | 457460 | 457.46 | 7.62 |
| detecting_language | 4007 | 4.01 | 0.07 |
| MinionSplitPageXMLTextLineIntoWords | 1998 | 2.00 | 0.03 |

The entire script for 40 scans takes 8.15 minutes for me (489014ms).

I was wondering what performance increase I can (more or less) expect when upgrading to modern hardware (e.g., RTX2070 or better).

If anyone in the community (or the devs) would like to share their machine's performance metrics, that would be a great help! Even quick ballpark estimates would be valuable.

stefanklut commented 11 months ago

It seems to me that loghi-htr is running much slower than expected; could you verify that it is running on the GPU? I believe we had issues with WSL and GPU. Maybe you could try `docker pull loghi/docker.htr-wsl:1.2` and change the Docker image used in the pipeline to this experimental WSL version. We are currently developing on Ubuntu, so we haven't actively run on WSL.

Simon-Dirks commented 11 months ago

> It seems to me that loghi-htr is running much slower than expected; could you verify that it is running on the GPU? I believe we had issues with WSL and GPU. Maybe you could try `docker pull loghi/docker.htr-wsl:1.2` and change the Docker image used in the pipeline to this experimental WSL version. We are currently developing on Ubuntu, so we haven't actively run on WSL.

I'm pretty sure it was running on the GPU! I logged GPU load, which reached 99% at times, and I saw some CUDA logging as well. Couldn't it just be the dated hardware? The GTX 970 was released in 2014, after all...

I'll have a closer look sometime next week. What ballpark performance should I be expecting with the GPU?

rvankoert commented 11 months ago

Hi Simon,

Besides the WSL issues, some extra info:

Performance depends a bit on the input. Scans with lots of textlines take more time, and longer textlines are slower to process than short ones.

There is a parameter you can add to speed up the actual HTR part: `--greedy`. Look in na-pipeline.sh for the line that contains `--beamwidth 10` and add `--greedy` there, so it looks like: `--greedy --beamwidth 10`

Alternatively, you can lower the beamwidth. Lowering it makes things slightly less accurate but much faster. The `--greedy` parameter is effectively the same as `--beamwidth 1`, but with some extra speedups.
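To illustrate why greedy decoding is so much cheaper than beam search: a greedy CTC decoder just takes the argmax at every timestep, collapses repeats, and drops blanks in a single pass, while a beam search with `--beamwidth N` keeps the N most probable prefixes alive at every timestep. A minimal sketch (not Loghi's actual decoder, just the general CTC technique):

```python
# Minimal greedy CTC decoding sketch (illustrative, not Loghi's code).
# logits: one list of per-label scores per timestep; blank: CTC blank index.
def ctc_greedy_decode(logits, blank=0):
    # 1) argmax per timestep -- this single pass is all greedy decoding costs
    best = [max(range(len(step)), key=step.__getitem__) for step in logits]
    out, prev = [], None
    # 2) collapse repeated labels, 3) drop the blank symbol
    for label in best:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# Beam search instead keeps the N best label prefixes per timestep, so its
# cost grows roughly linearly with the beam width -- hence the speedup from
# lowering --beamwidth or switching to --greedy.
```

For example, four timesteps whose argmaxes are `1, 1, 0, 2` (with blank = 0) decode to `[1, 2]`.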

I can process about 50,000 scans of single-page 18th-century material per day on a high-end laptop (i9, 64 GB RAM, 3080 Ti mobile GPU).

Simon-Dirks commented 11 months ago

> Hi Simon,
>
> Besides the WSL issues, some extra info:
>
> Performance depends a bit on the input. Scans with lots of textlines take more time, and longer textlines are slower to process than short ones.
>
> There is a parameter you can add to speed up the actual HTR part: `--greedy`. Look in na-pipeline.sh for the line that contains `--beamwidth 10` and add `--greedy` there, so it looks like: `--greedy --beamwidth 10`
>
> Alternatively, you can lower the beamwidth. Lowering it makes things slightly less accurate but much faster. The `--greedy` parameter is effectively the same as `--beamwidth 1`, but with some extra speedups.
>
> I can process about 50,000 scans of single-page 18th-century material per day on a high-end laptop (i9, 64 GB RAM, 3080 Ti mobile GPU).

Thanks a lot for these additional pointers! I just did a run on Ubuntu on the exact same machine/hardware. The performance differences do not seem to be major, especially considering that I only did a single run. The slower-than-expected performance is probably simply hardware-related. Overview in the table below:

| phase | ubuntu_time_in_secs | windows_time_in_secs | ubuntu_time_in_mins | windows_time_in_mins |
|---|---|---|---|---|
| laypa_baseline_detection | 22.32 | 25.47 | 0.37 | 0.42 |
| loghi_htr | 421.72 | 457.46 | 7.03 | 7.62 |
| detecting_language | 3.01 | 4.01 | 0.05 | 0.07 |
| MinionSplitPageXMLTextLineIntoWords | 1.43 | 2.00 | 0.02 | 0.03 |
| Script time total | 448.58 | 489.01 | 7.48 | 8.15 |

Also note that I receive the following warning, which may or may not be related:

```
Number of devices: 1
using mixed_float16
WARNING:tensorflow:Mixed precision compatibility check (mixed_float16): WARNING
Your GPU may run slowly with dtype policy mixed_float16 because it does not have compute capability of at least 7.0. Your GPU:
  NVIDIA GeForce GTX 970, compute capability 5.2
See https://developer.nvidia.com/cuda-gpus for a list of GPUs and their compute capabilities.
If you will use compatible GPU(s) not attached to this host, e.g. by running a multi-worker model, you can ignore this warning. This message will only be logged once
```

rvankoert commented 11 months ago

The mixed precision warning is definitely related. Please try to use the model that starts with float32. It should run faster on your hardware.
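For reference, in a standard TF 2.x Keras setup (which I assume the HTR code resembles; this is not Loghi's actual code) the dtype policy can also be forced to float32 before the model is built, which avoids the mixed-precision path that the warning says is slow on pre-7.0 compute capability GPUs:

```python
import tensorflow as tf

# Sketch under the assumption of a TF 2.x Keras model: on GPUs without
# Tensor Cores (compute capability < 7.0, e.g. the GTX 970), mixed_float16
# brings no speedup, so set the global policy to float32 before any layers
# are constructed.
tf.keras.mixed_precision.set_global_policy("float32")
print(tf.keras.mixed_precision.global_policy().name)
```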

Simon-Dirks commented 11 months ago

I tried running with the float32 model but get the same warning, unfortunately.