tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Huge discrepancy in fine-tuning results on macOS and Ubuntu 18.04 when compared to Ubuntu 20.04 #188

Closed abhishekthanki closed 3 years ago

abhishekthanki commented 4 years ago

Hello,

I'm fine-tuning Tesseract on a custom dataset. The fine-tuning itself works without any issues, but I have been unable to reproduce the results across systems. Using the same dataset and parameters, I get ~23% accuracy on Ubuntu 18.04 and macOS 10.15.6, but ~46% accuracy on Ubuntu 20.04. I'm wondering why there is such a huge difference in accuracy.

The following are the Tesseract version details for all three systems:

  1. macOS 10.15.6:
tesseract 4.1.1-rc2-25-g9707
 leptonica-1.80.0
  libgif 5.2.1 : libjpeg 9d : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
  2. Ubuntu 18.04:
tesseract 4.1.1-rc2-25-g9707
 leptonica-1.79.0
  libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
  3. Ubuntu 20.04:
tesseract 4.1.1-rc2-25-g9707
 leptonica-1.79.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found SSE

As you can see, the versions of the libraries Tesseract depends on differ quite a bit between the systems. Could that be the reason why the results are not reproducible?

Please note that the same instructions were followed on all systems (except for minor changes on macOS, for obvious reasons).
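To make the comparison between systems less error-prone than reading the reports by eye, the version output can be parsed and diffed. The following is a minimal sketch, assuming the report has the line layout shown in the post; `parse_version_report` and `diff_reports` are hypothetical helper names, not part of Tesseract or tesstrain.

```python
# Reports transcribed from the post; library names are spelled the way
# `tesseract --version` prints them.
UBUNTU_1804 = """\
tesseract 4.1.1-rc2-25-g9707
 leptonica-1.79.0
  libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
"""

UBUNTU_2004 = """\
tesseract 4.1.1-rc2-25-g9707
 leptonica-1.79.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found SSE
"""


def parse_version_report(text):
    """Parse a version report into a dict of component -> version string.

    Detected CPU features ("Found ...") are collected in the "simd" set.
    """
    info = {"simd": set()}
    for raw_line in text.strip().splitlines():
        line = raw_line.strip()
        if line.startswith("Found "):
            info["simd"].add(line[len("Found "):])
        elif line.startswith("tesseract "):
            info["tesseract"] = line.split(" ", 1)[1]
        elif line.startswith("leptonica-"):
            info["leptonica"] = line.split("-", 1)[1]
        elif " : " in line:
            # Image library line: "name version : name version : ..."
            for item in line.split(" : "):
                name, _, version = item.strip().partition(" ")
                info[name] = version
    return info


def diff_reports(a, b):
    """Return {component: (value_in_a, value_in_b)} for every difference."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


diff = diff_reports(parse_version_report(UBUNTU_1804),
                    parse_version_report(UBUNTU_2004))
for component, (old, new) in diff.items():
    print(f"{component}: {old} -> {new}")
```

For these two reports the diff lists the image libraries and the `simd` set, while `tesseract`, `leptonica`, and `zlib` are identical and therefore omitted. A diff like this only narrows down candidates; it does not show which difference, if any, affects training.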

sayakpaul commented 4 years ago

Following.

Shreeshrii commented 4 years ago

One difference I notice is that no FMA is found on Ubuntu 20.04, so the hardware is different.

You could try the latest code from the master branch on all machines to see if that makes a difference.

stweil commented 4 years ago

I also suggest repeating the test with the latest Tesseract. Reproducible training results are very important. Are the results the same when you repeat the training on the same machine?
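One way to answer the same-machine question is to run the training twice into separate output directories and check whether the resulting model files are byte-identical. The sketch below assumes this workflow; `file_digest` and `runs_identical` are hypothetical helpers, and the file paths in the usage note are placeholders, not paths that tesstrain is known to produce.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def runs_identical(paths):
    """True if every file in `paths` has the same content."""
    return len({file_digest(p) for p in paths}) == 1
```

A call such as `runs_identical(["run1/model.traineddata", "run2/model.traineddata"])` (placeholder paths) would then report whether two runs produced the same model. Byte-identical files are a strict criterion: if the files differ, comparing the evaluation error rates of both runs is the next thing to check.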

abhishekthanki commented 4 years ago

@Shreeshrii That's the case because Ubuntu 20.04 is being run on a VM using VirtualBox (which does not support FMA). Do you think that could be the root cause of this issue?

@Shreeshrii @stweil I tried with the latest version of Tesseract and I'm getting the same results as before.
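As background for the FMA question: a fused multiply-add computes `a*b + c` with a single rounding, whereas separate multiply and add instructions round twice, so the same arithmetic can yield slightly different floating-point results on machines with and without FMA. The sketch below illustrates only this numerical effect. It emulates the fused operation with exact rational arithmetic (`fused_multiply_add` is a hypothetical helper, not a Tesseract function) and does not establish that FMA explains the accuracy gap reported in this thread.

```python
from fractions import Fraction


def separate_multiply_add(a, b, c):
    """a*b is rounded to a double first, then the sum is rounded again."""
    return a * b + c


def fused_multiply_add(a, b, c):
    """Emulate FMA: compute a*b + c exactly, then round once to a double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

# The exact product is 1 - 2**-60, which rounds to 1.0 as a double,
# so the two-step version loses the small term entirely.
print(separate_multiply_add(a, b, c))  # 0.0
print(fused_multiply_add(a, b, c))     # equals -(2.0 ** -60)
```

Differences of this size are tiny per operation, but they can accumulate over the many dot products in an LSTM training run, which is why bit-identical results across different instruction sets are not guaranteed.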

wrznr commented 4 years ago

@abhishekthanki I fear that if you want a more concrete answer, you will have to provide at least some sample output and the command you used for training. It would be perfect if you could provide a (minimal) data set that reproduces the odd behavior.

wrznr commented 3 years ago

No further progress. Most likely a tesseract problem. Closing.