tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Weird coordinates under chi_tra_vert.traineddata #2681

Closed MORzyuan closed 5 years ago

MORzyuan commented 5 years ago

Hi, many thanks to all of you for this fantastic work! I am here to report some weird behavior with coordinates when chi_tra_vert.traineddata is used.

tesseract 4.1.0
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 9c : libpng 1.6.37 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.2 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found SSE
 Found libarchive 3.3.3 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.6

ProductName: Mac OS X
ProductVersion: 10.13.6
BuildVersion: 17G65

  1. tesseract with makebox sets every character's X coordinate and width to zero:
     tesseract [--oem 1] chi_tra_vert_test_1.jpg chi_tra_vert_1_test -l chi_tra_vert makebox
     Result: chi_tra_vert_test_1_result

  2. tesseract with lstmbox fails:
     tesseract [--oem 1] chi_tra_vert_test_2.jpg chi_tra_vert_2_test -l chi_tra_vert lstmbox
     Result: chi_tra_vert_test_2_result
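
A quick way to confirm the first symptom is to scan the makebox output for degenerate boxes. The following is only a minimal sketch: it assumes the usual box-file line layout (symbol, left, bottom, right, top, page), and the helper names and sample lines are hypothetical, not taken from the attached results.

```python
from typing import Iterable, List, NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    # Split from the right so that only the last five fields are read as numbers.
    symbol, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(symbol, int(left), int(bottom), int(right), int(top), int(page))


def degenerate_boxes(lines: Iterable[str]) -> List[Box]:
    # A box is treated as degenerate when its width or height is not positive.
    boxes = [parse_box_line(line) for line in lines if line.strip()]
    return [b for b in boxes if b.right - b.left <= 0 or b.top - b.bottom <= 0]


# Hypothetical sample lines, for illustration only.
sample = [
    "一 0 512 0 548 0",       # left == right == 0: zero width
    "二 120 470 156 506 0",   # a well-formed box
]
bad = degenerate_boxes(sample)
for box in bad:
    print(f"{box.symbol}: width={box.right - box.left}, height={box.top - box.bottom}")
```

Running the same check over a real .box file (read with encoding="utf-8") would show whether every character is affected or only some of them.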

And here are my questions:

  1. Why did I get wrong coordinates?
  2. Why are the recognized characters right while their coordinates are wrong?
  3. Unrelated to the weird cases above: I noticed that vertical Chinese text is only supported by the 4.x versions, and that the 4.x versions only have line-level bounding boxes as their labeled data. How can Tesseract recognize the single characters in a line?
  4. I noticed that there is no GPU training method, and it is a little daunting to train an LSTM-based neural network on CPUs. Any experience (dataset size, training time, etc.) would really help!

Best!

amitdo commented 5 years ago

This issue should be reposted to https://github.com/tesseract-ocr/tesseract/issues.

stweil commented 5 years ago

@amitdo, reposting is not necessary. I transferred it now.

amitdo commented 5 years ago

Thanks!

amitdo commented 5 years ago

Now I see that the OP already posted this issue with a different title here: #2679

stweil commented 5 years ago

So let's close the current one as a duplicate.