tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Of two inverted top right texts one gets scanned double, the upper one disappears #3871

Open rmast opened 2 years ago

rmast commented 2 years ago

I'm investigating further an issue I spotted earlier in https://github.com/tesseract-ocr/tesseract/pull/3141.

In this picture, above the text 'wis-clear' on the right, there is the text 'print'. The text 'print' disappears completely, and the text 'wis-clear' is read twice.


Environment

Current Behavior:

Some inverted text at the top right disappears, while other text is scanned twice.

There are two similar bounding boxes involved:

  1. Processing word with lang Latin at:Bounding box=(2149,3103)->(2396,3144)
  2. Processing word with lang Latin at:Bounding box=(2194,3114)->(2336,3137)
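
To make the relationship between the two boxes listed above concrete, here is a minimal Python sketch that checks whether one box lies inside the other. It assumes the log prints each box as (x_min, y_min)->(x_max, y_max); the `Box`, `intersect`, and `contains` names are hypothetical helpers for illustration and are not part of Tesseract.

```python
from typing import NamedTuple, Optional


class Box(NamedTuple):
    """Axis-aligned box; assumed to be (x_min, y_min) -> (x_max, y_max)."""
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def area(self) -> int:
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)


def intersect(a: Box, b: Box) -> Optional[Box]:
    """Return the overlapping region of a and b, or None if they are disjoint."""
    box = Box(max(a.x_min, b.x_min), max(a.y_min, b.y_min),
              min(a.x_max, b.x_max), min(a.y_max, b.y_max))
    if box.x_min >= box.x_max or box.y_min >= box.y_max:
        return None
    return box


def contains(outer: Box, inner: Box) -> bool:
    """True when inner lies entirely within outer."""
    return intersect(outer, inner) == inner


# Coordinates copied from the two "Processing word" lines above.
first = Box(2149, 3103, 2396, 3144)
second = Box(2194, 3114, 2336, 3137)

print(contains(first, second))            # True
print(second.area() / first.area())       # fraction of the first box covered
```

Under that coordinate assumption the second box is fully contained in the first, which would be consistent with the same 'wis-clear' pixels being recognized twice.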

Command used to produce the debug output below:

tesseract -c invert_threshold=0.9 --dpi 300 -l Latin -c textord_debug_tabfind=1 -c textord_debug_bugs=1 -c tessedit_reject_block_percent=1 -c tessedit_reject_row_percent=1 -c debug_noise_removal=1 -c textord_debug_block=28 -c textord_tabfind_show_partitions=1 -c textord_tabfind_show_initial_partitions=1 -c textord_tabfind_show_reject_blobs=1 -c textord_tabfind_show_blocks=1 -c textord_show_final_rows=1 -c textord_show_final_blobs=1 -c textord_show_initial_rows=1 -c textord_debug_blob=1 -c textord_oldbl_debug=1 -c textord_debug_baselines=1 -c textord_show_tables=1 -c textord_test_mode=1 -c classify_debug_level=1 -c dawg_debug_level=1 -c wordrec_debug_level=1 -c segsearch_debug_level=1 -c wordrec_display_segmentations=1 -c bidi_debug=1 -c debug_noise_removal=1 -c paragraph_debug_level=1 175789293-f39ddfdb-6f3e-4598-8d16-80a1f4a88b36.jpg output89

Processing word with lang Latin at:Bounding box=(2149,3103)->(2396,3144)
Trying word using lang Latin, oem 1
Inverting image: old min=0.19685, mean=0.19685, sd=0, inv 0.228346,0.826547,0.197407
<null>=301 On [0, 2), scores= 98.8(I=6=0.102) 99.2(|=61=0.26), Mean=98.9874, max=99.209
|=61 On [2, 7), scores= 19.8(<null>=301=50.9) 19.8([=55=22.9) 0.0475(<null>=301=79.1) 5.46e-06(<null>=301=95.3) 1.85e-07(<null>=301=98.3), Mean=7.9283, max=19.8422
>=132 On [7, 12), scores= 88.4(<null>=301=7.39) 98.7()=8=0.725) 39.6(<null>=301=60.4) 0.00681(<null>=301=100) 2.44e-05(<null>=301=99.8), Mean=45.3386, max=98.7152
 =0 On [12, 17), scores= 71.6(<null>=301=28.3) 99.6(<null>=301=0.327) 85(<null>=301=14.9) 0.619(<null>=301=99.4) 0.000118(<null>=301=99.8), Mean=51.3589, max=99.5623
w=89 On [17, 22), scores= 80.5(<null>=301=19.5) 100(<null>=301=0.0252) 87.4(<null>=301=12.6) 0.0225(<null>=301=100) 1.3e-06(<null>=301=99.9), Mean=53.5728, max=99.9746
i=78 On [22, 26), scores= 75.7(<null>=301=24.3) 100(í=153=0.0154) 51.1(<null>=301=48.9) 0.000347(<null>=301=100), Mean=56.6989, max=99.9755
s=74 On [26, 33), scores= 6.01(<null>=301=94) 98.4(<null>=301=1.64) 99.9(<null>=301=0.0488) 52.9(<null>=301=47.1) 0.000503(<null>=301=100) 0.000516(<null>=301=98.2) 0.000678(<null>=301=83.4), Mean=36.7451, max=99.9412
-=14 On [33, 40), scores= 22.1(<null>=301=70.7) 94.5(<null>=301=5.25) 99.5(<null>=301=0.327) 27.5(<null>=301=72.5) 0.0107(<null>=301=100) 0.00745(<null>=301=93.4) 0.0476(<null>=301=92.3), Mean=34.808, max=99.4556
c=86 On [40, 46), scores= 66.9(<null>=301=30.3) 96.5(<null>=301=2.68) 97.2(<null>=301=1.46) 10.9(<null>=301=89) 0.000122(<null>=301=100) 1.28e-05(<null>=301=91.2), Mean=45.2322, max=97.1531
l=84 On [46, 50), scores= 90.1(<null>=301=8.78) 98.7(I=6=0.467) 47.2(<null>=301=52.8) 0.0014(<null>=301=100), Mean=59.004, max=98.7325
e=73 On [50, 56), scores= 12.1(<null>=301=87.9) 96.4(<null>=301=3.57) 99.6(<null>=301=0.156) 45.2(<null>=301=54.8) 1.23e-06(<null>=301=100) 1.46e-06(<null>=301=99), Mean=42.2193, max=99.6454
a=82 On [56, 61), scores= 91.5(<null>=301=8.5) 99.9(<null>=301=0.0377) 61.3(<null>=301=38.7) 3.06e-06(<null>=301=100) 5.04e-08(<null>=301=95.5), Mean=50.5396, max=99.9351
r=68 On [61, 67), scores= 90.1(<null>=301=9.85) 99.9(<null>=301=0.0692) 89.4(<null>=301=10.6) 0.123(<null>=301=99.9) 0.00747(<null>=301=99.7) 0.0617(<null>=301=86.8), Mean=46.5988, max=99.8875
 =0 On [67, 75), scores= 52.2(<null>=301=40.9) 61.3(<null>=301=30.5) 37.8(<null>=301=44.4) 17.2(<null>=301=56.1) 4.55(<null>=301=60.1) 4.55(<null>=301=72.9) 3.99(<null>=301=86.8) 7.51(<null>=301=86.3), Mean=23.6451, max=61.3004
 =0 On [75, 79), scores= 28.3(<null>=301=65.7) 47(<null>=301=45.8) 20.1(<null>=301=77.5) 2.62(<null>=301=94.2), Mean=24.5262, max=47.0312
|=61 On [79, 83), scores= 65.5(<null>=301=33.8) 99.8(]=50=0.0972) 5.8(<null>=301=94.1) 1.23e-06(<null>=301=100), Mean=42.7894, max=99.8109
0 null_char score=-0.0974199, c=-0.0974199, perm=0, hash=0
1 null_char score=-0.190361, c=-0.0929412, perm=0, hash=0 prev:null_char score=-0.0974199, c=-0.0974199, perm=0, hash=0
2 label=61, uid=63=| [7c ] score=-0.620865, c=-0.430504, Start End perm=8, hash=3d prev:null_char score=-0.190361, c=-0.0929412, perm=0, hash=0
3 label=61, uid=63=| [7c ] score=-2.32779, c=-1.70693, perm=8, hash=3d prev:label=61, uid=63=| [7c ] score=-0.620865, c=-0.430504, Start End perm=8, hash=3d
4 null_char score=-2.64755, c=-0.31976, perm=0, hash=3d prev:label=61, uid=63=| [7c ] score=-2.32779, c=-1.70693, perm=8, hash=3d
5 null_char score=-2.78023, c=-0.132677, perm=0, hash=3d prev:null_char score=-2.64755, c=-0.31976, perm=0, hash=3d
6 null_char score=-2.88235, c=-0.102121, perm=0, hash=3d prev:null_char score=-2.78023, c=-0.132677, perm=0, hash=3d
7 label=132, uid=134=> [3e ] score=-3.0103, c=-0.12795, End perm=8, hash=487a prev:null_char score=-2.88235, c=-0.102121, perm=0, hash=3d
8 label=132, uid=134=> [3e ] score=-3.10823, c=-0.0979313, perm=8, hash=487a prev:label=132, uid=134=> [3e ] score=-3.0103, c=-0.12795, End perm=8, hash=487a
9 label=132, uid=134=> [3e ] score=-3.19386, c=-0.0856231, perm=8, hash=487a prev:label=132, uid=134=> [3e ] score=-3.10823, c=-0.0979313, perm=8, hash=487a
10 null_char score=-3.27893, c=-0.085074, perm=0, hash=487a prev:label=132, uid=134=> [3e ] score=-3.19386, c=-0.0856231, perm=8, hash=487a
11 null_char score=-3.36568, c=-0.0867499, perm=0, hash=487a prev:null_char score=-3.27893, c=-0.085074, perm=0, hash=487a
12 label=0, uid=0=  [20 ] score=-3.45184, c=-0.0861606, DawgStart perm=8, hash=557fec prev:null_char score=-3.36568, c=-0.0867499, perm=0, hash=487a
13 label=0, uid=0=  [20 ] score=-3.54123, c=-0.0893863, perm=8, hash=557fec prev:label=0, uid=0=  [20 ] score=-3.45184, c=-0.0861606, DawgStart perm=8, hash=557fec
14 label=0, uid=0=  [20 ] score=-3.62727, c=-0.0860443, perm=8, hash=557fec prev:label=0, uid=0=  [20 ] score=-3.54123, c=-0.0893863, perm=8, hash=557fec
15 null_char score=-3.71857, c=-0.0912952, perm=0, hash=557fec prev:label=0, uid=0=  [20 ] score=-3.62727, c=-0.0860443, perm=8, hash=557fec
16 null_char score=-3.80511, c=-0.0865476, perm=0, hash=557fec prev:null_char score=-3.71857, c=-0.0912952, perm=0, hash=557fec
17 label=89, uid=91=w [77 ]a score=-3.89015, c=-0.0850372, Start End perm=8, hash=64dce8c1 prev:null_char score=-3.80511, c=-0.0865476, perm=0, hash=557fec
18 label=89, uid=91=w [77 ]a score=-3.9754, c=-0.0852538, perm=8, hash=64dce8c1 prev:label=89, uid=91=w [77 ]a score=-3.89015, c=-0.0850372, Start End perm=8, hash=64dce8c1
19 label=89, uid=91=w [77 ]a score=-4.06044, c=-0.0850329, perm=8, hash=64dce8c1 prev:label=89, uid=91=w [77 ]a score=-3.9754, c=-0.0852538, perm=8, hash=64dce8c1
20 null_char score=-4.14566, c=-0.0852257, perm=0, hash=64dce8c1 prev:label=89, uid=91=w [77 ]a score=-4.06044, c=-0.0850329, perm=8, hash=64dce8c1
21 null_char score=-4.23171, c=-0.0860455, perm=0, hash=64dce8c1 prev:null_char score=-4.14566, c=-0.0852257, perm=0, hash=64dce8c1
22 label=78, uid=80=i [69 ]a score=-4.31672, c=-0.0850103, End perm=8, hash=76fc9a93fc prev:null_char score=-4.23171, c=-0.0860455, perm=0, hash=64dce8c1
23 label=78, uid=80=i [69 ]a score=-4.40196, c=-0.0852454, perm=8, hash=76fc9a93fc prev:label=78, uid=80=i [69 ]a score=-4.31672, c=-0.0850103, End perm=8, hash=76fc9a93fc
24 label=78, uid=80=i [69 ]a score=-4.48699, c=-0.0850235, perm=8, hash=76fc9a93fc prev:label=78, uid=80=i [69 ]a score=-4.40196, c=-0.0852454, perm=8, hash=76fc9a93fc
25 null_char score=-4.57199, c=-0.0850036, perm=0, hash=76fc9a93fc prev:label=78, uid=80=i [69 ]a score=-4.48699, c=-0.0850235, perm=8, hash=76fc9a93fc
26 label=74, uid=76=s [73 ]a score=-4.65701, c=-0.0850182, End perm=8, hash=8c5dfe5a9392 prev:null_char score=-4.57199, c=-0.0850036, perm=0, hash=76fc9a93fc
27 label=74, uid=76=s [73 ]a score=-4.75855, c=-0.101541, perm=8, hash=8c5dfe5a9392 prev:label=74, uid=76=s [73 ]a score=-4.65701, c=-0.0850182, End perm=8, hash=8c5dfe5a9392
28 label=74, uid=76=s [73 ]a score=-4.84414, c=-0.0855878, perm=8, hash=8c5dfe5a9392 prev:label=74, uid=76=s [73 ]a score=-4.75855, c=-0.101541, perm=8, hash=8c5dfe5a9392
29 label=74, uid=76=s [73 ]a score=-4.92929, c=-0.0851501, perm=8, hash=8c5dfe5a9392 prev:label=74, uid=76=s [73 ]a score=-4.84414, c=-0.0855878, perm=8, hash=8c5dfe5a9392
30 null_char score=-5.0143, c=-0.0850075, perm=0, hash=8c5dfe5a9392 prev:label=74, uid=76=s [73 ]a score=-4.92929, c=-0.0851501, perm=8, hash=8c5dfe5a9392
31 label=0, uid=0=  [20 ] score=-5.09956, c=-0.0852623, DawgStart perm=8, hash=a596e20eda163c prev:null_char score=-5.0143, c=-0.0850075, perm=0, hash=8c5dfe5a9392
32 label=0, uid=0=  [20 ] score=-7.03714, c=-1.93758, perm=8, hash=a596e20eda163c prev:label=0, uid=0=  [20 ] score=-5.09956, c=-0.0852623, DawgStart perm=8, hash=a596e20eda163c
33 label=14, uid=16=- [2d ]p score=-7.19661, c=-0.159473, Start perm=1, hash=c357fead85463ad6 prev:label=0, uid=0=  [20 ] score=-7.03714, c=-1.93758, perm=8, hash=a596e20eda163c
34 label=14, uid=16=- [2d ]p score=-7.33845, c=-0.141836, perm=1, hash=c357fead85463ad6 prev:label=14, uid=16=- [2d ]p score=-7.19661, c=-0.159473, Start perm=1, hash=c357fead85463ad6
35 label=14, uid=16=- [2d ]p score=-7.42891, c=-0.0904584, perm=1, hash=c357fead85463ad6 prev:label=14, uid=16=- [2d ]p score=-7.33845, c=-0.141836, perm=1, hash=c357fead85463ad6
36 label=14, uid=16=- [2d ]p score=-7.51424, c=-0.0853334, perm=1, hash=c357fead85463ad6 prev:label=14, uid=16=- [2d ]p score=-7.42891, c=-0.0904584, perm=1, hash=c357fead85463ad6
37 null_char score=-7.59962, c=-0.0853829, perm=0, hash=c357fead85463ad6 prev:label=14, uid=16=- [2d ]p score=-7.51424, c=-0.0853334, perm=1, hash=c357fead85463ad6
38 null_char score=-7.75306, c=-0.153435, perm=0, hash=c357fead85463ad6 prev:null_char score=-7.59962, c=-0.0853829, perm=0, hash=c357fead85463ad6
39 null_char score=-7.91821, c=-0.165152, perm=0, hash=c357fead85463ad6 prev:null_char score=-7.75306, c=-0.153435, perm=0, hash=c357fead85463ad6
40 label=86, uid=88=c [63 ]a score=-8.03211, c=-0.113906, End perm=8, hash=71ce70b338d969b0 prev:null_char score=-7.91821, c=-0.165152, perm=0, hash=c357fead85463ad6
41 label=86, uid=88=c [63 ]a score=-8.15245, c=-0.120332, perm=8, hash=71ce70b338d969b0 prev:label=86, uid=88=c [63 ]a score=-8.03211, c=-0.113906, End perm=8, hash=71ce70b338d969b0
42 label=86, uid=88=c [63 ]a score=-8.26633, c=-0.113882, perm=8, hash=71ce70b338d969b0 prev:label=86, uid=88=c [63 ]a score=-8.15245, c=-0.120332, perm=8, hash=71ce70b338d969b0
43 label=86, uid=88=c [63 ]a score=-8.35292, c=-0.0865956, perm=8, hash=71ce70b338d969b0 prev:label=86, uid=88=c [63 ]a score=-8.26633, c=-0.113882, perm=8, hash=71ce70b338d969b0
44 null_char score=-8.43794, c=-0.0850122, perm=0, hash=71ce70b338d969b0 prev:label=86, uid=88=c [63 ]a score=-8.35292, c=-0.0865956, perm=8, hash=71ce70b338d969b0
45 null_char score=-8.61489, c=-0.176952, perm=0, hash=71ce70b338d969b0 prev:null_char score=-8.43794, c=-0.0850122, perm=0, hash=71ce70b338d969b0
46 label=84, uid=86=l [6c ]a score=-8.71158, c=-0.0966929, End perm=8, hash=4188f36d107aae7a prev:null_char score=-8.61489, c=-0.176952, perm=0, hash=71ce70b338d969b0
47 label=84, uid=86=l [6c ]a score=-8.80934, c=-0.0977558, perm=8, hash=4188f36d107aae7a prev:label=84, uid=86=l [6c ]a score=-8.71158, c=-0.0966929, End perm=8, hash=4188f36d107aae7a
48 label=84, uid=86=l [6c ]a score=-8.89453, c=-0.0851927, perm=8, hash=4188f36d107aae7a prev:label=84, uid=86=l [6c ]a score=-8.80934, c=-0.0977558, perm=8, hash=4188f36d107aae7a
49 null_char score=-8.97964, c=-0.0851056, perm=0, hash=4188f36d107aae7a prev:label=84, uid=86=l [6c ]a score=-8.89453, c=-0.0851927, perm=8, hash=4188f36d107aae7a
50 label=73, uid=75=e [65 ]a score=-9.06464, c=-0.0850046, End perm=8, hash=4f8f2aa970b9d482 prev:null_char score=-8.97964, c=-0.0851056, perm=0, hash=4188f36d107aae7a
51 label=73, uid=75=e [65 ]a score=-9.18636, c=-0.121723, perm=8, hash=4f8f2aa970b9d482 prev:label=73, uid=75=e [65 ]a score=-9.06464, c=-0.0850046, End perm=8, hash=4f8f2aa970b9d482
52 label=73, uid=75=e [65 ]a score=-9.27491, c=-0.0885518, perm=8, hash=4f8f2aa970b9d482 prev:label=73, uid=75=e [65 ]a score=-9.18636, c=-0.121723, perm=8, hash=4f8f2aa970b9d482
53 label=73, uid=75=e [65 ]a score=-9.36019, c=-0.0852759, perm=8, hash=4f8f2aa970b9d482 prev:label=73, uid=75=e [65 ]a score=-9.27491, c=-0.0885518, perm=8, hash=4f8f2aa970b9d482
54 null_char score=-9.4452, c=-0.0850073, perm=0, hash=4f8f2aa970b9d482 prev:label=73, uid=75=e [65 ]a score=-9.36019, c=-0.0852759, perm=8, hash=4f8f2aa970b9d482
55 null_char score=-9.54014, c=-0.0949398, perm=0, hash=4f8f2aa970b9d482 prev:null_char score=-9.4452, c=-0.0850073, perm=0, hash=4f8f2aa970b9d482
56 label=82, uid=84=a [61 ]a score=-9.62516, c=-0.0850175, End perm=8, hash=dae453e2fb38b20b prev:null_char score=-9.54014, c=-0.0949398, perm=0, hash=4f8f2aa970b9d482
57 label=82, uid=84=a [61 ]a score=-9.7108, c=-0.0856489, perm=8, hash=dae453e2fb38b20b prev:label=82, uid=84=a [61 ]a score=-9.62516, c=-0.0850175, End perm=8, hash=dae453e2fb38b20b
58 label=82, uid=84=a [61 ]a score=-9.79602, c=-0.0852179, perm=8, hash=dae453e2fb38b20b prev:label=82, uid=84=a [61 ]a score=-9.7108, c=-0.0856489, perm=8, hash=dae453e2fb38b20b
59 null_char score=-9.88112, c=-0.085103, perm=0, hash=dae453e2fb38b20b prev:label=82, uid=84=a [61 ]a score=-9.79602, c=-0.0852179, perm=8, hash=dae453e2fb38b20b
60 null_char score=-10.0118, c=-0.130652, perm=0, hash=dae453e2fb38b20b prev:null_char score=-9.88112, c=-0.085103, perm=0, hash=dae453e2fb38b20b
61 label=68, uid=70=r [72 ]a score=-10.0971, c=-0.0853579, End perm=8, hash=395af5c45ce20a40 prev:null_char score=-10.0118, c=-0.130652, perm=0, hash=dae453e2fb38b20b
62 label=68, uid=70=r [72 ]a score=-10.1833, c=-0.0861257, perm=8, hash=395af5c45ce20a40 prev:label=68, uid=70=r [72 ]a score=-10.0971, c=-0.0853579, End perm=8, hash=395af5c45ce20a40
63 label=68, uid=70=r [72 ]a score=-10.2687, c=-0.085422, perm=8, hash=395af5c45ce20a40 prev:label=68, uid=70=r [72 ]a score=-10.1833, c=-0.0861257, perm=8, hash=395af5c45ce20a40
64 null_char score=-10.355, c=-0.0863678, perm=0, hash=395af5c45ce20a40 prev:label=68, uid=70=r [72 ]a score=-10.2687, c=-0.085422, perm=8, hash=395af5c45ce20a40
65 null_char score=-10.4426, c=-0.0875374, perm=0, hash=395af5c45ce20a40 prev:null_char score=-10.355, c=-0.0863678, perm=0, hash=395af5c45ce20a40
66 null_char score=-10.6689, c=-0.226327, perm=0, hash=395af5c45ce20a40 prev:null_char score=-10.4426, c=-0.0875374, perm=0, hash=395af5c45ce20a40
67 label=0, uid=0=  [20 ] score=-10.826, c=-0.157074, DawgStart perm=8, hash=a94deda592a817c3 prev:null_char score=-10.6689, c=-0.226327, perm=0, hash=395af5c45ce20a40
68 label=0, uid=0=  [20 ] score=-11.4004, c=-0.574383, perm=8, hash=a94deda592a817c3 prev:label=0, uid=0=  [20 ] score=-10.826, c=-0.157074, DawgStart perm=8, hash=a94deda592a817c3
69 label=0, uid=0=  [20 ] score=-11.6816, c=-0.281208, perm=8, hash=a94deda592a817c3 prev:label=0, uid=0=  [20 ] score=-11.4004, c=-0.574383, perm=8, hash=a94deda592a817c3
70 null_char score=-12.3443, c=-0.662683, perm=0, hash=a94deda592a817c3 prev:label=0, uid=0=  [20 ] score=-11.6816, c=-0.281208, perm=8, hash=a94deda592a817c3
71 null_char score=-12.9383, c=-0.594, perm=0, hash=a94deda592a817c3 prev:null_char score=-12.3443, c=-0.662683, perm=0, hash=a94deda592a817c3
72 null_char score=-13.3396, c=-0.401361, perm=0, hash=a94deda592a817c3 prev:null_char score=-12.9383, c=-0.594, perm=0, hash=a94deda592a817c3
73 null_char score=-13.5661, c=-0.226458, perm=0, hash=a94deda592a817c3 prev:null_char score=-13.3396, c=-0.401361, perm=0, hash=a94deda592a817c3
74 null_char score=-13.7988, c=-0.232681, perm=0, hash=a94deda592a817c3 prev:null_char score=-13.5661, c=-0.226458, perm=0, hash=a94deda592a817c3
75 null_char score=-14.3032, c=-0.504484, perm=0, hash=a94deda592a817c3 prev:null_char score=-13.7988, c=-0.232681, perm=0, hash=a94deda592a817c3
76 null_char score=-15.169, c=-0.865767, perm=0, hash=a94deda592a817c3 prev:null_char score=-14.3032, c=-0.504484, perm=0, hash=a94deda592a817c3
77 null_char score=-15.5085, c=-0.339494, perm=0, hash=a94deda592a817c3 prev:null_char score=-15.169, c=-0.865767, perm=0, hash=a94deda592a817c3
78 null_char score=-15.6532, c=-0.144672, perm=0, hash=a94deda592a817c3 prev:null_char score=-15.5085, c=-0.339494, perm=0, hash=a94deda592a817c3
79 label=61, uid=63=| [7c ] score=-15.7448, c=-0.0916148, Start End perm=8, hash=b9ee5953024c090e prev:null_char score=-15.6532, c=-0.144672, perm=0, hash=a94deda592a817c3
80 label=61, uid=63=| [7c ] score=-15.8317, c=-0.0868923, perm=8, hash=b9ee5953024c090e prev:label=61, uid=63=| [7c ] score=-15.7448, c=-0.0916148, Start End perm=8, hash=b9ee5953024c090e
81 label=61, uid=63=| [7c ] score=-15.9172, c=-0.085523, perm=8, hash=b9ee5953024c090e prev:label=61, uid=63=| [7c ] score=-15.8317, c=-0.0868923, perm=8, hash=b9ee5953024c090e
82 null_char score=-16.0022, c=-0.0850035, perm=0, hash=b9ee5953024c090e prev:label=61, uid=63=| [7c ] score=-15.9172, c=-0.085523, perm=8, hash=b9ee5953024c090e

Second choice path:
2 63=| [7c ] r=2.32779, c=-1.70693, s=1, e=1, perm=8
7 134=> [3e ] r=1.03789, c=-0.31976, s=0, e=1, perm=8
12 0=  [20 ] r=0.261591, c=-0.0893863, s=0, e=0, perm=8
17 91=w [77 ]a r=0.433167, c=-0.0912952, s=1, e=1, perm=8
22 80=i [69 ]a r=0.426551, c=-0.0860455, s=0, e=1, perm=8
26 76=s [73 ]a r=0.527308, c=-0.101541, s=0, e=1, perm=8
31 0=  [20 ] r=2.02284, c=-1.93758, s=0, e=0, perm=8
33 16=- [2d ]p r=0.477101, c=-0.159473, s=1, e=0, perm=1
40 88=c [63 ]a r=0.838686, c=-0.165152, s=0, e=1, perm=8
46 86=l [6c ]a r=0.541606, c=-0.176952, s=0, e=1, perm=8
50 75=e [65 ]a r=0.465661, c=-0.121723, s=0, e=1, perm=8
56 84=a [61 ]a r=0.435831, c=-0.0949398, s=0, e=1, perm=8
61 70=r [72 ]a r=6.64548, c=-1.27292, s=0, e=1, perm=8
76 0=  [20 ] r=1.08423, c=-0.83936, s=0, e=0, perm=8
79 63=| [7c ] r=1.11084, c=-0.325512, s=0, e=0, perm=2
Path total rating = 18.6366
2 63=| [7c ] r=2.32779, c=-1.70693, s=1, e=1, perm=8
7 134=> [3e ] r=1.03789, c=-0.31976, s=0, e=1, perm=8
12 0=  [20 ] r=0.261591, c=-0.0893863, s=0, e=0, perm=8
17 91=w [77 ]a r=0.433167, c=-0.0912952, s=1, e=1, perm=8
22 80=i [69 ]a r=0.426551, c=-0.0860455, s=0, e=1, perm=8
26 76=s [73 ]a r=0.527308, c=-0.101541, s=0, e=1, perm=8
31 0=  [20 ] r=2.02284, c=-1.93758, s=0, e=0, perm=8
33 16=- [2d ]p r=0.477101, c=-0.159473, s=1, e=0, perm=1
40 88=c [63 ]a r=0.838686, c=-0.165152, s=0, e=1, perm=8
46 86=l [6c ]a r=0.541606, c=-0.176952, s=0, e=1, perm=8
50 75=e [65 ]a r=0.465661, c=-0.121723, s=0, e=1, perm=8
56 84=a [61 ]a r=0.435831, c=-0.0949398, s=0, e=1, perm=8
61 70=r [72 ]a r=0.872893, c=-0.226327, s=0, e=1, perm=8
67 0=  [20 ] r=1.01267, c=-0.574383, s=0, e=0, perm=8
79 63=| [7c ] r=4.32063, c=-0.865767, s=1, e=1, perm=8
Path total rating = 16.0022
Best choice: accepted=0, adaptable=0, done=1 : Lang result : |> : R=3.36568, C=-11.9485, F=1, Perm=8, xht=[0,3.40282e+38], ambig=0
pos     NORM    NORM
str     |       >
state:  1       1
C       -1.707  -0.320
Best choice: accepted=0, adaptable=0, done=1 : Lang result : wis : R=1.38703, C=-13.5631, F=1, Perm=8, xht=[0,3.40282e+38], ambig=0
pos     NORM    NORM    NORM
str     w       i       s
state:  1       1       1
C       -0.091  -0.086  -0.102
Best choice: accepted=0, adaptable=0, done=1 : Lang result : -clear : R=3.63178, C=-13.5631, F=1, Perm=8, xht=[0,3.40282e+38], ambig=0
pos     NORM    NORM    NORM    NORM    NORM    NORM
str     -       c       l       e       a       r
state:  1       1       1       1       1       1
C       -0.159  -0.165  -0.177  -0.122  -0.095  -0.226
Best choice: accepted=0, adaptable=0, done=1 : Lang result : | : R=4.32063, C=-6.06037, F=1, Perm=8, xht=[0,3.40282e+38], ambig=0
pos     NORM
str     |
state:  1
C       -0.866
1 new words better than 0 old words: r: 3.36568 v 0 c: -11.9485 v 0 valid dict: 1 v 0
1 new words better than 0 old words: r: 1.38703 v 0 c: -13.5631 v 0 valid dict: 1 v 0
1 new words better than 0 old words: r: 3.63178 v 0 c: -13.5631 v 0 valid dict: 1 v 0
1 new words better than 0 old words: r: 4.32063 v 0 c: -6.06037 v 0 valid dict: 1 v 0
Pass1: | [| [7c ] ]
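
As a cross-check on the log for the first bounding box, the per-word `R` values in the "Best choice" lines appear to be the sums of the `r` values along the winning path, with the space entries counted only in the path total. The sketch below reproduces that arithmetic in Python. The `(label, r)` pairs are copied from the path whose total is 16.0022; `split_words` is a hypothetical helper, and reading the space label as a word break is an assumption drawn from this log, not from the Tesseract source.

```python
# (label, r) pairs copied from the path with "Path total rating = 16.0022".
best_path = [
    ("|", 2.32779), (">", 1.03789), (" ", 0.261591),
    ("w", 0.433167), ("i", 0.426551), ("s", 0.527308), (" ", 2.02284),
    ("-", 0.477101), ("c", 0.838686), ("l", 0.541606), ("e", 0.465661),
    ("a", 0.435831), ("r", 0.872893), (" ", 1.01267),
    ("|", 4.32063),
]


def split_words(path):
    """Group labels into words at space labels and sum r per word."""
    words, text, rating = [], "", 0.0
    for label, r in path:
        if label == " ":
            # A space closes the current word; its own r is not added to any word.
            if text:
                words.append((text, rating))
            text, rating = "", 0.0
        else:
            text += label
            rating += r
    if text:
        words.append((text, rating))
    return words


total = sum(r for _, r in best_path)   # includes the space entries
words = split_words(best_path)

for text, rating in words:
    print(f"{text!r}: R={rating:.5f}")
print(f"total={total:.4f}")
```

The four words it produces ('|>', 'wis', '-clear', '|') match the four "Best choice" entries in the log, and their ratings agree with the logged `R` values to the printed precision.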
Processing word with lang Latin at:Bounding box=(2194,3114)->(2336,3137)
Trying word using lang Latin, oem 1
Inverting image: old min=0.133858, mean=0.137795, sd=0.00393701, inv 0.598425,0.91916,0.115633
<null>=301 On [0, 2), scores= 100(a=82=0.00099) 100(w=89=4.14e-05), Mean=99.9961, max=99.9999
w=89 On [2, 8), scores= 4.35(<null>=301=95.6) 99.2(<null>=301=0.827) 100(<null>=301=0.00486) 99.5(<null>=301=0.479) 2.57(<null>=301=97.4) 1.74e-09(<null>=301=100), Mean=50.9318, max=99.9945
i=78 On [8, 14), scores= 15.8(<null>=301=84.2) 99.8(<null>=301=0.18) 100(I=6=0.00295) 59.9(<null>=301=40.1) 0.00117(<null>=301=100) 4.78e-05(<null>=301=100), Mean=45.9054, max=99.9933
s=74 On [14, 19), scores= 65.3(<null>=301=34.6) 100(S=3=0.00204) 99.8(<null>=301=0.158) 14(<null>=301=85.9) 4.58e-07(<null>=301=100), Mean=55.8452, max=99.9955
 =0 On [19, 25), scores= 6.64(<null>=301=93.4) 99.1(<null>=301=0.856) 100(<null>=301=0.0121) 75.5(<null>=301=24.5) 0.0372(<null>=301=100) 3.23e-05(<null>=301=100), Mean=46.8835, max=99.9856
-=14 On [25, 30), scores= 48(<null>=301=51.9) 97.7(-=291=0.532) 84.5(<null>=301=15.4) 0.274(<null>=301=99.7) 1.61e-05(<null>=301=100), Mean=46.0863, max=97.6613
 =0 On [30, 36), scores= 42.1(<null>=301=57.9) 99.7(<null>=301=0.325) 100(<null>=301=0.0303) 39.6(<null>=301=60.4) 0.00343(<null>=301=100) 0.00507(<null>=301=99.3), Mean=46.899, max=99.969
c=86 On [36, 41), scores= 85.2(<null>=301=14.8) 99.9(e=73=0.0795) 84.5(<null>=301=15.5) 0.163(<null>=301=99.8) 1.79e-08(<null>=301=100), Mean=53.9375, max=99.8645
l=84 On [41, 47), scores= 17.3(<null>=301=82.7) 97.6(<null>=301=2.25) 99.3(<null>=301=0.665) 47.2(<null>=301=52.8) 0.000567(<null>=301=100) 7.29e-08(<null>=301=100), Mean=43.5683, max=99.2975
e=73 On [47, 53), scores= 75.5(<null>=301=24.5) 100(<null>=301=0.0134) 99.9(<null>=301=0.0646) 27.5(<null>=301=72.5) 1.67e-06(<null>=301=100) 1.85e-07(<null>=301=99.9), Mean=50.4882, max=99.9856
a=82 On [53, 59), scores= 73.5(<null>=301=26.5) 100(<null>=301=0.00326) 99.9(<null>=301=0.123) 8.94(<null>=301=91.1) 3.76e-08(<null>=301=100) 9.61e-09(<null>=301=98.7), Mean=47.0509, max=99.9966
r=68 On [59, 65), scores= 97.5(<null>=301=2.53) 100(<null>=301=0.00341) 81.3(<null>=301=18.7) 0.117(<null>=301=99.9) 5.71e-05(<null>=301=100) 1.01e-07(<null>=301=100), Mean=46.4862, max=99.9957
0 null_char score=-0.0850772, c=-0.0850772, perm=0, hash=0
1 null_char score=-0.170079, c=-0.0850014, perm=0, hash=0 prev:null_char score=-0.0850772, c=-0.0850772, perm=0, hash=0
2 label=89, uid=91=w [77 ]a score=-0.255138, c=-0.0850592, Start End perm=8, hash=59 prev:null_char score=-0.170079, c=-0.0850014, perm=0, hash=0
3 label=89, uid=91=w [77 ]a score=-0.348624, c=-0.0934861, perm=8, hash=59 prev:label=89, uid=91=w [77 ]a score=-0.255138, c=-0.0850592, Start End perm=8, hash=59
4 label=89, uid=91=w [77 ]a score=-0.433679, c=-0.0850552, perm=8, hash=59 prev:label=89, uid=91=w [77 ]a score=-0.348624, c=-0.0934861, perm=8, hash=59
5 label=89, uid=91=w [77 ]a score=-0.523489, c=-0.0898102, perm=8, hash=59 prev:label=89, uid=91=w [77 ]a score=-0.433679, c=-0.0850552, perm=8, hash=59
6 label=89, uid=91=w [77 ]a score=-0.608491, c=-0.0850013, perm=8, hash=59 prev:label=89, uid=91=w [77 ]a score=-0.523489, c=-0.0898102, perm=8, hash=59
7 null_char score=-0.693491, c=-0.085, perm=0, hash=59 prev:label=89, uid=91=w [77 ]a score=-0.608491, c=-0.0850013, perm=8, hash=59
8 label=78, uid=80=i [69 ]a score=-0.778492, c=-0.0850016, End perm=8, hash=694c prev:null_char score=-0.693491, c=-0.085, perm=0, hash=59
9 label=78, uid=80=i [69 ]a score=-0.865298, c=-0.0868062, perm=8, hash=694c prev:label=78, uid=80=i [69 ]a score=-0.778492, c=-0.0850016, End perm=8, hash=694c
10 label=78, uid=80=i [69 ]a score=-0.950366, c=-0.0850674, perm=8, hash=694c prev:label=78, uid=80=i [69 ]a score=-0.865298, c=-0.0868062, perm=8, hash=694c
11 label=78, uid=80=i [69 ]a score=-1.03538, c=-0.0850176, perm=8, hash=694c prev:label=78, uid=80=i [69 ]a score=-0.950366, c=-0.0850674, perm=8, hash=694c
12 null_char score=-1.1204, c=-0.0850117, perm=0, hash=694c prev:label=78, uid=80=i [69 ]a score=-1.03538, c=-0.0850176, perm=8, hash=694c
13 null_char score=-1.20545, c=-0.0850533, perm=0, hash=694c prev:null_char score=-1.1204, c=-0.0850117, perm=0, hash=694c
14 label=74, uid=76=s [73 ]a score=-1.29065, c=-0.0851989, End perm=8, hash=7c37f2 prev:null_char score=-1.20545, c=-0.0850533, perm=0, hash=694c
15 label=74, uid=76=s [73 ]a score=-1.37569, c=-0.0850446, perm=8, hash=7c37f2 prev:label=74, uid=76=s [73 ]a score=-1.29065, c=-0.0851989, End perm=8, hash=7c37f2
16 label=74, uid=76=s [73 ]a score=-1.46234, c=-0.0866529, perm=8, hash=7c37f2 prev:label=74, uid=76=s [73 ]a score=-1.37569, c=-0.0850446, perm=8, hash=7c37f2
17 label=74, uid=76=s [73 ]a score=-1.54738, c=-0.0850319, perm=8, hash=7c37f2 prev:label=74, uid=76=s [73 ]a score=-1.46234, c=-0.0866529, perm=8, hash=7c37f2
18 null_char score=-1.63238, c=-0.085, perm=0, hash=7c37f2 prev:label=74, uid=76=s [73 ]a score=-1.54738, c=-0.0850319, perm=8, hash=7c37f2
19 null_char score=-1.78606, c=-0.153684, perm=0, hash=7c37f2 prev:null_char score=-1.63238, c=-0.085, perm=0, hash=7c37f2
20 label=0, uid=0=  [20 ] score=-1.87108, c=-0.0850225, perm=8, hash=9289ff7c prev:null_char score=-1.78606, c=-0.153684, perm=0, hash=7c37f2
21 label=0, uid=0=  [20 ] score=-2.06266, c=-0.191574, perm=8, hash=9289ff7c prev:label=0, uid=0=  [20 ] score=-1.87108, c=-0.0850225, perm=8, hash=9289ff7c
22 label=0, uid=0=  [20 ] score=-2.25419, c=-0.191534, perm=8, hash=9289ff7c prev:label=0, uid=0=  [20 ] score=-2.06266, c=-0.191574, perm=8, hash=9289ff7c
23 null_char score=-2.44628, c=-0.192088, perm=2, hash=9289ff7c prev:label=0, uid=0=  [20 ] score=-2.25419, c=-0.191534, perm=8, hash=9289ff7c
24 null_char score=-2.63764, c=-0.191357, perm=2, hash=9289ff7c prev:null_char score=-2.44628, c=-0.192088, perm=2, hash=9289ff7c
25 label=14, uid=16=- [2d ]p score=-2.83096, c=-0.193324, perm=2, hash=acdecb6456 prev:null_char score=-2.63764, c=-0.191357, perm=2, hash=9289ff7c
26 label=14, uid=16=- [2d ]p score=-3.07546, c=-0.244496, perm=2, hash=acdecb6456 prev:label=14, uid=16=- [2d ]p score=-2.83096, c=-0.193324, perm=2, hash=acdecb6456
27 label=14, uid=16=- [2d ]p score=-3.27022, c=-0.194767, perm=2, hash=acdecb6456 prev:label=14, uid=16=- [2d ]p score=-3.07546, c=-0.244496, perm=2, hash=acdecb6456
28 null_char score=-3.4677, c=-0.197473, perm=2, hash=acdecb6456 prev:label=14, uid=16=- [2d ]p score=-3.27022, c=-0.194767, perm=2, hash=acdecb6456
29 null_char score=-3.65897, c=-0.19127, perm=2, hash=acdecb6456 prev:null_char score=-3.4677, c=-0.197473, perm=2, hash=acdecb6456
30 label=0, uid=0=  [20 ] score=-3.74404, c=-0.0850688, DawgStart perm=0, hash=cbeed3f05d74 prev:null_char score=-3.65897, c=-0.19127, perm=2, hash=acdecb6456
31 label=0, uid=0=  [20 ] score=-3.83231, c=-0.0882736, perm=0, hash=cbeed3f05d74 prev:label=0, uid=0=  [20 ] score=-3.74404, c=-0.0850688, DawgStart perm=0, hash=cbeed3f05d74
32 label=0, uid=0=  [20 ] score=-3.91762, c=-0.08531, perm=0, hash=cbeed3f05d74 prev:label=0, uid=0=  [20 ] score=-3.83231, c=-0.0882736, perm=0, hash=cbeed3f05d74
33 label=0, uid=0=  [20 ] score=-4.00262, c=-0.085002, perm=0, hash=cbeed3f05d74 prev:label=0, uid=0=  [20 ] score=-3.91762, c=-0.08531, perm=0, hash=cbeed3f05d74
34 null_char score=-4.08766, c=-0.0850354, perm=0, hash=cbeed3f05d74 prev:label=0, uid=0=  [20 ] score=-4.00262, c=-0.085002, perm=0, hash=cbeed3f05d74
35 null_char score=-4.17984, c=-0.0921804, perm=0, hash=cbeed3f05d74 prev:null_char score=-4.08766, c=-0.0850354, perm=0, hash=cbeed3f05d74
36 label=86, uid=88=c [63 ]a score=-4.26509, c=-0.0852522, Start End perm=8, hash=f093be058e3f2e prev:null_char score=-4.17984, c=-0.0921804, perm=0, hash=cbeed3f05d74
37 label=86, uid=88=c [63 ]a score=-4.35145, c=-0.0863558, perm=8, hash=f093be058e3f2e prev:label=86, uid=88=c [63 ]a score=-4.26509, c=-0.0852522, Start End perm=8, hash=f093be058e3f2e
38 label=86, uid=88=c [63 ]a score=-4.43671, c=-0.085263, perm=8, hash=f093be058e3f2e prev:label=86, uid=88=c [63 ]a score=-4.35145, c=-0.0863558, perm=8, hash=f093be058e3f2e
39 null_char score=-4.52338, c=-0.0866698, perm=0, hash=f093be058e3f2e prev:label=86, uid=88=c [63 ]a score=-4.43671, c=-0.085263, perm=8, hash=f093be058e3f2e
40 null_char score=-4.60841, c=-0.0850318, perm=0, hash=f093be058e3f2e prev:null_char score=-4.52338, c=-0.0866698, perm=0, hash=f093be058e3f2e
41 label=84, uid=86=l [6c ]a score=-4.69345, c=-0.0850427, End perm=8, hash=1bce4a2a8dce8899 prev:null_char score=-4.60841, c=-0.0850318, perm=0, hash=f093be058e3f2e
42 label=84, uid=86=l [6c ]a score=-4.80252, c=-0.109067, perm=8, hash=1bce4a2a8dce8899 prev:label=84, uid=86=l [6c ]a score=-4.69345, c=-0.0850427, End perm=8, hash=1bce4a2a8dce8899
43 label=84, uid=86=l [6c ]a score=-4.89457, c=-0.0920493, perm=8, hash=1bce4a2a8dce8899 prev:label=84, uid=86=l [6c ]a score=-4.80252, c=-0.109067, perm=8, hash=1bce4a2a8dce8899
44 label=84, uid=86=l [6c ]a score=-4.97963, c=-0.0850596, perm=8, hash=1bce4a2a8dce8899 prev:label=84, uid=86=l [6c ]a score=-4.89457, c=-0.0920493, perm=8, hash=1bce4a2a8dce8899
45 null_char score=-5.06463, c=-0.085006, perm=0, hash=1bce4a2a8dce8899 prev:label=84, uid=86=l [6c ]a score=-4.97963, c=-0.0850596, perm=8, hash=1bce4a2a8dce8899
46 null_char score=-5.14988, c=-0.0852435, perm=0, hash=1bce4a2a8dce8899 prev:null_char score=-5.06463, c=-0.085006, perm=0, hash=1bce4a2a8dce8899
47 label=73, uid=75=e [65 ]a score=-5.23488, c=-0.0850003, End perm=8, hash=cd5b7e3349a524e7 prev:null_char score=-5.14988, c=-0.0852435, perm=0, hash=1bce4a2a8dce8899
48 label=73, uid=75=e [65 ]a score=-5.32002, c=-0.0851438, perm=8, hash=cd5b7e3349a524e7 prev:label=73, uid=75=e [65 ]a score=-5.23488, c=-0.0850003, End perm=8, hash=cd5b7e3349a524e7
49 label=73, uid=75=e [65 ]a score=-5.4057, c=-0.0856761, perm=8, hash=cd5b7e3349a524e7 prev:label=73, uid=75=e [65 ]a score=-5.32002, c=-0.0851438, perm=8, hash=cd5b7e3349a524e7
50 label=73, uid=75=e [65 ]a score=-5.49112, c=-0.0854247, perm=8, hash=cd5b7e3349a524e7 prev:label=73, uid=75=e [65 ]a score=-5.4057, c=-0.0856761, perm=8, hash=cd5b7e3349a524e7
51 null_char score=-5.57612, c=-0.0850005, perm=0, hash=cd5b7e3349a524e7 prev:label=73, uid=75=e [65 ]a score=-5.49112, c=-0.0854247, perm=8, hash=cd5b7e3349a524e7
52 null_char score=-5.66181, c=-0.0856828, perm=0, hash=cd5b7e3349a524e7 prev:null_char score=-5.57612, c=-0.0850005, perm=0, hash=cd5b7e3349a524e7
53 label=82, uid=84=a [61 ]a score=-5.74681, c=-0.0850002, End perm=8, hash=41eee080e0d189c6 prev:null_char score=-5.66181, c=-0.0856828, perm=0, hash=cd5b7e3349a524e7
54 label=82, uid=84=a [61 ]a score=-5.83184, c=-0.0850337, perm=8, hash=41eee080e0d189c6 prev:label=82, uid=84=a [61 ]a score=-5.74681, c=-0.0850002, End perm=8, hash=41eee080e0d189c6
55 label=82, uid=84=a [61 ]a score=-5.91808, c=-0.0862392, perm=8, hash=41eee080e0d189c6 prev:label=82, uid=84=a [61 ]a score=-5.83184, c=-0.0850337, perm=8, hash=41eee080e0d189c6
56 label=82, uid=84=a [61 ]a score=-6.00311, c=-0.0850275, perm=8, hash=41eee080e0d189c6 prev:label=82, uid=84=a [61 ]a score=-5.91808, c=-0.0862392, perm=8, hash=41eee080e0d189c6
57 null_char score=-6.08811, c=-0.0850001, perm=0, hash=41eee080e0d189c6 prev:label=82, uid=84=a [61 ]a score=-6.00311, c=-0.0850275, perm=8, hash=41eee080e0d189c6
58 null_char score=-6.18576, c=-0.0976547, perm=0, hash=41eee080e0d189c6 prev:null_char score=-6.08811, c=-0.0850001, perm=0, hash=41eee080e0d189c6
59 label=68, uid=70=r [72 ]a score=-6.27077, c=-0.0850028, End perm=8, hash=c7ccd80937308825 prev:null_char score=-6.18576, c=-0.0976547, perm=0, hash=41eee080e0d189c6
60 label=68, uid=70=r [72 ]a score=-6.35581, c=-0.0850432, perm=8, hash=c7ccd80937308825 prev:label=68, uid=70=r [72 ]a score=-6.27077, c=-0.0850028, End perm=8, hash=c7ccd80937308825
61 label=68, uid=70=r [72 ]a score=-6.4409, c=-0.0850943, perm=8, hash=c7ccd80937308825 prev:label=68, uid=70=r [72 ]a score=-6.35581, c=-0.0850432, perm=8, hash=c7ccd80937308825
62 null_char score=-6.52711, c=-0.0862095, perm=0, hash=c7ccd80937308825 prev:label=68, uid=70=r [72 ]a score=-6.4409, c=-0.0850943, perm=8, hash=c7ccd80937308825
63 null_char score=-6.61215, c=-0.0850359, perm=0, hash=c7ccd80937308825 prev:null_char score=-6.52711, c=-0.0862095, perm=0, hash=c7ccd80937308825
64 null_char score=-6.69715, c=-0.0850039, perm=0, hash=c7ccd80937308825 prev:null_char score=-6.61215, c=-0.0850359, perm=0, hash=c7ccd80937308825

Second choice path:
2 91=w [77 ]a r=0.608491, c=-0.0934861, s=1, e=1, perm=8
8 80=i [69 ]a r=0.426893, c=-0.0868062, s=0, e=1, perm=8
14 76=s [73 ]a r=0.750678, c=-0.153684, s=0, e=1, perm=8
20 0=  [20 ] r=0.468131, c=-0.191574, s=0, e=0, perm=8
25 16=- [2d ]p r=1.40478, c=-0.244496, s=0, e=0, perm=2
30 0=  [20 ] r=0.773222, c=-0.198616, s=0, e=0, perm=2
36 88=c [63 ]a r=0.976695, c=-0.207406, s=0, e=0, perm=2
41 86=l [6c ]a r=1.22157, c=-0.245401, s=0, e=0, perm=2
47 75=e [65 ]a r=1.15086, c=-0.192771, s=0, e=0, perm=2
53 84=a [61 ]a r=1.15196, c=-0.194038, s=0, e=0, perm=2
59 70=r [72 ]a r=1.5616, c=-0.219723, s=0, e=0, perm=2
Path total rating = 10.4949
2 91=w [77 ]a r=0.608491, c=-0.0934861, s=1, e=1, perm=8
8 80=i [69 ]a r=0.426893, c=-0.0868062, s=0, e=1, perm=8
14 76=s [73 ]a r=0.750678, c=-0.153684, s=0, e=1, perm=8
20 0=  [20 ] r=0.468131, c=-0.191574, s=0, e=0, perm=8
25 16=- [2d ]p r=1.01603, c=-0.244496, s=0, e=0, perm=2
30 0=  [20 ] r=0.732398, c=-0.085002, s=0, e=0, perm=0
36 88=c [63 ]a r=0.434087, c=-0.0921804, s=1, e=1, perm=8
41 86=l [6c ]a r=0.542921, c=-0.109067, s=0, e=1, perm=8
47 75=e [65 ]a r=0.511494, c=-0.0856761, s=0, e=1, perm=8
53 84=a [61 ]a r=0.511984, c=-0.0862392, s=0, e=1, perm=8
59 70=r [72 ]a r=0.694044, c=-0.0976547, s=0, e=1, perm=8
Path total rating = 6.69715
Best choice: accepted=1, adaptable=0, done=1 : Lang result : wis : R=1.78606, C=-1.34102, F=1, Perm=8, xht=[0,3.40282e+38], ambig=0
pos     NORM    NORM    NORM
str     w       i       s
state:  1       1       1
C       -0.093  -0.087  -0.154
Best choice: accepted=1, adaptable=0, done=1 : Lang result : - : R=1.01603, C=-1.71147, F=1, Perm=2, xht=[0,3.40282e+38], ambig=0
pos     NORM
str     -
state:  1
C       -0.244
Best choice: accepted=1, adaptable=0, done=1 : Lang result : clear : R=2.69453, C=-0.763471, F=1, Perm=8, xht=[0,3.40282e+38], ambig=0
pos     NORM    NORM    NORM    NORM    NORM
str     c       l       e       a       r
state:  1       1       1       1       1
C       -0.092  -0.109  -0.086  -0.086  -0.098
1 new words better than 0 old words: r: 1.78606 v 0 c: -1.34102 v 0 valid dict: 1 v 0
1 new words better than 0 old words: r: 1.01603 v 0 c: -1.71147 v 0 valid dict: 0 v 0
1 new words better than 0 old words: r: 2.69453 v 0 c: -0.763471 v 0 valid dict: 1 v 0
Pass1: clear [c [63 ]a l [6c ]a e [65 ]a a [61 ]a r [72 ]a ]
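
A side note on reading the two path listings above: the "Path total rating" lines look like plain sums of the per-character r= values, and the R= values on the "Best choice" lines look like sums over the characters of each word. A minimal Python check, with the numbers copied from the log (the variable and function names are made up for this sketch, and the interpretation is an assumption, not something taken from the Tesseract sources):

```python
# Per-character ratings (the r= values) copied from the two listings above.
second_choice = [0.608491, 0.426893, 0.750678, 0.468131, 1.40478, 0.773222,
                 0.976695, 1.22157, 1.15086, 1.15196, 1.5616]
following_path = [0.608491, 0.426893, 0.750678, 0.468131, 1.01603, 0.732398,
                  0.434087, 0.542921, 0.511494, 0.511984, 0.694044]


def path_rating(ratings):
    """Sum the per-character ratings of one path."""
    return sum(ratings)


if __name__ == "__main__":
    # Compare with "Path total rating = 10.4949" and "= 6.69715".
    print(round(path_rating(second_choice), 4))
    print(round(path_rating(following_path), 5))
    # Compare with R=1.78606 for "wis" and R=2.69453 for "clear".
    print(round(path_rating(following_path[0:3]), 5))
    print(round(path_rating(following_path[6:11]), 5))
```

If that reading is right, the listing without a header is the cheaper path, which would fit with its word ratings being the ones that show up on the "Best choice" lines.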

Expected Behavior:

Clearly readable text should be recognized without failure.

Suggested Fix:

stweil commented 2 years ago

I tried git bisect now with tessdata_fast/eng and did not find a Tesseract release without that issue. Even 4.0.0 creates the double content in my test.

amitdo commented 2 years ago

AFAIK, fast was trained on inverted text and non-inverted text and on upright pages and upside down pages.

rmast commented 2 years ago

I'm not convinced it's inversion related. I think it already originates somewhere where segments are propagated into each other, probably while searching for underlines. If I run this command, 'wis-clear' is still duplicated and 'print' is still missing:

tesseract --dpi 300 -l Latin 175789293-f39ddfdb-6f3e-4598-8d16-80a1f4a88b36.jpg outputwithoutinvert

So textline inversion might be removed as a label.

rmast commented 2 years ago

By the way, this one is compiled without legacy, so it's in the new parts

rmast commented 2 years ago

Testing underline on blob at (2149,3149)->(2396,3189), base=3160
Occs:247 247 247
Testing underline on blob at (2149,3103)->(2396,3144), base=3085
Occs:0 0 247
Underlined blob at:Bounding box=(2149,3103)->(2396,3144)
Was:Bounding box=(2149,3103)->(2396,3144)
Segmenting baseline of 19 blobs at (2149,3149)
Made 1 segments on row at (2355,3149)
Segmenting baseline of 11 blobs at (2164,3113)
Made 1 segments on row at (2307,3114)

Input height=26.25, Estimate x-height=40 pixels, jumplimit=6.00
1(2168,3149), Diff=-11.00, Delta=0.000, Drift=0.000, P=0
2(2189,3149), Diff=-11.00, Delta=0.000, Drift=0.000, P=0
3(2209,3149), Diff=-11.00, Delta=0.000, Drift=0.000, P=0
4(2230,3149), Diff=-11.00, Delta=0.000, Drift=0.000, P=0
5(2251,3149), Diff=-11.00, Delta=0.000, Drift=0.000, P=0
6(2271,3149), Diff=-11.00, Delta=0.000, Drift=0.000, P=0
7(2292,3149), Diff=-11.00, Delta=0.000, Drift=0.000, P=0
8(2313,3149), Diff=-11.00, Delta=0.000, Drift=0.000, P=0
9(2333,3149), Diff=-11.00, Delta=0.000, Drift=0.000, P=0
10(2354,3149), Diff=-11.00, Delta=0.000, Drift=0.000, P=0
11(2375,3149), Diff=-11.00, Delta=0.000, Drift=0.000, P=0
1(2168,3149), Diff=-11.00, Delta=0.000, Drift=0.000, P=0
0(2149,3149), Diff=-11.00, Delta=0.000, Drift=0.000, P=0

Input height=26.25, Estimate x-height=16 pixels, jumplimit=2.40
4(2269,3114), Diff=-0.32, Delta=0.000, Drift=0.000, P=0
5(2286,3114), Diff=-0.47, Delta=-0.148, Drift=0.000, P=0
6(2293,3114), Diff=-0.62, Delta=-0.247, Drift=-0.049, P=0
7(2310,3114), Diff=-0.81, Delta=-0.362, Drift=-0.132, P=0
8(2327,3115), Diff=0.00, Delta=0.573, Drift=-0.252, P=0
4(2269,3114), Diff=-0.32, Delta=0.000, Drift=0.000, P=0
3(2224,3114), Diff=0.25, Delta=0.568, Drift=0.000, P=0
2(2217,3115), Diff=1.38, Delta=1.514, Drift=0.189, P=0
1(2194,3115), Diff=1.57, Delta=1.195, Drift=0.694, P=0
0(2164,3113), Diff=0.00, Delta=-0.771, Drift=1.092, P=0
First turn is 0 at (2169,3113)
Turn 1 is 1 at (2204,3115), mid pt is 0@2169, final @2187
Segmenting baseline of 34 blobs at (1902,2842)
Made 1 segments on row at (2347,2841)
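
For anyone who wants to compare the underline tests across runs, here is a small Python sketch that parses the "Testing underline" lines quoted above. The regular expressions only cover the message shapes visible in this log, and the field names are invented for the sketch. It also records whether the reported base lies inside the blob's vertical extent, which is merely an observation about these numbers and not a claim about how Tesseract decides.

```python
import re

LOG = """\
Testing underline on blob at (2149,3149)->(2396,3189), base=3160
Occs:247 247 247
Testing underline on blob at (2149,3103)->(2396,3144), base=3085
Occs:0 0 247
Underlined blob at:Bounding box=(2149,3103)->(2396,3144)
"""

TEST_RE = re.compile(
    r"Testing underline on blob at \((\d+),(\d+)\)->\((\d+),(\d+)\), base=(\d+)")
OCCS_RE = re.compile(r"Occs:(\d+) (\d+) (\d+)")
UNDERLINED_RE = re.compile(
    r"Underlined blob at:Bounding box=\((\d+),(\d+)\)->\((\d+),(\d+)\)")


def parse_underline_tests(log):
    """Return one dict per tested blob, in log order."""
    tests = []
    for line in log.splitlines():
        m = TEST_RE.match(line)
        if m:
            left, bottom, right, top, base = map(int, m.groups())
            tests.append({
                "box": (left, bottom, right, top),
                "base": base,
                "base_inside_box": bottom <= base <= top,
                "occs": None,
                "underlined": False,
            })
            continue
        m = OCCS_RE.match(line)
        if m and tests:
            # The Occs line belongs to the most recent test.
            tests[-1]["occs"] = tuple(map(int, m.groups()))
            continue
        m = UNDERLINED_RE.match(line)
        if m:
            box = tuple(map(int, m.groups()))
            for test in tests:
                if test["box"] == box:
                    test["underlined"] = True
    return tests


if __name__ == "__main__":
    for test in parse_underline_tests(LOG):
        print(test)
```

In this log the only blob flagged as underlined is also the only one whose base (3085) falls below its own bounding box (3103 to 3144); whether that is cause or coincidence would need a look at the underline code.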

stweil commented 2 years ago

So textline inversion might be removed as a label.

I also no longer think that it is related to textline inversion as the issue also occurs in old versions like 4.0.0. My previous git bisect result was misleading.

By the way, this one is compiled without legacy, so it's in the new parts

The layout detection is mostly still old code.

rmast commented 2 years ago

I've now pinpointed the disappearing upper bounding box. The dump from Block1 (textord.cpp) is:

Block 28
Bounding box=(2149,3103)->(2396,3189)
Bounding box=(2149,3149)->(2396,3189)
Bounding box=(2149,3103)->(2396,3144)
Bounding box=(2194,3114)->(2237,3137)
Bounding box=(2249,3121)->(2257,3125)
Bounding box=(2269,3114)->(2336,3137)
/Block

It disappears in textord.cpp at:

// Remove empties.
cleanup_blocks(PSM_WORD_FIND_ENABLED(pageseg_mode), blocks);
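
To make the nesting of these boxes explicit, here is a minimal Python sketch that checks which boxes lie inside which row box. The coordinates are copied from the comment above; the `Box` type and the split into "rows" and "words" are assumptions made for illustration, not Tesseract data structures.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box in Tesseract debug coordinates (y grows upward)."""
    left: int
    bottom: int
    right: int
    top: int

    def contains(self, other: "Box") -> bool:
        """True if `other` lies completely inside this box (edges included)."""
        return (self.left <= other.left and self.bottom <= other.bottom
                and self.right >= other.right and self.top >= other.top)


BLOCK = Box(2149, 3103, 2396, 3189)
ROWS = [Box(2149, 3149, 2396, 3189), Box(2149, 3103, 2396, 3144)]
WORDS = [Box(2194, 3114, 2237, 3137),
         Box(2249, 3121, 2257, 3125),
         Box(2269, 3114, 2336, 3137)]


def children_of(parent, candidates):
    """Return the candidate boxes that fall inside `parent`."""
    return [box for box in candidates if parent.contains(box)]


if __name__ == "__main__":
    for row in ROWS:
        print(row, "->", children_of(row, WORDS))
```

With these numbers all three smaller boxes fall inside the row box spanning y=3103 to 3144, and none fall inside the row box spanning y=3149 to 3189.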

rmast commented 2 years ago

This might be involved:

B:28 R:1 -- Can't do isolated row stats.
B:28 R:1 -- Inadequate certain spaces.

tesseract -c textord_restore_underlines=1 --dpi 300 -l Latin -c textord_noise_rejrows=0 -c textord_debug_block=28 -c textord_noise_debug=1 -c textord_debug_tabfind=1 -c textord_debug_bugs=1 -c textord_show_final_rows=1 -c tosp_debug_level=6 /home/rmast/175789293-f39ddfdb-6f3e-4598-8d16-80a1f4a88b36.jpg outputdebug85online

amitdo commented 2 years ago

3871-ROI

With this image, I get an empty output with all available eng/Latin models.

amitdo commented 2 years ago

After upscaling (2x)

3871-ROI-x2

output:

> wis - clear

amitdo commented 2 years ago

After upscaling (4x)

3871-ROI-x4

output:

> print

> Wis - clear

zdenop commented 2 years ago

We know that some parts are skipped in images with a complex (table-like) layout. Tesseract has just a basic document layout analysis.

Do your own layout segmentation for all complicated document layouts and either store it in a UZN file or OCR each segment individually. Also, I suggest following the docs (https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md): use black text on a white background.

Here is an example of amitdo's test image.

tesseract inverted.png - --psm 4
UZN file inverted.uzn loaded.
> print

> wis - clear

i3871_inverted.zip
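
For readers who want to try the UZN route with boxes taken from the debug output, here is a hedged Python sketch. It assumes that a UZN line has the form "x y width height label" with y measured from the top of the image, and that the debug boxes use a bottom-left origin; both are my understanding and should be checked against your Tesseract version. The image height of 3508 pixels in the example is a made-up value, not the real height of the test scan.

```python
from pathlib import Path


def debug_box_to_uzn_line(left, bottom, right, top, image_height, label="Text"):
    """Convert a bottom-left-origin debug box to an (assumed) UZN line.

    Assumed UZN layout: "x y width height label", y counted from the top.
    """
    width = right - left
    height = top - bottom
    y_from_top = image_height - top
    return f"{left} {y_from_top} {width} {height} {label}"


def uzn_path_for(image_path):
    """Return the .uzn path next to the image (same base name)."""
    return Path(image_path).with_suffix(".uzn")


def write_uzn(image_path, boxes, image_height):
    """Write one UZN line per (left, bottom, right, top) box."""
    lines = [debug_box_to_uzn_line(*box, image_height) for box in boxes]
    uzn_path = uzn_path_for(image_path)
    uzn_path.write_text("\n".join(lines) + "\n")
    return uzn_path


if __name__ == "__main__":
    HYPOTHETICAL_IMAGE_HEIGHT = 3508  # assumption, replace with the real height
    rows = [(2149, 3149, 2396, 3189), (2149, 3103, 2396, 3144)]
    for row in rows:
        print(debug_box_to_uzn_line(*row, HYPOTHETICAL_IMAGE_HEIGHT))
```

The file name matters: in the example run above, `inverted.uzn` is picked up for `inverted.png` when running with `--psm 4`.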

rmast commented 2 years ago

Please let us know if you find an open source automatic segmenter that generally and unattendedly does a better job than Tesseract itself. I guess that would be a hit.

Or could you say that x4 upscaling in general does a better job?

rmast commented 2 years ago

3871-ROI

With this image, I get an empty output with all available eng/Latin models.

That's interesting! That makes focussing on the issue easier.

I've run it with tesseract -c textord_restore_underlines=1 --dpi 300 -l Latin -c textord_noise_rejrows=0 -c textord_debug_block=28 -c textord_noise_debug=1 -c textord_debug_tabfind=1 -c textord_debug_bugs=1 -c textord_show_final_rows=1 -c tosp_debug_level=6 -c tosp_redo_kern_limit=1 -c tosp_enough_small_gaps=0.05 -c tosp_gap_factor=0.17 -c tosp_row_use_cert_spaces=false doetiehet.png output

and it gave D print | D wis- crear |

when the cleanup_blocks was commented out.

With the cleanup blocks this was the debug-result (and no output).

Vertical skew vector=(0,1)
Starting sh -c "trap 'kill %1' 0 1 2 ; java -Xms1024m -Xmx2048m -jar /home/rmast/tesseract/java/ScrollView.jar & wait"
ScrollView: Waiting for server...
Socket started on port 8461
Client connected
Click at (176, 180)
Click at (176, 180)
Inserted 18 blobs into grid, 0 rejected.
Beginning real tab search with vertical = 0,1...
Vertical skew vector=(0,1)
Checking for vertical lines
Moved 0 large blobs to normal list
Inserted 0 blobs into grid, 0 rejected.
Inserted 17 blobs into grid, 0 rejected.
Beginning real tab search with vertical = 0,1...
Vertical skew vector=(0,1)
Checking for vertical lines
Vertical skew vector=(0,1)
Inserted 0 blobs into grid, 0 rejected.
Inserted 17 blobs into grid, 0 rejected.
Found 1 Column candidates:
Found 1 Improved columns:
Found 1 Final Columns:
Column id 0 applies to range = 0 - 11
Inserted 0 blobs into grid, 0 rejected.
Inserted 17 blobs into grid, 0 rejected.
Considering part for merge at:ColPart: (M53-B73-B73/74,152/153)->(320B-320B-340M/320,192/192) w-ok=0, v-ok=0, type=1T4, fc=-1, lc=-1, boxes=8 ts=0 bs=0 ls=0 rs=0
Considering part for merge at:ColPart: (M53-B73-B73/74,106/107)->(320B-320B-340M/320,147/147) w-ok=0, v-ok=0, type=1T4, fc=-1, lc=-1, boxes=12 ts=0 bs=0 ls=0 rs=0
Changed column groups at grid index 5, y=130
ColPart: (M53-B73-B73/74,152/153)->(320B-320B-340M/320,192/192) w-ok=0, v-ok=0, type=1T4, fc=1, lc=1, boxes=8 ts=0 bs=0 ls=0 rs=0
side step = 6.50, top spacing = 45, bottom spacing=46
ColPart: (M53-B73-B73/74,106/107)->(320B-320B-340M/320,147/147) w-ok=0, v-ok=0, type=1T4, fc=1, lc=1, boxes=12 ts=0 bs=0 ls=0 rs=0
side step = 2.50, top spacing = 262, bottom spacing=262
Spacings unequal: upper:45/46, lower:262/262, sizes 40 41 0
Added line to current block.
Making block at (73,106)->(320,192)
Found 1 blocks, 1 to_blocks
Blk 1, type 1 rerotation(1.00, -0.00), char(0.00,0.00), box:Bounding box=(73,106)->(320,192)
Testing underline on blob at (73,152)->(320,192), base=163
Occs:247 247 247
Testing underline on blob at (73,106)->(320,147), base=114
Occs:247 247 247
B:1 R:1 -- Can't do isolated row stats.
B:1 R:1 -- DON'T BELIEVE SPACE 128.00 74 20.00 -> 192.00.
B:1 R:1 L:247-- Kn:128 Sp:20 Thr:74 -- Kn:128.00 (144) Thr:160 (384) Sp:192.00
B:1 R:2 -- Can't do isolated row stats.
B:1 R:2 -- DON'T BELIEVE SPACE 128.00 74 20.00 -> 192.00.
B:1 R:2 L:247-- Kn:128 Sp:20 Thr:74 -- Kn:128.00 (144) Thr:160 (384) Sp:192.00
Row: Made 1 words in row ((73,152)(320,192))
Row: Made 1 words in row ((73,106)(320,147))
cleanup_blocks: # rows = 0 / 2
cleanup_blocks: # blocks = 0 / 1

Vertical skew vector=(0,1)
Click at (195, 174)
Click at (195, 174)
Click at (188, 157)
Click at (188, 157)
Click at (206, 143)
Click at (206, 143)
Inserted 18 blobs into grid, 0 rejected.
Beginning real tab search with vertical = 0,1...
Vertical skew vector=(0,1)
Checking for vertical lines
Moved 0 large blobs to normal list
Inserted 0 blobs into grid, 0 rejected.
Inserted 17 blobs into grid, 0 rejected.
Beginning real tab search with vertical = 0,1...
Vertical skew vector=(0,1)
Checking for vertical lines
Vertical skew vector=(0,1)
Inserted 0 blobs into grid, 0 rejected.
Inserted 17 blobs into grid, 0 rejected.
Found 1 Column candidates:
Found 1 Improved columns:
Found 1 Final Columns:
Column id 0 applies to range = 0 - 11
Inserted 0 blobs into grid, 0 rejected.
Inserted 17 blobs into grid, 0 rejected.
Considering part for merge at:ColPart: (M53-B73-B73/74,152/153)->(320B-320B-340M/320,192/192) w-ok=0, v-ok=0, type=1T4, fc=-1, lc=-1, boxes=8 ts=0 bs=0 ls=0 rs=0
Considering part for merge at:ColPart: (M53-B73-B73/74,106/107)->(320B-320B-340M/320,147/147) w-ok=0, v-ok=0, type=1T4, fc=-1, lc=-1, boxes=12 ts=0 bs=0 ls=0 rs=0
Changed column groups at grid index 5, y=130
ColPart: (M53-B73-B73/74,152/153)->(320B-320B-340M/320,192/192) w-ok=0, v-ok=0, type=1T4, fc=1, lc=1, boxes=8 ts=0 bs=0 ls=0 rs=0
side step = 6.50, top spacing = 45, bottom spacing=46
ColPart: (M53-B73-B73/74,106/107)->(320B-320B-340M/320,147/147) w-ok=0, v-ok=0, type=1T4, fc=1, lc=1, boxes=12 ts=0 bs=0 ls=0 rs=0
side step = 2.50, top spacing = 262, bottom spacing=262
Spacings unequal: upper:45/46, lower:262/262, sizes 40 41 0
Added line to current block.
Making block at (73,106)->(320,192)
Found 1 blocks, 1 to_blocks
Blk 1, type 1 rerotation(1.00, -0.00), char(0.00,0.00), box:Bounding box=(73,106)->(320,192)
Testing underline on blob at (73,152)->(320,192), base=163
Occs:247 247 247
Testing underline on blob at (73,106)->(320,147), base=114
Occs:247 247 247
B:1 R:1 -- Can't do isolated row stats.
B:1 R:1 -- DON'T BELIEVE SPACE 128.00 74 20.00 -> 192.00.
B:1 R:1 L:247-- Kn:128 Sp:20 Thr:74 -- Kn:128.00 (144) Thr:160 (384) Sp:192.00
B:1 R:2 -- Can't do isolated row stats.
B:1 R:2 -- DON'T BELIEVE SPACE 128.00 74 20.00 -> 192.00.
B:1 R:2 L:247-- Kn:128 Sp:20 Thr:74 -- Kn:128.00 (144) Thr:160 (384) Sp:192.00
Row: Made 1 words in row ((73,152)(320,192))
Row: Made 1 words in row ((73,106)(320,147))
cleanup_blocks: # rows = 0 / 2
cleanup_blocks: # blocks = 0 / 1

-c invert_threshold=0.5 does not help recognizing the block.

rmast commented 2 years ago

./migneuzn ~/175789293-f39ddfdb-6f3e-4598-8d16-80a1f4a88b36.jpg > ~/175789293-f39ddfdb-6f3e-4598-8d16-80a1f4a88b36.uzn
tesseract ~/175789293-f39ddfdb-6f3e-4598-8d16-80a1f4a88b36.jpg - --psm 4

Gives the same error:

[> wis - clear | wis - clear

So that's also a possibility to focus on the issue! Thanks for these hints!

rmast commented 2 years ago

I made my own cut-out of that image, nearly the original block 28 and there was no issue at all recognizing the text correctly:

Unfortunately cutting out the picture with Paint recoded the jpeg, so it isn't representative.

amitdo commented 2 years ago

Unfortunately cutting out the picture with Paint recoded the jpeg, so it isn't representative.

Convert the whole image to PNG first, and then do image processing.

zdenop commented 2 years ago

Please let us know if you find an open source automatic segmenter that generally and unattendedly does a better job than Tesseract itself. I guess that would be a hit.

I saw some attempts to solve this problem with prepared templates (e.g. for invoices) based on a known document source. With this approach you can skip some parts like the logo, header, footer etc. to speed up OCR, or use custom OCR/postprocessing of amounts.

I heard there are some attempts to do image/document segmentation by machine learning, but I did not see any open source (working) solution.

Or could you say that x4 upscaling in general does a better job?

In the docs, there is a link to a test for the optimal letter size. So scaling could help, but you need to know the original letter size in advance to calculate the scaling. In a complicated layout with different fonts and sizes you of course first need to split the image into uniform blocks...
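
The arithmetic behind "know the letter size to calculate the scaling" is just a ratio. A tiny Python sketch follows; the target height of 32 pixels is a placeholder chosen for illustration, not a value taken from the docs, while the measured 16 pixels is the x-height estimate printed in the debug log earlier in this thread.

```python
def scale_factor(measured_px, target_px):
    """Factor by which to resize so letters reach the target height."""
    if measured_px <= 0:
        raise ValueError("measured height must be positive")
    return target_px / measured_px


def scaled_height(measured_px, factor):
    """Letter height after resizing by `factor`."""
    return measured_px * factor


if __name__ == "__main__":
    measured = 16          # "Estimate x-height=16 pixels" from the log
    assumed_target = 32    # placeholder, pick the value the docs recommend
    print(scale_factor(measured, assumed_target))
    # Heights that the 2x and 4x upscaling tried above would produce.
    print(scaled_height(measured, 2), scaled_height(measured, 4))
```

This only helps per block: as noted above, a page with mixed font sizes needs a separate factor for each uniform region.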

rmast commented 2 years ago

I tried EasyOCR as a segmenter. Using its segments as UZN on the image or on the inverted image doesn't make a difference. I still tend to dive into the error(s), despite the lack of test effort once the error is solved.

rmast commented 2 years ago

I'm now on a track for finding the cause of the double 'wis - clear'.

The second row of block 28 gives 4 words: A blob of the full row, "wis", "-" and "clear". The blob containing ">" is skipped when the space after it appears to end before the end of the full row in the first blob.

The full row-blob is not inverted in CheckInverseFlagAndDirection within stepblob.cpp:222. The other outlines of this row are inverted. I wonder whether the good_blob=false status plays a role here in not getting all blobs in the right order with respect to their generation (parent-child), but I guess CheckInverseFlagAndDirection based on some vague step_dir (coutln.cpp:562) might play a role as well. I don't understand how inversion is calculated here and what it has to do with steps and going counter clockwise.
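
On the question of what steps and counter-clockwise have to do with each other: in general, the traversal direction of a closed outline stored as unit steps can be read off the sign of its enclosed area. The Python sketch below is generic geometry (the shoelace formula over a step chain), offered as an analogy only; it is not the code in stepblob.cpp or coutln.cpp, and the step letters are invented for the sketch.

```python
# Unit steps of a chain-coded outline on a pixel grid, with y pointing up.
STEPS = {"R": (1, 0), "U": (0, 1), "L": (-1, 0), "D": (0, -1)}


def signed_area(steps):
    """Shoelace area of a closed step chain starting at the origin.

    Positive for counter-clockwise traversal, negative for clockwise.
    """
    x = y = 0
    twice_area = 0
    for step in steps:
        dx, dy = STEPS[step]
        nx, ny = x + dx, y + dy
        twice_area += x * ny - nx * y
        x, y = nx, ny
    if (x, y) != (0, 0):
        raise ValueError("outline is not closed")
    return twice_area / 2


def is_counter_clockwise(steps):
    """True if the chain runs counter-clockwise around its interior."""
    return signed_area(steps) > 0


if __name__ == "__main__":
    print(signed_area("RULD"), is_counter_clockwise("RULD"))
    print(signed_area("URDL"), is_counter_clockwise("URDL"))
```

A tracer that always keeps the foreground on the same side will walk outer boundaries and hole boundaries in opposite directions, so a direction test like this can tell them apart. How exactly Tesseract maps that onto its inverse flag would need to be confirmed in the source.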

rmast commented 2 years ago

-c edges_use_new_outline_complexity=1 doesn't solve these issues.

rmast commented 2 years ago

There appears to be something wrong with the decision-making around good and bad (rejected) blobs:

diff --git a/src/textord/tordmain.cpp b/src/textord/tordmain.cpp
index a7f2a168f..97952f1bd 100644
--- a/src/textord/tordmain.cpp
+++ b/src/textord/tordmain.cpp
@@ -668,12 +668,33 @@ void Textord::clean_small_noise_from_words(ROW *row) {
       C_OUTLINE_IT out_it(blob->out_list());
       for (out_it.mark_cycle_pt(); !out_it.cycled_list(); out_it.forward()) {
         C_OUTLINE *outline = out_it.data();
+        tprintf("Good %d %d %d %d Robert \n", outline->bounding_box().botleft().x()
+        , outline->bounding_box().botleft().y()
+        , outline->bounding_box().topright().x()
+        , outline->bounding_box().topright().y()
+        );
         outline->RemoveSmallRecursive(min_size, &out_it);
       }
       if (blob->out_list()->empty()) {
         delete blob_it.extract();
       }
     }
+    C_BLOB_IT blob_it2(word->rej_cblob_list());
+    for (blob_it2.mark_cycle_pt(); !blob_it2.cycled_list(); blob_it2.forward()) {
+      C_BLOB *blob = blob_it2.data();
+      C_OUTLINE_IT out_it(blob->out_list());
+      for (out_it.mark_cycle_pt(); !out_it.cycled_list(); out_it.forward()) {
+        C_OUTLINE *outline = out_it.data();
+        tprintf("Rejected %d %d %d %d Robert \n", outline->bounding_box().botleft().x()
+        , outline->bounding_box().botleft().y()
+        , outline->bounding_box().topright().x()
+        , outline->bounding_box().topright().y()
+        );
+      }
+    }
+
+
+
     if (word->cblob_list()->empty()) {
       if (!word_it.at_last()) {
         // The next word is no longer a fuzzy non space if it was before,

UZN file /home/rmast/kleiner3.uzn loaded.
Discarding parent of area 9897, child area=80, max8825.25 with child rect=231
Discarding parent of area 9594, child area=73, max8394.75 with child rect=231
Starting sh -c "trap 'kill %1' 0 1 2 ; java -Xms1024m -Xmx2048m -jar /home/rmast/tesseract/java/ScrollView.jar & wait"
ScrollView: Waiting for server...
Socket started on port 8461
Client connected
Adjusting row limits for block(2150,3188)
Row at 3183.004395 has min 3156.375000, max 3180.000000, size 23.625000
Row at 3137.788086 has min 3112.716309, max 3135.762695, size 23.046387
Row at 3183 yields spacing of 45.2163
Blob based spacing=(26.25,52.5), offset=33.177 row based=45.2163(0)
Estimate line size=26.25, spacing=52.5, offset=40.2881
Expanding bottom of row at 3137.788086 from 3132.026367 to 3131.225586
Expanding top of row at 3137.788086 from 3157.323975 to 3157.475586
Expanding bottom of row at 3183.004395 from 3176.692383 to 3176.441895
Expanding top of row at 3183.004395 from 3202.489014 to 3202.691895
Testing underline on blob at (2150,3149)->(2396,3188), base=3160
Occs:246 246 246
Testing underline on blob at (2150,3103)->(2396,3144), base=3085
Occs:0 0 246
Underlined blob at:Bounding box=(2150,3103)->(2396,3144)
Was:Bounding box=(2150,3103)->(2396,3144)
B:1 R:1 -- Can't do isolated row stats.
B:1 R:1 -- Inadequate certain spaces.
B:1 R:1 L:246-- Kn:3 Sp:12 Thr:7 -- Kn:3.00 (5) Thr:7 (12) Sp:12.00
B:1 R:2 L:186-- Kn:3 Sp:12 Thr:7 -- Kn:3.00 (5) Thr:7 (10) Sp:12.75
Row: Made 1 words in row ((2150,3149)(2396,3188))
Row: Made 4 words in row ((2150,3103)(2396,3144))
Rejected 2150 3149 2396 3188 Robert 
Rejected 2164 3159 2175 3180 Robert 
Rejected 2195 3154 2209 3176 Robert 
Rejected 2212 3160 2220 3176 Robert 
Rejected 2223 3160 2227 3176 Robert 
Rejected 2223 3178 2227 3182 Robert 
Rejected 2231 3160 2244 3176 Robert 
Rejected 2247 3160 2256 3180 Robert 
Good 2150 3103 2396 3144 Robert 
Rejected 2164 3113 2175 3134 Robert 
Good 2194 3115 2214 3130 Robert 
Good 2217 3115 2221 3130 Robert 
Good 2217 3132 2221 3137 Robert 
Good 2224 3114 2237 3130 Robert 
Good 2249 3121 2257 3125 Robert 
Good 2269 3114 2283 3130 Robert 
Good 2286 3114 2290 3137 Robert 
Good 2293 3114 2307 3130 Robert 
Good 2310 3114 2323 3130 Robert 
Good 2327 3115 2336 3130 Robert 
Row ending at (2336,3114.29): R=0.111111, dc=1, nc=9, ACCEPTED
cleanup_blocks: # rows = 1 / 2
cleanup_blocks: # blocks = 1 / 1
> wis -clear | wis - clear

The parent of the lower row appears to be kept alive, while the children of the upper row are all rejected as well.

Keeping the parent of the lower row alive makes the > rejected.
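
To tally the "Good"/"Rejected" lines per row without counting by hand, here is a small Python sketch. The log lines are copied from the output above (including the "Robert" marker added by the debug patch), and the row boxes come from the two "Row: Made ... words in row" lines; the names `upper` and `lower` and the helper functions are invented for this sketch.

```python
import re
from collections import Counter

LOG = """\
Rejected 2150 3149 2396 3188 Robert
Rejected 2164 3159 2175 3180 Robert
Rejected 2195 3154 2209 3176 Robert
Rejected 2212 3160 2220 3176 Robert
Rejected 2223 3160 2227 3176 Robert
Rejected 2223 3178 2227 3182 Robert
Rejected 2231 3160 2244 3176 Robert
Rejected 2247 3160 2256 3180 Robert
Good 2150 3103 2396 3144 Robert
Rejected 2164 3113 2175 3134 Robert
Good 2194 3115 2214 3130 Robert
Good 2217 3115 2221 3130 Robert
Good 2217 3132 2221 3137 Robert
Good 2224 3114 2237 3130 Robert
Good 2249 3121 2257 3125 Robert
Good 2269 3114 2283 3130 Robert
Good 2286 3114 2290 3137 Robert
Good 2293 3114 2307 3130 Robert
Good 2310 3114 2323 3130 Robert
Good 2327 3115 2336 3130 Robert
"""

# Row boxes as (left, bottom, right, top); y grows upward in these logs.
ROWS = {"upper": (2150, 3149, 2396, 3188), "lower": (2150, 3103, 2396, 3144)}
LINE_RE = re.compile(r"(Good|Rejected) (\d+) (\d+) (\d+) (\d+) Robert")


def parse_outlines(log):
    """Return (status, (left, bottom, right, top)) for each matching line."""
    outlines = []
    for line in log.splitlines():
        m = LINE_RE.match(line)
        if m:
            outlines.append((m.group(1), tuple(int(v) for v in m.groups()[1:])))
    return outlines


def contains(outer, inner):
    """True if box `inner` lies inside box `outer` (edges included)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def count_by_row(outlines, rows):
    """Count Good/Rejected outlines that fall inside each row box."""
    counts = {name: Counter() for name in rows}
    for status, box in outlines:
        for name, row in rows.items():
            if contains(row, box):
                counts[name][status] += 1
                break
    return counts


if __name__ == "__main__":
    for name, counter in count_by_row(parse_outlines(LOG), ROWS).items():
        print(name, dict(counter))
```

The tally makes the asymmetry easy to see: every outline in the upper row is rejected, including the full-row one, while in the lower row the full-row outline is listed as good.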

rmast commented 2 years ago

During this part of the processing, good is still good and rejected is still rejected (parents are rejected, children pass through):

In the upper part the good ones are still good

During Textord::filter_blobs (in 3.04.01/leptonica1.74.0 to get optimal performance of the ScrollView ) using -c textord_show_boxes=1:

diff --git a/textord/tordmain.cpp b/textord/tordmain.cpp
index 14cb7171..2a9c8815 100644
--- a/textord/tordmain.cpp
+++ b/textord/tordmain.cpp
@@ -272,9 +272,9 @@ void Textord::filter_blobs(ICOORD page_tr,         // top right
       if (to_win == NULL)
         create_to_win(page_tr);
       plot_box_list(to_win, &block->noise_blobs, ScrollView::WHITE);
-      plot_box_list(to_win, &block->small_blobs, ScrollView::WHITE);
-      plot_box_list(to_win, &block->large_blobs, ScrollView::WHITE);
-      plot_box_list(to_win, &block->blobs, ScrollView::WHITE);
+      plot_box_list(to_win, &block->small_blobs, ScrollView::RED);
+      plot_box_list(to_win, &block->large_blobs, ScrollView::GREEN);
+      plot_box_list(to_win, &block->blobs, ScrollView::BLUE);
     }
     #endif  // GRAPHICS_DISABLED
   }

Textord filter_blobs

So dots and minuses remain noise, rejected parents are from now on called large blobs, and the separate letters are just 'blobs'. I'm not sure whether this translation path is the only suspect translation, as it isn't done at word level but at block level.

rmast commented 2 years ago

Just killing the non-inverted parents in stepblob.cpp solves the issue for both lines:

> print
> wis - clear

diff --git a/src/ccstruct/stepblob.cpp b/src/ccstruct/stepblob.cpp
index 4c61b6c65..aac639747 100644
--- a/src/ccstruct/stepblob.cpp
+++ b/src/ccstruct/stepblob.cpp
@@ -209,7 +209,7 @@ void C_BLOB::ConstructBlobsFromOutlines(bool good_blob, C_OUTLINE_LIST *outline_
     blob->CheckInverseFlagAndDirection();
     // Put on appropriate list.
     if (!blob_is_good && bad_blobs_it != nullptr) {
-      bad_blobs_it->add_after_then_move(blob);
+      //bad_blobs_it->add_after_then_move(blob);
     } else {
       good_blobs_it->add_after_then_move(blob);
     }

However, the question is whether there are examples of parents that must not be killed. Under what conditions should (parts of) parents be preserved, and should those parents be inverted if their children are inverted as well?

Are there other paths leading to this !blob_is_good that I miss when I just kill everything as if they are parents of maintained children?

rmast commented 2 years ago

When the parents are left as in the original code, far too many blobs per row remain during make_prop_words. For the '> print' row there are blobs that seem to represent the spaces, and at the end of the row there even seems to be some artificial spacing of 21 or 22 positions. Instead of the 8 expected blobs, of which 2 were already rejected, there appear to be 19 blobs.

row blob list, print line

When traversing the bounding boxes of those blobs the spaces and some combinations of inverted letters seem to have made up some extra boxes. The better reading 'wis - clear' line doesn't contain such intermittent space-blobs, so I guess they're the uninverted revived parents cut at the spacing with their children.

For the '> wis - clear' row there are fewer superfluous blobnboxes.

Compared with the blobnboxes that comprise the complete block

filter_blobs block blobs, 18 other unique ones

there are 2 newly made-up blobnboxes:

row blob list, wis-clear line

0x5555555acdd0: Complete revived parent.
0x55555559f630: Letter 'w', probably a fake-blob seeded duplicate from the rejected parent.

I'll just try to look whether killing the parents at the proposed spot appears to have unwanted side-effects...

rmast commented 2 years ago

I tried the effects of killing the parents on 5.1.0 with the full page using ocrmypdf.

ocrmypdf --image-dpi 300 --pdfa-image-compression lossless -O0 ../rmast/175789293-f39ddfdb-6f3e-4598-8d16-80a1f4a88b36.jpg formulierhocrjpgmetpatch5.1.0.pdf

For some reason the resulting selection from Adobe Acrobat Reader improves with this patch:

The second column 'Waarom dit formulier?' can be selected separately with my patched version, while selecting it in the original 5.1.0 version tries to select the second column in parallel and pastes the lines intermixed.

formulierhocrjpgmetpatch5.1.0.pdf formulierhocrjpg.pdf

rmast commented 2 years ago

With 5.2.0 default settings the inverted Toelichting 2.1 is read correctly; however, in none of the versions is the bottom line with the ® sign complete.

amitdo commented 2 years ago

Don't do your tests with PDF as output. Different PDF viewers can present the same file differently.

rmast commented 2 years ago

Yes, Zathura makes a mess of the selection, not clearly showing what lines are selected or not.

rmast commented 1 year ago

Please let us know if you find an open source automatic segmenter that generally and unattendedly does a better job than Tesseract itself. I guess that would be a hit.

rmast commented 1 year ago

An error should not be papered over by manipulating the source image until someone looking at the result approves it. Errors should be examined and solved, aiming at a Tesseract that operates unattended, at least for the purpose of the image compression that Merlijn Wajer wants to achieve at the Internet Archive.
