Open rmast opened 2 years ago
I tried git bisect now with tessdata_fast/eng and did not find a Tesseract release without that issue. Even 4.0.0 creates the double content in my test.
AFAIK, fast was trained on inverted text and non-inverted text and on upright pages and upside down pages.
I'm not convinced it's inversion related. I think it already comes from somewhere where segments are propagated into each other, probably searching underlines. If I run this statement wis-clear is still double, and print is still missing:
tesseract --dpi 300 -l Latin 175789293-f39ddfdb-6f3e-4598-8d16-80a1f4a88b36.jpg outputwithoutinvert
So textline inversion might be removed as a label.
By the way, this one is compiled without legacy, so it's in the new parts
Testing underline on blob at (2149,3149)->(2396,3189), base=3160
Occs:247 247 247
Testing underline on blob at (2149,3103)->(2396,3144), base=3085
Occs:0 0 247
Underlined blob at:Bounding box=(2149,3103)->(2396,3144)
Was:Bounding box=(2149,3103)->(2396,3144)
Segmenting baseline of 19 blobs at (2149,3149)
Made 1 segments on row at (2355,3149)
Segmenting baseline of 11 blobs at (2164,3113)
Made 1 segments on row at (2307,3114)
Input height=26.25, Estimate x-height=40 pixels, jumplimit=6.00
1(2168,3149), Diff=-11.00, Delta=0.000, Drift=0.000, P=0
2(2189,3149), Diff=-11.00, Delta=0.000, Drift=0.000, P=0
3(2209,3149), Diff=-11.00, Delta=0.000, Drift=0.000, P=0
4(2230,3149), Diff=-11.00, Delta=0.000, Drift=0.000, P=0
5(2251,3149), Diff=-11.00, Delta=0.000, Drift=0.000, P=0
6(2271,3149), Diff=-11.00, Delta=0.000, Drift=0.000, P=0
7(2292,3149), Diff=-11.00, Delta=0.000, Drift=0.000, P=0
8(2313,3149), Diff=-11.00, Delta=0.000, Drift=0.000, P=0
9(2333,3149), Diff=-11.00, Delta=0.000, Drift=0.000, P=0
10(2354,3149), Diff=-11.00, Delta=0.000, Drift=0.000, P=0
11(2375,3149), Diff=-11.00, Delta=0.000, Drift=0.000, P=0
1(2168,3149), Diff=-11.00, Delta=0.000, Drift=0.000, P=0
0(2149,3149), Diff=-11.00, Delta=0.000, Drift=0.000, P=0
Input height=26.25, Estimate x-height=16 pixels, jumplimit=2.40
4(2269,3114), Diff=-0.32, Delta=0.000, Drift=0.000, P=0
5(2286,3114), Diff=-0.47, Delta=-0.148, Drift=0.000, P=0
6(2293,3114), Diff=-0.62, Delta=-0.247, Drift=-0.049, P=0
7(2310,3114), Diff=-0.81, Delta=-0.362, Drift=-0.132, P=0
8(2327,3115), Diff=0.00, Delta=0.573, Drift=-0.252, P=0
4(2269,3114), Diff=-0.32, Delta=0.000, Drift=0.000, P=0
3(2224,3114), Diff=0.25, Delta=0.568, Drift=0.000, P=0
2(2217,3115), Diff=1.38, Delta=1.514, Drift=0.189, P=0
1(2194,3115), Diff=1.57, Delta=1.195, Drift=0.694, P=0
0(2164,3113), Diff=0.00, Delta=-0.771, Drift=1.092, P=0
First turn is 0 at (2169,3113)
Turn 1 is 1 at (2204,3115), mid pt is 0@2169, final @2187
Segmenting baseline of 34 blobs at (1902,2842)
Made 1 segments on row at (2347,2841)
So textline inversion might be removed as a label.
I also no longer think that it is related to textline inversion, as the issue also occurs in old versions like 4.0.0. My previous git bisect result was misleading.
By the way, this one is compiled without legacy, so it's in the new parts
The layout detection is mostly still old code.
I've now pinpointed the disappearing upper boundingbox from
Block1 textord.cpp
Block 28
Bounding box=(2149,3103)->(2396,3189)
Bounding box=(2149,3149)->(2396,3189)
Bounding box=(2149,3103)->(2396,3144)
Bounding box=(2194,3114)->(2237,3137)
Bounding box=(2249,3121)->(2257,3125)
Bounding box=(2269,3114)->(2336,3137)
/Block
as disappearing in textord.cpp
// Remove empties.
cleanup_blocks(PSM_WORD_FIND_ENABLED(pageseg_mode), blocks);
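To make the "Remove empties" step easier to reason about, here is a toy model of it. It is only a sketch under assumptions: it is not Tesseract's code, the nested-list data model is hypothetical, and it assumes that a word without any good blobs is dropped, that a row without words is dropped, and that a block without rows is dropped. The returned pairs mirror the "kept / total" counts that cleanup_blocks prints in the debug output.

```python
def cleanup_blocks(blocks):
    """Toy model of 'remove empties' (hypothetical data model, not Tesseract code).

    blocks: list of blocks; a block is a list of rows; a row is a list of
    words; a word is a list of good blobs (any objects).
    Returns (kept_blocks, (rows_kept, rows_total), (blocks_kept, blocks_total)).
    """
    rows_total = sum(len(block) for block in blocks)
    kept_blocks = []
    for block in blocks:
        kept_rows = []
        for row in block:
            # Assumption: a word with no good blobs left is discarded.
            words = [word for word in row if word]
            if words:
                kept_rows.append(words)
        # Assumption: a block with no rows left is discarded.
        if kept_rows:
            kept_blocks.append(kept_rows)
    rows_kept = sum(len(block) for block in kept_blocks)
    return kept_blocks, (rows_kept, rows_total), (len(kept_blocks), len(blocks))


# One block, two rows, each row has one word whose blobs were all rejected.
all_rejected = [[[[]], [[]]]]
_, rows, blks = cleanup_blocks(all_rejected)
print(f"cleanup_blocks: # rows = {rows[0]} / {rows[1]}")
print(f"cleanup_blocks: # blocks = {blks[0]} / {blks[1]}")
```

In this model a row disappears as soon as every blob in its words ends up on the rejected side, which is the situation the debug output seems to describe for the upper row.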
This might be involved:
B:28 R:1 -- Can't do isolated row stats.
B:28 R:1 -- Inadequate certain spaces.
tesseract -c textord_restore_underlines=1 --dpi 300 -l Latin -c textord_noise_rejrows=0 -c textord_debug_block=28 -c textord_noise_debug=1 -c textord_debug_tabfind=1 -c textord_debug_bugs=1 -c textord_show_final_rows=1 -c tosp_debug_level=6 /home/rmast/175789293-f39ddfdb-6f3e-4598-8d16-80a1f4a88b36.jpg outputdebug85online
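Long debug invocations like the one above are easy to mistype. A minimal sketch of building the argument list from a dictionary of config variables follows. The variable names and values are the ones used in this thread; the helper name build_tesseract_command and the file names input.jpg and outputdebug are placeholders I made up. The sketch only builds and prints the command, it does not run Tesseract.

```python
import shlex


def build_tesseract_command(image, output_base, lang="Latin", dpi=300, config=None):
    """Return an argv list: options first, then image and output base,
    in the same order as the commands used in this thread."""
    cmd = ["tesseract", "--dpi", str(dpi), "-l", lang]
    for name, value in (config or {}).items():
        cmd += ["-c", f"{name}={value}"]
    cmd += [image, output_base]
    return cmd


# Debug variables copied from the command line above.
debug_config = {
    "textord_restore_underlines": 1,
    "textord_noise_rejrows": 0,
    "textord_debug_block": 28,
    "textord_noise_debug": 1,
    "textord_debug_tabfind": 1,
    "textord_debug_bugs": 1,
    "textord_show_final_rows": 1,
    "tosp_debug_level": 6,
}

cmd = build_tesseract_command("input.jpg", "outputdebug", config=debug_config)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd) when an actual run is wanted.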
With this image, I get an empty output with all available eng/Latin models.
After upscaling (2x)
output:
> wis - clear
After upscaling (4x)
output:
> print
> Wis - clear
We know that some parts are skipped in images with a complex (table-like) layout. Tesseract has just a basic document layout analysis.
Do your own layout segmentation for all complicated document layouts and store it in a uzn file, or OCR each segment individually. Also, I suggest the following docs: https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md (black text on white background).
Here is an example of amitdo's test image.
tesseract inverted.png - --psm 4
UZN file inverted.uzn loaded.
> print
> wis - clear
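For anyone who wants to try the UZN route with boxes taken from the debug output, here is a small sketch that turns a Tesseract-style bounding box into a UZN line. It rests on assumptions: that a UZN line has the form "left top width height label" with the origin at the top-left of the image, and that the boxes printed in the debug log have their origin at the bottom-left. The image height of 3508 pixels is a made-up value for illustration; use the real height of your image.

```python
def box_to_uzn_line(left, bottom, right, top, image_height, label="Text"):
    """Convert a bottom-left-origin box into an assumed UZN line
    'left top width height label' with a top-left origin."""
    width = right - left
    height = top - bottom
    y_from_top = image_height - top  # flip the y axis
    return f"{left} {y_from_top} {width} {height} {label}"


# Box of block 28 as printed in the debug output: (2149,3103)->(2396,3189).
# image_height=3508 is a hypothetical value, not taken from the thread.
line = box_to_uzn_line(2149, 3103, 2396, 3189, image_height=3508)
print(line)
```

The line would go into a file with the same base name as the image and the extension .uzn, which is then picked up with --psm 4 as in the example above.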
Please let us know if you find an open source automatic segmenter that generally and unattendedly does a better job than Tesseract itself. I guess that would be a hit.
Or could you say that x4 upscaling in general does a better job?
With this image, I get an empty output with all available eng/Latin models.
That's interesting! That makes focussing on the issue easier.
I've run it with
tesseract -c textord_restore_underlines=1 --dpi 300 -l Latin -c textord_noise_rejrows=0 -c textord_debug_block=28 -c textord_noise_debug=1 -c textord_debug_tabfind=1 -c textord_debug_bugs=1 -c textord_show_final_rows=1 -c tosp_debug_level=6 -c tosp_redo_kern_limit=1 -c tosp_enough_small_gaps=0.05 -c tosp_gap_factor=0.17 -c tosp_row_use_cert_spaces=false doetiehet.png output
and it gave
D print | D wis- crear |
when the cleanup_blocks was commented out.
With the cleanup blocks this was the debug-result (and no output).
Vertical skew vector=(0,1)
Starting sh -c "trap 'kill %1' 0 1 2 ; java -Xms1024m -Xmx2048m -jar /home/rmast/tesseract/java/ScrollView.jar & wait"
ScrollView: Waiting for server...
Socket started on port 8461
Client connected
Click at (176, 180)
Click at (176, 180)
Inserted 18 blobs into grid, 0 rejected.
Beginning real tab search with vertical = 0,1...
Vertical skew vector=(0,1)
Checking for vertical lines
Moved 0 large blobs to normal list
Inserted 0 blobs into grid, 0 rejected.
Inserted 17 blobs into grid, 0 rejected.
Beginning real tab search with vertical = 0,1...
Vertical skew vector=(0,1)
Checking for vertical lines
Vertical skew vector=(0,1)
Inserted 0 blobs into grid, 0 rejected.
Inserted 17 blobs into grid, 0 rejected.
Found 1 Column candidates:
Found 1 Improved columns:
Found 1 Final Columns:
Column id 0 applies to range = 0 - 11
Inserted 0 blobs into grid, 0 rejected.
Inserted 17 blobs into grid, 0 rejected.
Considering part for merge at:ColPart: (M53-B73-B73/74,152/153)->(320B-320B-340M/320,192/192) w-ok=0, v-ok=0, type=1T4, fc=-1, lc=-1, boxes=8 ts=0 bs=0 ls=0 rs=0
Considering part for merge at:ColPart: (M53-B73-B73/74,106/107)->(320B-320B-340M/320,147/147) w-ok=0, v-ok=0, type=1T4, fc=-1, lc=-1, boxes=12 ts=0 bs=0 ls=0 rs=0
Changed column groups at grid index 5, y=130
ColPart: (M53-B73-B73/74,152/153)->(320B-320B-340M/320,192/192) w-ok=0, v-ok=0, type=1T4, fc=1, lc=1, boxes=8 ts=0 bs=0 ls=0 rs=0
side step = 6.50, top spacing = 45, bottom spacing=46
ColPart: (M53-B73-B73/74,106/107)->(320B-320B-340M/320,147/147) w-ok=0, v-ok=0, type=1T4, fc=1, lc=1, boxes=12 ts=0 bs=0 ls=0 rs=0
side step = 2.50, top spacing = 262, bottom spacing=262
Spacings unequal: upper:45/46, lower:262/262, sizes 40 41 0
Added line to current block.
Making block at (73,106)->(320,192)
Found 1 blocks, 1 to_blocks
Blk 1, type 1 rerotation(1.00, -0.00), char(0.00,0.00), box:Bounding box=(73,106)->(320,192)
Testing underline on blob at (73,152)->(320,192), base=163
Occs:247 247 247
Testing underline on blob at (73,106)->(320,147), base=114
Occs:247 247 247
B:1 R:1 -- Can't do isolated row stats.
B:1 R:1 -- DON'T BELIEVE SPACE 128.00 74 20.00 -> 192.00.
B:1 R:1 L:247-- Kn:128 Sp:20 Thr:74 -- Kn:128.00 (144) Thr:160 (384) Sp:192.00
B:1 R:2 -- Can't do isolated row stats.
B:1 R:2 -- DON'T BELIEVE SPACE 128.00 74 20.00 -> 192.00.
B:1 R:2 L:247-- Kn:128 Sp:20 Thr:74 -- Kn:128.00 (144) Thr:160 (384) Sp:192.00
Row: Made 1 words in row ((73,152)(320,192))
Row: Made 1 words in row ((73,106)(320,147))
cleanup_blocks: # rows = 0 / 2
cleanup_blocks: # blocks = 0 / 1
Vertical skew vector=(0,1)
Click at (195, 174)
Click at (195, 174)
Click at (188, 157)
Click at (188, 157)
Click at (206, 143)
Click at (206, 143)
Inserted 18 blobs into grid, 0 rejected.
Beginning real tab search with vertical = 0,1...
Vertical skew vector=(0,1)
Checking for vertical lines
Moved 0 large blobs to normal list
Inserted 0 blobs into grid, 0 rejected.
Inserted 17 blobs into grid, 0 rejected.
Beginning real tab search with vertical = 0,1...
Vertical skew vector=(0,1)
Checking for vertical lines
Vertical skew vector=(0,1)
Inserted 0 blobs into grid, 0 rejected.
Inserted 17 blobs into grid, 0 rejected.
Found 1 Column candidates:
Found 1 Improved columns:
Found 1 Final Columns:
Column id 0 applies to range = 0 - 11
Inserted 0 blobs into grid, 0 rejected.
Inserted 17 blobs into grid, 0 rejected.
Considering part for merge at:ColPart: (M53-B73-B73/74,152/153)->(320B-320B-340M/320,192/192) w-ok=0, v-ok=0, type=1T4, fc=-1, lc=-1, boxes=8 ts=0 bs=0 ls=0 rs=0
Considering part for merge at:ColPart: (M53-B73-B73/74,106/107)->(320B-320B-340M/320,147/147) w-ok=0, v-ok=0, type=1T4, fc=-1, lc=-1, boxes=12 ts=0 bs=0 ls=0 rs=0
Changed column groups at grid index 5, y=130
ColPart: (M53-B73-B73/74,152/153)->(320B-320B-340M/320,192/192) w-ok=0, v-ok=0, type=1T4, fc=1, lc=1, boxes=8 ts=0 bs=0 ls=0 rs=0
side step = 6.50, top spacing = 45, bottom spacing=46
ColPart: (M53-B73-B73/74,106/107)->(320B-320B-340M/320,147/147) w-ok=0, v-ok=0, type=1T4, fc=1, lc=1, boxes=12 ts=0 bs=0 ls=0 rs=0
side step = 2.50, top spacing = 262, bottom spacing=262
Spacings unequal: upper:45/46, lower:262/262, sizes 40 41 0
Added line to current block.
Making block at (73,106)->(320,192)
Found 1 blocks, 1 to_blocks
Blk 1, type 1 rerotation(1.00, -0.00), char(0.00,0.00), box:Bounding box=(73,106)->(320,192)
Testing underline on blob at (73,152)->(320,192), base=163
Occs:247 247 247
Testing underline on blob at (73,106)->(320,147), base=114
Occs:247 247 247
B:1 R:1 -- Can't do isolated row stats.
B:1 R:1 -- DON'T BELIEVE SPACE 128.00 74 20.00 -> 192.00.
B:1 R:1 L:247-- Kn:128 Sp:20 Thr:74 -- Kn:128.00 (144) Thr:160 (384) Sp:192.00
B:1 R:2 -- Can't do isolated row stats.
B:1 R:2 -- DON'T BELIEVE SPACE 128.00 74 20.00 -> 192.00.
B:1 R:2 L:247-- Kn:128 Sp:20 Thr:74 -- Kn:128.00 (144) Thr:160 (384) Sp:192.00
Row: Made 1 words in row ((73,152)(320,192))
Row: Made 1 words in row ((73,106)(320,147))
cleanup_blocks: # rows = 0 / 2
cleanup_blocks: # blocks = 0 / 1
-c invert_threshold=0.5 does not help recognizing the block.
./migneuzn ~/175789293-f39ddfdb-6f3e-4598-8d16-80a1f4a88b36.jpg > ~/175789293-f39ddfdb-6f3e-4598-8d16-80a1f4a88b36.uzn
tesseract ~/175789293-f39ddfdb-6f3e-4598-8d16-80a1f4a88b36.jpg - --psm 4
Gives the same error:
[> wis - clear | wis - clear
So that's also a possibility to focus on the issue! Thanks for these hints!
I made my own cut-out of that image, nearly the original block 28 and there was no issue at all recognizing the text correctly:
Unfortunately cutting out the picture with Paint recoded the jpeg, so it isn't representative.
Convert the whole image to PNG first, and then do image processing.
Please let us know if you find an open source automatic segmenter that generally and unattendedly does a better job than Tesseract itself. I guess that would be a hit.
I saw some attempts to solve this problem with prepared templates (e.g. for invoices) based on a known document source. With this approach you can skip some parts like the logo, header, footer etc. to speed up OCR, or use custom OCR/postprocessing of amounts.
I heard there are some attempts to do image/document segmentation by machine learning, but I did not see any open source (working) solution.
Or could you say that x4 upscaling in general does a better job?
In the docs, there is a link to a test for the optimal letter size. So scaling could help, but you need to know the original letter size in advance to calculate the scaling. In a complicated layout with different fonts and sizes you of course first need to split the image into uniform blocks...
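The scaling calculation itself is simple once the original letter size is known. A minimal sketch, where both numbers are assumptions for illustration: a measured letter height of 16 pixels and a target height of 32 pixels. The target has to be taken from the linked test; 32 is only a placeholder here.

```python
def upscale_factor(measured_height_px, target_height_px):
    """Factor by which to enlarge an image so letters reach the target height.

    Never returns less than 1.0: this sketch only upscales, it does not shrink.
    """
    if measured_height_px <= 0:
        raise ValueError("measured_height_px must be positive")
    return max(1.0, target_height_px / measured_height_px)


# Placeholder numbers, not measurements from the thread's image.
print(upscale_factor(16, 32))
print(upscale_factor(40, 32))
```

Because the factor depends on the letter size of one region, a page with mixed font sizes would need one factor per uniform block, which is the splitting problem mentioned above.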
I tried EasyOCR as a segmenter. Using its segments as UZN on the image or on the inverted image doesn't make a difference. I still tend to dive into the error(s), despite the lack of test effort when the error is solved.
I'm now on a track for finding the cause of the double 'wis - clear'.
The second row of block 28 gives 4 words: A blob of the full row, "wis", "-" and "clear". The blob containing ">" is skipped when the space after it appears to end before the end of the full row in the first blob.
The full row-blob is not inverted in CheckInverseFlagAndDirection within stepblob.cpp:222. The other outlines of this row are inverted. I wonder whether the good_blob=false status plays a role here in not getting all blobs in the right order with respect to their generation (parent-child), but I guess CheckInverseFlagAndDirection based on some vague step_dir (coutln.cpp:562) might play a role as well. I don't understand how inversion is calculated here and what it has to do with steps and going counter clockwise.
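On the question of what steps and going counterclockwise have to do with each other: the sketch below is a generic illustration, not Tesseract's code. It assumes an outline is stored as a closed chain of unit steps with the y axis pointing up, and that the direction check boils down to the sign of the net turning. Walking a closed outline one way gives a net of four left turns, walking it the other way gives four right turns, so the sign tells in which direction the outline was traced.

```python
def turn_direction(steps):
    """Net number of quarter turns around a closed outline of axis-aligned
    unit steps given as (dx, dy) pairs, with the y axis pointing up.

    Positive result: counterclockwise traversal. Negative: clockwise.
    """
    total = 0
    n = len(steps)
    for i in range(n):
        dx1, dy1 = steps[i]
        dx2, dy2 = steps[(i + 1) % n]  # wrap around: the outline is closed
        # Cross product of consecutive steps: +1 left turn, -1 right turn,
        # 0 when the direction does not change.
        total += dx1 * dy2 - dy1 * dx2
    return total


ccw_square = [(1, 0), (0, 1), (-1, 0), (0, -1)]
cw_square = [(0, 1), (1, 0), (0, -1), (-1, 0)]
print(turn_direction(ccw_square), turn_direction(cw_square))
```

If outer outlines and holes are traced in opposite directions, a sign that does not match the expectation would be a natural way to flag an outline as inverse, but whether stepblob.cpp and coutln.cpp work exactly like this is an assumption on my side.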
-c edges_use_new_outline_complexity=1 doesn't solve these issues.
There appears to be something wrong with the decision-making around good and bad (rejected) blobs:
diff --git a/src/textord/tordmain.cpp b/src/textord/tordmain.cpp
index a7f2a168f..97952f1bd 100644
--- a/src/textord/tordmain.cpp
+++ b/src/textord/tordmain.cpp
@@ -668,12 +668,33 @@ void Textord::clean_small_noise_from_words(ROW *row) {
C_OUTLINE_IT out_it(blob->out_list());
for (out_it.mark_cycle_pt(); !out_it.cycled_list(); out_it.forward()) {
C_OUTLINE *outline = out_it.data();
+ tprintf("Good %d %d %d %d Robert \n", outline->bounding_box().botleft().x()
+ , outline->bounding_box().botleft().y()
+ , outline->bounding_box().topright().x()
+ , outline->bounding_box().topright().y()
+ );
outline->RemoveSmallRecursive(min_size, &out_it);
}
if (blob->out_list()->empty()) {
delete blob_it.extract();
}
}
+ C_BLOB_IT blob_it2(word->rej_cblob_list());
+ for (blob_it2.mark_cycle_pt(); !blob_it2.cycled_list(); blob_it2.forward()) {
+ C_BLOB *blob = blob_it2.data();
+ C_OUTLINE_IT out_it(blob->out_list());
+ for (out_it.mark_cycle_pt(); !out_it.cycled_list(); out_it.forward()) {
+ C_OUTLINE *outline = out_it.data();
+ tprintf("Rejected %d %d %d %d Robert \n", outline->bounding_box().botleft().x()
+ , outline->bounding_box().botleft().y()
+ , outline->bounding_box().topright().x()
+ , outline->bounding_box().topright().y()
+ );
+ }
+ }
+
+
+
if (word->cblob_list()->empty()) {
if (!word_it.at_last()) {
// The next word is no longer a fuzzy non space if it was before,
UZN file /home/rmast/kleiner3.uzn loaded.
Discarding parent of area 9897, child area=80, max8825.25 with child rect=231
Discarding parent of area 9594, child area=73, max8394.75 with child rect=231
Starting sh -c "trap 'kill %1' 0 1 2 ; java -Xms1024m -Xmx2048m -jar /home/rmast/tesseract/java/ScrollView.jar & wait"
ScrollView: Waiting for server...
Socket started on port 8461
Client connected
Adjusting row limits for block(2150,3188)
Row at 3183.004395 has min 3156.375000, max 3180.000000, size 23.625000
Row at 3137.788086 has min 3112.716309, max 3135.762695, size 23.046387
Row at 3183 yields spacing of 45.2163
Blob based spacing=(26.25,52.5), offset=33.177 row based=45.2163(0)
Estimate line size=26.25, spacing=52.5, offset=40.2881
Expanding bottom of row at 3137.788086 from 3132.026367 to 3131.225586
Expanding top of row at 3137.788086 from 3157.323975 to 3157.475586
Expanding bottom of row at 3183.004395 from 3176.692383 to 3176.441895
Expanding top of row at 3183.004395 from 3202.489014 to 3202.691895
Testing underline on blob at (2150,3149)->(2396,3188), base=3160
Occs:246 246 246
Testing underline on blob at (2150,3103)->(2396,3144), base=3085
Occs:0 0 246
Underlined blob at:Bounding box=(2150,3103)->(2396,3144)
Was:Bounding box=(2150,3103)->(2396,3144)
B:1 R:1 -- Can't do isolated row stats.
B:1 R:1 -- Inadequate certain spaces.
B:1 R:1 L:246-- Kn:3 Sp:12 Thr:7 -- Kn:3.00 (5) Thr:7 (12) Sp:12.00
B:1 R:2 L:186-- Kn:3 Sp:12 Thr:7 -- Kn:3.00 (5) Thr:7 (10) Sp:12.75
Row: Made 1 words in row ((2150,3149)(2396,3188))
Row: Made 4 words in row ((2150,3103)(2396,3144))
Rejected 2150 3149 2396 3188 Robert
Rejected 2164 3159 2175 3180 Robert
Rejected 2195 3154 2209 3176 Robert
Rejected 2212 3160 2220 3176 Robert
Rejected 2223 3160 2227 3176 Robert
Rejected 2223 3178 2227 3182 Robert
Rejected 2231 3160 2244 3176 Robert
Rejected 2247 3160 2256 3180 Robert
Good 2150 3103 2396 3144 Robert
Rejected 2164 3113 2175 3134 Robert
Good 2194 3115 2214 3130 Robert
Good 2217 3115 2221 3130 Robert
Good 2217 3132 2221 3137 Robert
Good 2224 3114 2237 3130 Robert
Good 2249 3121 2257 3125 Robert
Good 2269 3114 2283 3130 Robert
Good 2286 3114 2290 3137 Robert
Good 2293 3114 2307 3130 Robert
Good 2310 3114 2323 3130 Robert
Good 2327 3115 2336 3130 Robert
Row ending at (2336,3114.29): R=0.111111, dc=1, nc=9, ACCEPTED
cleanup_blocks: # rows = 1 / 2
cleanup_blocks: # blocks = 1 / 1
> wis -clear | wis - clear
The parent of the lower row appears to be kept alive, while the children of the upper row are all rejected as well.
Keeping the parent of the lower row alive makes the > rejected.
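To check the parent/child reading of the output above without eyeballing coordinates, here is a small sketch that parses the "Good"/"Rejected" lines printed by my tprintf patch and lists the outlines whose bounding box contains another logged box. The field order (left, bottom, right, top) follows the patch; the helper names are made up, and the sample is a five-line excerpt of the log.

```python
import re

LINE_RE = re.compile(r"^(Good|Rejected) (\d+) (\d+) (\d+) (\d+) Robert$")


def parse_outline_log(text):
    """Return a list of (status, (left, bottom, right, top)) tuples."""
    boxes = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            status = match.group(1)
            box = tuple(int(v) for v in match.groups()[1:])
            boxes.append((status, box))
    return boxes


def contains(outer, inner):
    """True if box 'outer' fully encloses the different box 'inner'."""
    return (outer != inner
            and outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def parents(boxes):
    """Entries whose box encloses at least one other logged box."""
    return [(s, b) for s, b in boxes
            if any(contains(b, other) for _, other in boxes)]


sample_log = """\
Rejected 2150 3149 2396 3188 Robert
Rejected 2164 3159 2175 3180 Robert
Good 2150 3103 2396 3144 Robert
Rejected 2164 3113 2175 3134 Robert
Good 2194 3115 2214 3130 Robert
"""

found = parents(parse_outline_log(sample_log))
for status, box in found:
    print(status, box)
```

On this excerpt the full-row box of the upper row comes out as a rejected parent and the full-row box of the lower row as a good parent.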
During this part of processing good is still good and rejected is still rejected (parents are rejected, children are coming by):
During Textord::filter_blobs (in 3.04.01/leptonica1.74.0 to get optimal performance of the ScrollView ) using -c textord_show_boxes=1:
diff --git a/textord/tordmain.cpp b/textord/tordmain.cpp
index 14cb7171..2a9c8815 100644
--- a/textord/tordmain.cpp
+++ b/textord/tordmain.cpp
@@ -272,9 +272,9 @@ void Textord::filter_blobs(ICOORD page_tr, // top right
if (to_win == NULL)
create_to_win(page_tr);
plot_box_list(to_win, &block->noise_blobs, ScrollView::WHITE);
- plot_box_list(to_win, &block->small_blobs, ScrollView::WHITE);
- plot_box_list(to_win, &block->large_blobs, ScrollView::WHITE);
- plot_box_list(to_win, &block->blobs, ScrollView::WHITE);
+ plot_box_list(to_win, &block->small_blobs, ScrollView::RED);
+ plot_box_list(to_win, &block->large_blobs, ScrollView::GREEN);
+ plot_box_list(to_win, &block->blobs, ScrollView::BLUE);
}
#endif // GRAPHICS_DISABLED
}
So dots and minuses remain noise, rejected parents are from now on called large blobs, and the separate letters are just 'blobs'. I'm not sure whether this translation path is the only suspect one, as it isn't done on word level but on block level.
Just killing the non-inverted parents in stepblob.cpp solves the issue for both lines:
> print
> wis - clear
diff --git a/src/ccstruct/stepblob.cpp b/src/ccstruct/stepblob.cpp
index 4c61b6c65..aac639747 100644
--- a/src/ccstruct/stepblob.cpp
+++ b/src/ccstruct/stepblob.cpp
@@ -209,7 +209,7 @@ void C_BLOB::ConstructBlobsFromOutlines(bool good_blob, C_OUTLINE_LIST *outline_
blob->CheckInverseFlagAndDirection();
// Put on appropriate list.
if (!blob_is_good && bad_blobs_it != nullptr) {
- bad_blobs_it->add_after_then_move(blob);
+ //bad_blobs_it->add_after_then_move(blob);
} else {
good_blobs_it->add_after_then_move(blob);
}
However, the question is whether there are examples of parents that may not be killed. With what conditions should (parts of) parents be preserved, and should those parents be inverted if their children are inverted as well?
Are there other paths leading to this !blob_is_good that I miss when I just kill everything as if they are parents of maintained children?
When the parents are left in place as in the original code, far too many blobs per row remain during make_prop words. For the '> print' row there are blobs that seem to represent the spaces, and at the end of the row there even seems to be some artificial spacing of 21 or 22 positions. Instead of the 8 expected blobs, of which 2 were already rejected, there appear to be 19 blobs.
When traversing the bounding boxes of those blobs the spaces and some combinations of inverted letters seem to have made up some extra boxes. The better reading 'wis - clear' line doesn't contain such intermittent space-blobs, so I guess they're the uninverted revived parents cut at the spacing with their children.
For the '> wis - clear' row there are fewer superfluous blobnboxes.
Of the blobnboxes that comprise the complete block
there are 2 new blobnboxes made up:
0x5555555acdd0: Complete revived parent.
0x55555559f630: Letter 'w', probably a fake-blob seeded duplicate from the rejected parent.
I'll just try to look whether killing the parents at the proposed spot appears to have unwanted side-effects...
I tried the effects of killing the parents on 5.1.0 with the full page using ocrmypdf.
ocrmypdf --image-dpi 300 --pdfa-image-compression lossless -O0 ../rmast/175789293-f39ddfdb-6f3e-4598-8d16-80a1f4a88b36.jpg formulierhocrjpgmetpatch5.1.0.pdf
For some reason the resulting selection from Adobe Acrobat Reader improves with this patch:
The second column 'Waarom dit formulier?' can be selected separately with my patched version, while selecting it in the original 5.1.0 version tries to select the second column in parallel and pastes the lines intermixed.
With 5.2.0 default settings the inverted Toelichting 2.1 is correctly read, however with none of the versions the bottom line with the ®-sign is complete.
Don't do your tests with PDF as output. Different PDF viewers can present the same file differently.
Yes, Zathura makes a mess of the selection, not clearly showing what lines are selected or not.
Please let us know if you find an open source automatic segmenter that generally and unattendedly does a better job than Tesseract itself. I guess that would be a hit.
From: zdenop
Sent: Thursday, July 21, 2022 6:15:28 PM
Subject: Re: [tesseract-ocr/tesseract] Of two inverted top right texts one gets scanned double, the upper one disappears (Issue #3871)
We know that some parts are skipped at complex layout (table-like) images. Tesseract has just a basic document layout analysis.
Do your own layout segmentation for all complicated document layouts and store it in uzn file/each segment OCR individually. Also, I suggest the following docs https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md (black text on white background).
Here is an example of amitdo's test image.
tesseract inverted.png - --psm 4
UZN file inverted.uzn loaded.
wis - clear
i3871_inverted.zip https://github.com/tesseract-ocr/tesseract/files/9160801/i3871_inverted.zip
An error should not be blurred by manipulating the source image until someone looking at it approves the result. Errors should be examined and solved, aiming at a Tesseract that operates unattended, at least for the purpose of the image compression Merlijn Wajer wants to achieve at the Internet Archive.
From: Amit D.
Sent: Thursday, July 21, 2022 4:27:25 PM
Subject: Re: [tesseract-ocr/tesseract] Of two inverted top right texts one gets scanned double, the upper one disappears (Issue #3871)
After upscaling (4x)
[3871-ROI-x4] https://user-images.githubusercontent.com/13571208/180238807-43dcbfcc-ab3b-4779-9ac1-d5ca23ad1d47.png
output:
Wis - clear
I'm investigating my issue earlier spotted in https://github.com/tesseract-ocr/tesseract/pull/3141 further.
In this picture, above the text 'wis-clear' on the right, there is a text 'print'. The text 'print' disappears completely and the text 'wis-clear' is read twice.
Environment
Current Behavior:
Some inverted text on the top right disappears, other text gets scanned in twice.
There are two similar bounding boxes involved:
Expected Behavior:
Clearly readable text should be recognized without failure.
Suggested Fix: