Open MerlijnWajer opened 3 years ago
The standard models were trained with long lines of text, so maybe they simply did not teach Tesseract what very short text lines (like page numbers) look like. This can be tested with new models like frak2021 which were at least trained with some line numbers and other short lines.
I tried frak2021 with some of the test images and had the same issue: none of the page numbers was recognized.
Binarisation is not the cause, even if it does not look optimal when I output tessinput.tif with command line option -c tessedit_write_images=true:
With default --psm and -l eng for sim_architectural-record_1931-12_70_6_0027.jpg:
A CONGREGATIONAL CHURCH IN
Fuermann and Sons
Outward thrust of roof trusses is counteracted by steel struts in the walls with steel rods going
under the floor slab between struts, thus effecting a saving in masonry.
With --psm 6 the headline is missing and one speckle is added as a comma:
[52 lines of noise from the images]
Fuermann and Sons
, Outward thrust of roof trusses is counteracted by steel struts in the walls with steel rods going
under the floor slab between struts, thus effecting a saving in masonry.
416 DECEMBER, 1931
The default psm detects "something" in sim_architectural-record_1931-12_70_6_0027.psmno.hocr, but it looks like a bug and generates empty lines:
<div class='ocr_carea' id='block_1_4' title="bbox 0 0 3356 4750">
<p class='ocr_par' id='par_1_4' lang='eng' title="bbox 0 0 3356 4750">
<span class='ocr_line' id='line_1_5' title="bbox 0 0 3356 4750; baseline 0 0; x_size 2375; x_descenders -1187.5; x_ascenders 1187.5">
<span class='ocrx_word' id='word_1_37' title='bbox 0 0 3356 4750; x_wconf 95'> </span>
</span>
</p>
</div>
It's not the model. eng and deu from tessdata_best are very fine for modern texts after ~1900.
It clearly is caused by the weak page segmentation. Seems to be confused by the footer area not in line with the vertical borders of the body. IMHO page segmentation needs a complete redesign and should be trainable from ground truth layout patterns. It also should recognise and mark different zones like text, tables, captions, header, footer, halftone, drawings, decorative separators etc.
There are two work-arounds:
Run Tesseract once with the default psm and a second time with psm 6. Then merge the results.

This is a small single line image which I made from the examples and which also shows the issue:
Tesseract has no problems with this image if the black border on top or bottom is removed.
This is a small single line image which I made from the examples and which also shows the issue:
Interesting. With this image tessinput.tif is negated. But that's not the bug. With a negated input image and the then-white bars cropped away the result is perfect. Cropping one bar away gives only the page number as result.
Seems the region classification (text area detection) has a serious problem. I guess it is full of opinionated default options and thresholds.
Tried to boil it down a little bit with the image https://archive.org/~merlijn/tesseract-pagenumbers/sim_biblical-theology-bulletin_spring-1990_20_1_0004.jpg. It has good quality, so the influence of bad image quality can be excluded.
First tried with different modes of --psm. Only --psm 6 detects the page number, but has some noise coming from the dark scan border.
Good recognition but no page number: psm 1, 3, 4, 11, 12.
Just noise: 5.
Recognises multiline initial M (of "Many") and keeps any: 11, 12.
Then tried a cut-out of the page number alone:
Recognises page number: psm 6, 7, 10.
Empty result: 4, 11, 12.
Result o: 5.
Result ee ee ee (Diplopia?): 8, 13.
--psm 12 of the page number only gives some messages:
$ tesseract biblical.page.jpg biblical.page.psm12 --psm 12 --tessdata-dir /usr/local/share/tessdata txt hocr
Tesseract Open Source OCR Engine v5.0.0-alpha-773-gd33ed with Leptonica
Too few characters. Skipping this page
OSD: Weak margin (0.00) for 1 blob text block, but using orientation anyway: 0
Too few characters. Skipping this page
OSD: Weak margin (0.00) for 1 blob text block, but using orientation anyway: 0
This means different treatment by psm, which is intended, but wrongly ignores small text.
Maybe it's related to PSM_COL_FIND_ENABLED:
$ grep -R PSM_COL_FIND_ENABLED src
src/textord/colfind.cpp: if (!PSM_COL_FIND_ENABLED(pageseg_mode)) {
src/ccmain/pagesegmain.cpp: if (!PSM_COL_FIND_ENABLED(pageseg_mode) &&
src/ccmain/pagesegmain.cpp: * If !PSM_COL_FIND_ENABLED(pageseg_mode), then no attempt is made to divide
src/ccmain/pagesegmain.cpp: if (!PSM_COL_FIND_ENABLED(pageseg_mode)) v_lines.clear();
$ grep -R PSM_COL_FIND_ENABLED include
include/tesseract/publictypes.h:inline bool PSM_COL_FIND_ENABLED(int pageseg_mode) {
PSM_COL_FIND_ENABLED means psm = 1 .. 3.
PSM_SPARSE means psm = 11 .. 12.
PSM_BLOCK_FIND_ENABLED means psm = 1 .. 4.
This https://github.com/tesseract-ocr/tesseract/blob/0fb170b994e59aae4572cde4cd0562c869e4713a/src/ccmain/pagesegmain.cpp#L137 excludes --psm 6:
if (PSM_OSD_ENABLED(pageseg_mode) || PSM_BLOCK_FIND_ENABLED(pageseg_mode) ||
    PSM_SPARSE(pageseg_mode)) {
  auto_page_seg_ret_val =
      AutoPageSeg(pageseg_mode, blocks, &to_blocks,
                  enable_noise_removal ? &diacritic_blobs : nullptr, osd_tess, osr);
  if (pageseg_mode == PSM_OSD_ONLY) {
    return auto_page_seg_ret_val;
  }
But disabling enable_noise_removal has no effect.
Must be elsewhere.
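The mode ranges and the AutoPageSeg condition quoted above can be checked with a small stand-alone model. This is only a sketch in Python, not Tesseract code: the first three predicates mirror the ranges listed in this comment, while psm_osd_enabled (modes 0, 1 and 12) is written from memory of publictypes.h and should be treated as an assumption.

```python
def psm_col_find_enabled(psm: int) -> bool:
    # Range taken from this thread: psm = 1 .. 3.
    return 1 <= psm <= 3


def psm_block_find_enabled(psm: int) -> bool:
    # Range taken from this thread: psm = 1 .. 4.
    return 1 <= psm <= 4


def psm_sparse(psm: int) -> bool:
    # Range taken from this thread: psm = 11 .. 12.
    return psm in (11, 12)


def psm_osd_enabled(psm: int) -> bool:
    # ASSUMPTION: OSD is enabled for modes 0, 1 and 12; not stated in the thread.
    return psm <= 1 or psm == 12


def runs_auto_page_seg(psm: int) -> bool:
    # Mirrors the quoted condition: OSD, block finding or sparse text.
    return psm_osd_enabled(psm) or psm_block_find_enabled(psm) or psm_sparse(psm)


# Modes that bypass AutoPageSeg under this model.
skipping_modes = [psm for psm in range(14) if not runs_auto_page_seg(psm)]
print(skipping_modes)
```

Under this model, the modes that recognised the page-number cut-out (6, 7, 10) all bypass AutoPageSeg, which fits the suspicion that the loss happens inside the automatic segmentation.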
Here is another case that might help: https://archive.org/~merlijn/tesseract-pagenumbers/effectivenessoft00rick_0033.jpg
The entire line "Insert Table 6 about here" is missing. Presumably because it is below a (detected) line separator, perhaps?
This happens a lot with scanned major papers and thesis materials on my campus, for example the page number here. We have a simple script that uses OpenCV to pick up missed regions but it would be great if there was a flag or command line parameter to get Tesseract to target this kind of thing directly.
Thanks for this nice script. Do you think that the same processing could be added to Tesseract, but based on Leptonica instead of OpenCV and using C++ of course?
This might be possible, I have done a little C++ work with Leptonica on newspaper layouts, and I have been flummoxed by page numbers there too, but I am sure it's more a matter of fine-tuning. It's a bit beyond my usual tinkering, but I will take a look.
As near as I can tell, Leptonica is used for identifying photos in images in Tesseract and some data functions, but the segmentation is based on a custom connected components approach. It's possible to get a bit more by reducing kMinMediumSizeRatio in blobbox.cpp. For example, changing it from 0.25 to 0.1 will pick up @MerlijnWajer's "Insert Table 6 about here" text, but it's some intricate plumbing to modify beyond that. Post-processing might actually be easier and opens the door to leveraging the confidence value for deciding what's cruft and what's valuable.
@MerlijnWajer
Here is another case that might help: https://archive.org/~merlijn/tesseract-pagenumbers/effectivenessoft00rick_0033.jpg
The entire line "Insert Table 6 about here" is missing. Presumably because it is below a (detected) line separator, perhaps?
In this case it works with --psm 6:
$ tesseract effectivenessoft00rick_0033.jpg - --psm 6 --tessdata-dir /usr/local/share/tessdata
Warning: Invalid resolution 0 dpi. Using 70 instead.
Ricker 13.
their camp time. This was found to be significant at the .001
level of confidence.
Insert Table 6 about here
Comparisons were also made between the ex-camper and
[...]
But I would expect that Tesseract outputs the line separators, because they are characters.
It should not be necessary to play around with --psm or other command line options.
@stweil
Thanks for this nice script. Do you think that the same processing could be added to Tesseract, but based on Leptonica instead of OpenCV and using C++ of course?
IMHO it's not necessary to code this logic because it's already there, but the thresholds are wrong or not adaptive to different layouts.
As near as I can tell, Leptonica is used for identifying photos in images in Tesseract and some data functions, but the segmentation is based on a custom connected components approach. It's possible to get a bit more by reducing kMinMediumSizeRatio in blobbox.cpp. For example, changing it from 0.25 to 0.1 will pick up @MerlijnWajer's "Insert Table 6 about here" text, but it's some intricate plumbing to modify beyond that. Post-processing might actually be easier and opens the door to leveraging the confidence value for deciding what's cruft and what's valuable.
I will toy around a bit with this number. I wonder if it ought to be based on DPI somehow (maybe it is, with the line_size).
For example, if I take the sim_architectural-record_1931-12_70_6_0027 image and resize it to 1000x1385 px (from 3432x4752 px), tesseract 4.1.1 (and also 5.0.0-rc1) picks up the page number without problems. This also seems to suggest that @wollmers was on to something when he suggested to downscale images to ~300 DPI for better accuracy.
The same holds for the https://archive.org/~merlijn/tesseract-pagenumbers/effectivenessoft00rick_0033.jpg image: if I resize it to 1000x1339 px, it finds both the page number and the "Insert Table 6 about here" text with the default psm.
It doesn't seem to necessarily work for the other images though, so maybe I got excited a bit too quickly. :-)
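For anyone wanting to repeat the downscaling experiment, the target dimensions can be computed as below. This is a minimal sketch: the helper names are made up for illustration, and the source DPI has to be supplied by the caller because the thread does not state the DPI of the scans.

```python
def scaled_size(width: int, height: int, target_width: int) -> tuple:
    """Scale to a target width while preserving the aspect ratio."""
    scale = target_width / width
    return target_width, round(height * scale)


def size_for_dpi(width: int, height: int, source_dpi: float,
                 target_dpi: float = 300.0) -> tuple:
    """Scale so that an image scanned at source_dpi ends up at target_dpi."""
    scale = target_dpi / source_dpi
    return round(width * scale), round(height * scale)


# The resize reported above: 3432x4752 px down to a width of 1000 px.
print(scaled_size(3432, 4752, 1000))

# HYPOTHETICAL source DPI of 600, purely to show the call.
print(size_for_dpi(3432, 4752, source_dpi=600))
```

The resulting size can then be passed to any image tool for the actual resampling.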
I made some progress. It looks like in the example of https://archive.org/~merlijn/tesseract-pagenumbers/sim_biblical-theology-bulletin_spring-1990_20_1_0004.jpg with default segmentation the '3' is never accepted as a blob that is text. I found out that this is the case because the blob disappears after the call to TidyBlobs().
In particular, the call to block->DeleteUnownedNoise(); is what removes the '3'. Commenting out that call makes the '3' visible in the blobs, but of course this doesn't help since it wasn't detected by the column finder, per the hint in that function:
// Deletes noise blobs from all lists where not owned by a ColPartition.
So somehow it is not picked up by the page segmentation. The "Image blobs" window shows this:
void TO_BLOCK::plot_noise_blobs(ScrollView *win) {
  BLOBNBOX::PlotNoiseBlobs(&noise_blobs, ScrollView::RED, ScrollView::RED, win);
  BLOBNBOX::PlotNoiseBlobs(&small_blobs, ScrollView::RED, ScrollView::RED, win);
  BLOBNBOX::PlotNoiseBlobs(&large_blobs, ScrollView::RED, ScrollView::RED, win);
  BLOBNBOX::PlotNoiseBlobs(&blobs, ScrollView::RED, ScrollView::RED, win);
}
And:
void BLOBNBOX::PlotNoiseBlobs(BLOBNBOX_LIST *list, ScrollView::Color body_colour,
                              ScrollView::Color child_colour, ScrollView *win) {
  BLOBNBOX_IT it(list);
  for (it.mark_cycle_pt(); !it.cycled_list(); it.forward()) {
    BLOBNBOX *blob = it.data();
    if (blob->DeletableNoise()) {
      blob->plot(win, body_colour, child_colour);
    }
  }
}
So red indicates it is seen as noise, indeed. Deletable noise looks as follows:
bool DeletableNoise() const {
  return owner() == nullptr && region_type() == BRT_NOISE;
}
So this means the '3' both doesn't have an "owner" (presumably the ColPartition mentioned earlier) and also has region type noise. There are several places in the code that set the BRT_NOISE value, so going over those seems like the next logical step.
Changing this code here:
https://github.com/tesseract-ocr/tesseract/blob/main/src/textord/colpartition.cpp#L1308
if (flow_ == BTFT_NEIGHBOURS) {
  // Check for noisy neighbours.
  if (noisy_count >= blob_count) {
    flow_ = BTFT_NONTEXT;
    blob_type_ = BRT_NOISE;
  }
}
To this:
if (flow_ == BTFT_NEIGHBOURS) {
  // Check for noisy neighbours.
  if (noisy_count >= blob_count) {
    flow_ = BTFT_NONTEXT;
    blob_type_ = BRT_NOISE;
  } else {
    flow_ = BTFT_STRONG_CHAIN;
    blob_type_ = BRT_TEXT;
  }
}
makes the sim_biblical-theology-bulletin_spring-1990_20_1_0004.jpg example work, as well as the dumbed down version from @wollmers here: https://user-images.githubusercontent.com/1275557/130783964-38b8ce13-ec0f-4f23-94e2-d4ef511c149c.jpg
Of course, this is not a real fix, as it hampers real noise detection. Setting it to BTFT_CHAIN is not enough, so clearly something is just discarding these entirely as not enough to qualify as a real "column". It doesn't help with this example here: https://archive.org/~merlijn/tesseract-pagenumbers/effectivenessoft00rick_0033.jpg, suggesting that the issue of "Insert Table 6 about here" not being found is unrelated to this (detected as noise?) issue and is rather another segmentation problem (evident from the fact that I reported it to work when the image is downscaled somewhat).
So my understanding is that the value ("score", or "textlineliness") of the page numbers is very low. They do not pass kMinChainTextValue and definitely not kMinStrongTextValue, and the strong_score will also be particularly low (in ColPartition::SetRegionAndFlowTypesFromProjectionValue).
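To make the threshold cascade easier to reason about, here is a simplified model of it in Python. It is not the actual ColPartition code: the ordering is my reading of the description above, the noisy-neighbour branch copies the snippet quoted earlier, and the default threshold values (3 and 6) are placeholders for kMinChainTextValue and kMinStrongTextValue that should be checked against colpartition.cpp.

```python
def classify_partition(value: int, noisy_count: int, blob_count: int,
                       min_chain: int = 3, min_strong: int = 6):
    """Return (flow type, blob type) for a partition's textline score.

    min_chain and min_strong are HYPOTHETICAL stand-ins for the real
    constants. A blob type of None means "left unchanged" in this model.
    """
    if value >= min_strong:
        return ("BTFT_STRONG_CHAIN", "BRT_TEXT")
    if value >= min_chain:
        return ("BTFT_CHAIN", "BRT_TEXT")
    # Low score: the partition only has "neighbours" status, so the
    # noisy-neighbour check from the quoted snippet applies.
    if noisy_count >= blob_count:
        return ("BTFT_NONTEXT", "BRT_NOISE")
    return ("BTFT_NEIGHBOURS", None)


# An isolated single-blob page number with a low score and a noisy neighbour.
print(classify_partition(value=1, noisy_count=1, blob_count=1))
```

In this model a lone page number can only avoid the noise label by scoring above the chain threshold, which matches the observation that it survives when it shares a line with other text.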
The "textlineliness" value seems to come from TextlineProjection::EvaluateColPartition. All of this makes me think that the segmentation code in general doesn't seem to particularly like page numbers or small parts of floating text (which I suppose makes sense), but then I'm left wondering in particular what it is that makes it work in other cases. Maybe it works OK when page numbers align with some other text, making it a strong "text line"?
Looks like it, see this example where it finds the page number on the original image, but with the rest of the title (on the same "line" as the page number) removed, it no longer finds it:
$ tesseract ../pagenumbers-bug/sim_canadian-medical-association-journal_1963-03-16_88_11_0006.png - | grep 549
Estimating resolution as 474
Detected 27 diacritics
LisTERIA MONOCYTOGENES INFECTIONS 549
x $ tesseract ../pagenumbers-bug/sim_canadian-medical-association-journal_1963-03-16_88_11_0006_titlecrop.png - | grep 549
Detected 27 diacritics
(above images can be found here: https://archive.org/~merlijn/tesseract-pagenumbers/works-with-title/)
Some more work might need to be put into understanding this further, but I am not sure how to continue:
In sim_biblical-theology-bulletin_spring-1990_20_1_0004.jpg there are two other (to me clearly noise) blobs that get OCR'd with the above change, and the per-character OCR confidence in those is higher than in the page number.
Maybe other examples where it works just fine can also provide some more hints as to how the author of the code assumed this was supposed to work. To me it looks like it really wants text to be part of a column/partition of a clear text line, and otherwise it will discard it.
Maybe a start could be to draw a rectangle around things that are clearly detected as columns, and optionally accept noise not contained within those columns / text regions. That could be a pass that runs after the column detection.
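A rough sketch of that extra pass, in Python on plain bounding boxes rather than on Tesseract's internal structures. The function names and the min_side speckle filter are assumptions made for illustration; boxes are (x0, y0, x1, y1) tuples.

```python
def contains(outer, inner) -> bool:
    """True if box `inner` lies entirely within box `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def candidates_outside_columns(columns, noise_blobs, min_side: int = 8):
    """Keep noise blobs that are not contained in any detected column.

    min_side is a HYPOTHETICAL speckle filter: blobs whose width or
    height is below it are dropped regardless of position.
    """
    kept = []
    for blob in noise_blobs:
        width, height = blob[2] - blob[0], blob[3] - blob[1]
        if min(width, height) < min_side:
            continue  # too small to be a character
        if any(contains(col, blob) for col in columns):
            continue  # inside a text region: leave the decision to the column finder
        kept.append(blob)
    return kept


columns = [(100, 200, 1000, 3000)]
noise = [(3060, 456, 3145, 500), (300, 400, 310, 410), (5, 5, 8, 8)]
print(candidates_outside_columns(columns, noise))
```

The surviving boxes would still need to be recognised and filtered, since scan borders and specks outside the text area pass this test as well.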
@stweil @amitdo @wollmers and others - any ideas or suggestions on how to continue, given the above info/options? The basic problem is deciding what is noise and what is a page number. Currently as part of the "column finding", page numbers are typically treated as noise as they don't usually come in a column/line/region, so they don't get OCR'd in the normal (automated) psm.
I'd be happy to try to pursue some solution, I'm just not sure what would make the most sense.
@MerlijnWajer
I can only guess from my own observations and what it should do (e.g. in post-correction to detect, cut out and re-OCR candidate areas).
It's poor page segmentation which in Tesseract is a big heap of heuristic parameters and thresholds. They are not documented in a theoretical paper or a specification.
Your idea of detecting the print area (German: Satzspiegel) is obvious and used in other OCR-related tools.
A layout detection could have an "opinionated" guess about the type of page (title page, content, preface, main, index). E.g. title pages are usually not numbered and have a centered layout, and the following pages often have Roman numerals. Page numbers in most traditional documents are in the head or foot of the page, left, right or center. All are outside the main text block.
In your last example https://archive.org/~merlijn/tesseract-pagenumbers/works-with-title/sim_canadian-medical-association-journal_1963-03-16_88_11_0006_titlecrop.png I can't see what confuses Tesseract because
The only "special" property is, that it's aligned to the right edge of the text 2-column block.
The long term future (Version 7?) should be to use trained models for OLR (Optical Layout Recognition) using NNs. They are reported to reach 80% accuracy compared to 50% of ABBYY or Tesseract.
@MerlijnWajer
Now tried your example and get the page number with --psm 6:
$ tesseract sim_canadian-medical-association-journal_1963-03-16_88_11_0006_titlecrop.png titlecrop.psm6 \
-l eng --psm 6 --tessdata-dir /usr/local/share/tessdata txt pdf hocr
Tesseract Open Source OCR Engine v5.0.0-alpha-773-gd33ed with Leptonica
The content of titlecrop.psm6.txt (first lines):
549
Listeria Monocytogenes Infections in Metropolitan Toronto
A Clinicopathological Study
A. H. SEPP, M.D.* and T. E. ROY, M.D.,+ Toronto
nce the classical description of Listeria mono- [7
cytogenes (LM) by Murray, Webb and Swann
in 1926,1 much information has been collected, but ABSTRACT
The same without psm:
$ tesseract sim_canadian-medical-association-journal_1963-03-16_88_11_0006_titlecrop.png titlecrop.nopsm \
-l eng --tessdata-dir /usr/local/share/tessdata txt pdf hocr
Tesseract Open Source OCR Engine v5.0.0-alpha-773-gd33ed with Leptonica
Detected 27 diacritics
Content of titlecrop.nopsm:
Listeria Monocytogenes Infections in Metropolitan Toronto
A Clinicopathological Study
A. H. SEPP, M.D.* and T. E. ROY, M.D.,+ Toronto
GQINCE the classical description of Listeria mono-
cytogenes (LM) by Murray, Webb and Swann
in 1926, much information has been collected, but
only the last 10 years have seen a renewed interest
in this fascinating organism. In spite of the vast
and rapidly accumulating literature, human in-
fection is still widely regarded as a rarity.
Here the page number in titlecrop.psm6.hocr:
<span class='ocr_line' id='line_1_1' title="bbox 3060 456 3145 500; baseline 0 0; x_size 60.694389; x_descenders 13.544555; x_ascenders 17.736841">
<span class='ocrx_word' id='word_1_1' title='bbox 3060 456 3145 500; x_wconf 96'>549</span>
</span>
Compared to titlecrop.nopsm.hocr:
<span class='ocr_line' id='line_1_1' title="bbox 0 586 3450 868; baseline 0 -78; x_size 272.66666; x_descenders 68.166664; x_ascenders 68.166664">
<span class='ocrx_word' id='word_1_1' title='bbox 0 586 549 790; x_wconf 95'> </span>
<span class='ocrx_word' id='word_1_2' title='bbox 3006 586 3450 868; x_wconf 95'> </span>
</span>
Disclaimer: I used my oldish version. You should retry it with a newer or the latest one.
IMHO you should compare the settings of the different psm modes in the source code. I did this some time ago but can't find the thread. It's here under issues. Look also into the hOCR and compare the difference of the bounding boxes. Maybe the "good" settings of --psm 6 can be merged into the default psm and vice versa.
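For comparing the bounding boxes, the hOCR title attributes can be pulled apart with a couple of regular expressions. A minimal sketch using only the Python standard library; the two title strings are copied from the hOCR excerpts above, and the helper names are made up.

```python
import re

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
WCONF_RE = re.compile(r"x_wconf (\d+)")


def parse_title(title: str) -> dict:
    """Extract the bbox and word confidence from an hOCR title attribute."""
    bbox_match = BBOX_RE.search(title)
    conf_match = WCONF_RE.search(title)
    return {
        "bbox": tuple(int(g) for g in bbox_match.groups()) if bbox_match else None,
        "x_wconf": int(conf_match.group(1)) if conf_match else None,
    }


def bbox_area(bbox) -> int:
    """Area of an (x0, y0, x1, y1) box in square pixels."""
    return (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])


psm6_word = parse_title("bbox 3060 456 3145 500; x_wconf 96")
default_word = parse_title("bbox 3006 586 3450 868; x_wconf 95")
print(psm6_word, bbox_area(psm6_word["bbox"]))
print(default_word, bbox_area(default_word["bbox"]))
```

The empty word from the default run covers a far larger area than the "549" word found with --psm 6, so a size check on empty words is one cheap way to flag such suspicious regions.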
I don't think that --psm 6 does column finding (or even a good job at segmenting complex pages), so I discounted that as a solution to the problem that the automated segmentation has with page numbers. The reading order and proper column detection are very important for my/our OCR process, so we cannot sacrifice that. I don't think the constants/settings matter here; as described above, it's the column finding that disregards the page numbers.
any ideas or suggestions on how to continue, given the above info/options?
I don't have any suggestion.
It's hard to improve Tesseract's layout analysis code since it is very complex. As you can see, you can easily improve it for some documents but make it worse for other documents. Thus, as a first step I think we need:
1. A good tool to evaluate page segmentation output.
2. A diverse dataset of images with accurately labeled page segmentation info.
3. Implement [Feature request: Proper output for PSM 2 #3749](https://github.com/tesseract-ocr/tesseract/issues/3749).
I think all of these points are valuable, but there is one idea I'd like to throw out there...
I was wondering if it makes sense to have the noise in the results, but just marked with the appropriate poly block type (PT_NOISE). (We would have to update the renderers) This would allow users to enable the output of noise and perform any filtering themselves. It doesn't necessarily solve the shortcomings in the page segmentation code, but it at least gives the users an easy way to get access to the data that otherwise would be thrown away.
For example, in this case the majority of the noise can easily be filtered once recognized (OCR'd), by simply only accepting the Latin numeral characters with a certain confidence.
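Such a user-side filter could be as small as the sketch below. It assumes the recognised noise arrives as (text, confidence) pairs, reads "Latin numeral characters" as the digits 0-9, and uses a made-up confidence threshold of 80; all three are assumptions, not anything Tesseract provides today.

```python
import re


def looks_like_page_number(text: str, conf: int, min_conf: int = 80) -> bool:
    """Accept only digit-only strings recognised with enough confidence.

    min_conf is a HYPOTHETICAL threshold and would need tuning per collection.
    """
    return re.fullmatch(r"[0-9]+", text.strip()) is not None and conf >= min_conf


def filter_noise_words(words, min_conf: int = 80):
    """Keep the (text, confidence) pairs that look like page numbers."""
    return [(text, conf) for text, conf in words
            if looks_like_page_number(text, conf, min_conf)]


recognised_noise = [("549", 96), (",", 40), ("ee", 90), ("3", 55)]
print(filter_noise_words(recognised_noise))
```

Roman numerals on front-matter pages would need an extra pattern, and a low-confidence digit such as the ("3", 55) entry shows why the threshold cannot be too strict.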
I tried to take a stab at implementing this, but got lost in the code. I suppose it decides to make columns out of the block somewhere, and just only passes those along, and this is where the noise ultimately gets "lost". There are many places that attempt to delete noise.
colfind.h has some info on the process, but most of the documentation seems to require intimate knowledge of much of the segmentation code.
Just following up, I implemented something along the lines of what @artunit implemented; see the patch below. It's clearly a hack, but it seems to work well for my purposes.
The idea is to just blank out any regions already analysed, and then use a different page segmentation mode to find the remaining parts. In this case, any extra details are written as a separate 'page', so I came up with this tool to merge the two back into one hOCR page: https://github.com/internetarchive/archive-hocr-tools/blob/master/bin/hocr-flatten-pages
The additional downside here can be that the reading order is not entirely preserved, but it is performant and seems to pick up page numbers much more frequently.
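The merge step can be pictured with the following sketch. It is not what hocr-flatten-pages does internally (I am only illustrating the idea): items are plain dicts with a "text" and a "bbox" key, and anything from the second pass that overlaps a first-pass box is dropped.

```python
def overlaps(a, b) -> bool:
    """True if two (x0, y0, x1, y1) boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def merge_passes(first_pass, second_pass):
    """Append second-pass items that do not overlap any first-pass item.

    The first pass keeps its own (column-aware) order; extras are appended
    top-to-bottom, so the overall reading order is only approximate.
    """
    extras = [item for item in second_pass
              if not any(overlaps(item["bbox"], kept["bbox"]) for kept in first_pass)]
    extras.sort(key=lambda item: (item["bbox"][1], item["bbox"][0]))
    return list(first_pass) + extras


first = [{"text": "Listeria Monocytogenes Infections", "bbox": (0, 586, 3450, 868)}]
second = [{"text": "549", "bbox": (3060, 456, 3145, 500)},
          {"text": "duplicate", "bbox": (10, 600, 100, 700)}]
print([item["text"] for item in merge_passes(first, second)])
```

Appending the extras at the end keeps the column order of the first pass intact, at the cost of page numbers in the header showing up after the body text.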
From acc00bb647540c424fabad6a98d37fa677008ef2 Mon Sep 17 00:00:00 2001
From: Merlijn Wajer <merlijn@wizzup.org>
Date: Mon, 9 Jan 2023 17:24:23 +0100
Subject: [PATCH] tesseract: perform double analysis on each image
This change will cause the main tesseract program to analyse every image
twice. Once with the given page segmentation mode, and then once with a
single block page segmentation mode. The second run runs on a modified
image where any earlier blocks are turned black, causing Tesseract to
skip them for the second analysis.
Currently two pages are output for a single image, so this is clearly a
hack, but it's not as computationally intensive as running two full
runs.
(In fact, it might add as little as ~10% overhead, depending on the
input image)
WARNING: This will probably break weird non-filepath file input patterns
like "-" for stdin, or things that resolve using libcurl.
---
src/tesseract.cpp | 36 +++++++++++++++++++++++++++++++++++-
1 file changed, 35 insertions(+), 1 deletion(-)
diff --git a/src/tesseract.cpp b/src/tesseract.cpp
index e0697aa7..0bde1cdc 100644
--- a/src/tesseract.cpp
+++ b/src/tesseract.cpp
@@ -825,7 +825,41 @@ int main(int argc, char **argv) {
fprintf(stderr, "%s", osd_warning.c_str());
}
#endif
- bool succeed = api.ProcessPages(image, nullptr, 0, renderers[0].get());
+
+
+ Pix *pix = pixRead(image);
+ auto renderer = renderers[0].get();
+ renderer->BeginDocument("TODO");
+ //document_title.c_str());
+
+ bool succeed = api.ProcessPage(pix, 0, image, NULL, 0, renderers[0].get());
+ //bool succeed = api.ProcessPages(image, nullptr, 0, renderers[0].get());
+
+ {
+ Boxa* default_boxes = api.GetComponentImages(tesseract::RIL_BLOCK, true, nullptr, nullptr);
+
+ //pixWrite("/tmp/out.png", pix, IFF_PNG);
+ //Pix *newpix = pixPaintBoxa(pix, default_boxes, 0);
+ Pix *newpix = pixSetBlackOrWhiteBoxa(pix, default_boxes, L_SET_BLACK);
+ //pixWrite("/tmp/out_boxes.png", newpix, IFF_PNG);
+
+ api.SetPageSegMode(PSM_SINGLE_BLOCK);
+ //api.SetPageSegMode(PSM_SPARSE_TEXT);
+ api.SetImage(newpix);
+
+ api.Recognize(NULL);
+
+ // TODO: error handling
+ renderer->AddImage(&api);
+
+ boxaDestroy(&default_boxes);
+ pixDestroy(&newpix);
+ }
+
+ pixDestroy(&pix);
+
+ renderer->EndDocument();
+
if (!succeed) {
fprintf(stderr, "Error during processing.\n");
ret_val = EXIT_FAILURE;
--
GitLab
@MerlijnWajer As I understand, you blank out the regions in the image, which is some sort of hack.
Another approach would be, to use the data structure of the page and resegment the empty regions. Needs some method to calculate this "empty region", then virtually cut them out and OCR.
Merging could also be done in the data structure. If you know Page-XML they have IDs for every block (paragraph, line etc.), which just need to be unique, and another attribute for reading order. Thus on insert of a new element, only the reading order needs renumbering. This would maybe be a massive change of data structures in Tesseract. But thanks for addressing it. Now I know what I should add as attribute in my internal data structure for post-correction.
On document level you can learn the positions of page numbers (footer or header, left, center, right) and search in these regions, if there is "something" in the image. Also possible for single pages but less accurate.
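A sketch of the document-level idea, in Python on bounding boxes only. The zone grid (top or bottom half, three horizontal thirds), the 10% search band and the page size in the example are all assumptions chosen for illustration.

```python
from collections import Counter

HORIZONTAL = ("left", "center", "right")


def zone_of(bbox, page_w: int, page_h: int):
    """Classify a box centre as (header|footer, left|center|right)."""
    cx = (bbox[0] + bbox[2]) / 2
    cy = (bbox[1] + bbox[3]) / 2
    vertical = "header" if cy < page_h / 2 else "footer"
    horizontal = HORIZONTAL[min(2, int(3 * cx / page_w))]
    return vertical, horizontal


def most_common_zone(observations):
    """observations: (bbox, page_w, page_h) for pages where a number was found."""
    counts = Counter(zone_of(bbox, w, h) for bbox, w, h in observations)
    return counts.most_common(1)[0][0] if counts else None


def search_region(zone, page_w: int, page_h: int, band: float = 0.1):
    """Box to re-OCR on a page where no number was found (band is HYPOTHETICAL)."""
    vertical, horizontal = zone
    index = HORIZONTAL.index(horizontal)
    x0, x1 = round(index * page_w / 3), round((index + 1) * page_w / 3)
    if vertical == "header":
        return (x0, 0, x1, round(page_h * band))
    return (x0, round(page_h * (1 - band)), x1, page_h)


found = [((3060, 456, 3145, 500), 3450, 4800),
         ((3050, 450, 3140, 498), 3450, 4800),
         ((1600, 4600, 1700, 4650), 3450, 4800)]
zone = most_common_zone(found)
print(zone, search_region(zone, 3450, 4800))
```

Since page numbers often alternate between left and right on facing pages, it would probably make sense to learn the zone separately for odd and even pages.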
If you find a way to do some of this, I'd love to hear it. I've tried to fix/change the page segmentation code and I got stuck numerous times, so I'm not sure if I'll be able to contribute a better solution.
Environment
4.1.1, 5.0.0 v20201231
Current Behavior:
In some cases, Tesseract's fully automatic page segmentation does not pick up page numbers that are quite visible. Here is an example (as hOCR result viewer):
https://archive.org/services/hocr-view/view?identifier=sim_architectural-record_1931-12_70_6&pageno=27
I have taken the liberty of hosting some images with this problem here (and can attempt to surface more if that is helpful):
https://archive.org/~merlijn/tesseract-pagenumbers/
I don't believe that the problem is related to binarisation. I've tried to run the Java viewer (https://tesseract-ocr.github.io/tessdoc/ViewerDebugging.html) to look at the results, but didn't become much wiser, as it simply shows the page numbers not being picked up.
Expected Behavior:
Tesseract picks up the page number as well.