hnesk / browse-ocrd

An extensible viewer for OCR-D mets.xml files
MIT License

add OCR alignment and difference view #13

Closed · bertsky closed this 3 years ago

bertsky commented 4 years ago

This is clearly a desideratum here, but how do we approach it?

Considerations:

  1. The additional view would need 2 FileGroupSelectors instead of 1
  2. There are 2 cases:
    • A: equal segmentation but different recognition results: character alignment and difference highlighting within lines only
    • B: different segmentation and recognition results: textline alignment and difference highlighting within larger chunks
  3. The actual alignment code needs to be fast and reliable. The underlying problem of global sequence alignment (Needleman-Wunsch algorithm) is O(n²) (or O(n³) under arbitrary weights). There are many packages for this on PyPI, with various levels of features (including cost functions or weights) and efficiency (including C library backends). But not all of them are
    • suited for Unicode (or arbitrary lists of objects),
    • robust (both against crashes and glitches on strange input, and against heap/stack restrictions),
    • actually efficient (in terms of average-case or best-case complexity),
    • well maintained and packaged.
  4. For historical text specifically, one must treat grapheme clusters as single objects to compare, and probably normalize certain sequences (or at least reduce their distance/cost to that of the normalized equivalent), e.g. a base character plus combining diaeresis vs precomposed ä, the ligature ﬅ vs ſt, or even ſ vs s. (A minimal sketch of grapheme-level alignment follows after this list.)
  5. It would therefore seem natural to delegate to one of the existing OCR-D processors for OCR evaluation (or their backend library modules), i.e. ocrd-dinglehopper and ocrd-cor-asv-ann-evaluate, which have quite a few differences:
    | ocrd-dinglehopper | ocrd-cor-asv-ann-evaluate |
    | --- | --- |
    | CER and WER, plus visualization | only CER (currently) |
    | only single pages | aggregates over all pages |
    | result is HTML with visual diff + JSON report | result is logging |
    | alignment written in Python (slow) | difflib.SequenceMatcher (fast; I tried many libraries on lots of data for robustness and speed, and consequently decided to revert to that) |
    | uniseg.graphemeclusters to get alignment + distances on graphemes (lists of objects) | calculates alignment on codepoints (faster), but then post-processes to join combining sequences with their base character, so distances are almost always on graphemes as well |
    | a set of normalizations that (roughly) target OCR-D GT transcription guidelines level 3 to level 2 (which is laudable) | offers plain Levenshtein for GT level 3, NFC/NFKC/NFKD/NFD for GT level 2, and a custom normalization (called historic_latin) that targets GT level 1 (because NFKC is both quite incomplete and too much already) |
    | text alignment of complete page text, concatenated (suitable for A or B) | text alignment on identical textlines (suitable for B only) |
    | compares 1:1 | compares 1:N |
  6. Whatever module we choose, and whatever method we use to integrate its core functionality (without the actual OCR-D processor), we need to visualise the difference with Gtk facilities. For GtkSource.LanguageManager, an off-the-shelf highlighter that lends itself to this is diff (coloring diff -u line output). But this does not colorize within the lines (like git diff --word-diff, wdiff, dwdiff etc. do), which is the most important contribution IMHO. So perhaps we need to use some existing word-diff syntax and write our own highlighter after all. Or we integrate dinglehopper's HTML and display it via WebKit directly.
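
For item 4, a minimal sketch of what grapheme-level alignment could look like, assuming the uniseg package for grapheme segmentation and the standard-library difflib.SequenceMatcher as the (fast, but approximate) aligner; the normalization table here is only an illustrative placeholder, not the actual rules of either processor:

```python
import unicodedata
from difflib import SequenceMatcher

# grapheme_clusters() comes from the third-party uniseg package
from uniseg.graphemecluster import grapheme_clusters

# Illustrative normalization only; real GT-level rules are far more extensive.
NORMALIZATIONS = str.maketrans({'ſ': 's'})

def graphemes(text: str) -> list:
    """Split text into grapheme clusters after canonical (NFC) normalization."""
    return list(grapheme_clusters(unicodedata.normalize('NFC', text)))

def align(gt: str, ocr: str):
    """Yield (op, gt_chunk, ocr_chunk) tuples, aligned on grapheme clusters.

    Comparison happens on normalized graphemes, but the original text is
    returned, so a viewer can display it unchanged.
    """
    a, b = graphemes(gt), graphemes(ocr)
    a_norm = [g.translate(NORMALIZATIONS) for g in a]
    b_norm = [g.translate(NORMALIZATIONS) for g in b]
    matcher = SequenceMatcher(None, a_norm, b_norm, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        yield op, ''.join(a[i1:i2]), ''.join(b[j1:j2])

# e.g. align('Straſſe', 'Strasse') yields a single ('equal', 'Straſſe', 'Strasse')
# chunk, because the long s is normalized away before comparison.
```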
bertsky commented 4 years ago
Or we integrate dinglehopper's HTML and display it via WebKit directly.

…is what #25 brought. Still, creating comparisons on the fly (without the need to run ocrd-dinglehopper on the complete workspace) would be preferable IMHO. And when it is clear that both sides have the same line segmentation, a simple diff highlighter might still be better. So let's keep this open for discussion etc.
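
For the record, an in-line diff highlighter does not necessarily need a GtkSource language definition: plain Gtk.TextBuffer tags already allow colorizing within lines. A rough sketch, assuming GTK 3 via PyGObject and (op, old, new) chunks like those from the alignment sketch above; tag names and colors are arbitrary:

```python
import gi
gi.require_version('Gtk', '3.0')
from gi.repository import Gtk

def ensure_tags(buffer: Gtk.TextBuffer) -> None:
    """Create the highlight tags once (colors are arbitrary placeholders)."""
    if buffer.get_tag_table().lookup('diff-delete') is None:
        buffer.create_tag('diff-delete', background='#ffdddd', strikethrough=True)
        buffer.create_tag('diff-insert', background='#ddffdd')

def render_diff(buffer: Gtk.TextBuffer, segments) -> None:
    """Write (op, old, new) chunks into the buffer with word-diff-like coloring."""
    ensure_tags(buffer)
    buffer.set_text('')
    for op, old, new in segments:
        if op == 'equal':
            buffer.insert(buffer.get_end_iter(), old)
            continue
        if old:  # 'delete' or 'replace': removed text, struck through
            buffer.insert_with_tags_by_name(buffer.get_end_iter(), old, 'diff-delete')
        if new:  # 'insert' or 'replace': added text, highlighted
            buffer.insert_with_tags_by_name(buffer.get_end_iter(), new, 'diff-insert')
```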

mikegerber commented 3 years ago

Still, creating comparisons on the fly (without the need to run ocrd-dinglehopper on the complete workspace) would be preferable IMHO

I haven't tested it, but it should be possible to use -g to process just one page. I also have some speed improvements planned, so I guess those should help too.

bertsky commented 3 years ago

I haven't tested it, but it should be possible to use -g to process just one page.

The problem is that we want to avoid creating new fileGrps just for viewing. We would need to re-load the workspace model (expensive), and the temporary fileGrps would have to be removed afterwards.

So we actually need some API or non-OCR-D CLI integration here – independent of METS, perhaps in-memory altogether. Even if the alignment/diff rendering is expensive, it could be cached (and perhaps calculated asynchronously, so the UI would not stall)...
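
A sketch of how that could look, assuming the align() helper sketched earlier in this thread and GLib's main loop for handing results back to the UI thread; cached_diff and diff_async are made-up names for illustration:

```python
import threading
from functools import lru_cache

from gi.repository import GLib

@lru_cache(maxsize=128)
def cached_diff(gt_text: str, ocr_text: str) -> tuple:
    """Expensive alignment, computed at most once per text pair."""
    # align() is the grapheme-level aligner sketched earlier in this thread.
    return tuple(align(gt_text, ocr_text))

def diff_async(gt_text: str, ocr_text: str, on_done) -> None:
    """Run the alignment in a worker thread, deliver the result on the UI thread."""
    def worker():
        segments = cached_diff(gt_text, ocr_text)
        # GLib.idle_add schedules on_done in the GTK main loop,
        # so the UI never blocks on the alignment itself.
        GLib.idle_add(on_done, segments)
    threading.Thread(target=worker, daemon=True).start()
```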

hnesk commented 3 years ago

There is a proof of concept in the diff-view branch. For now it simply uses the built-in Python difflib.SequenceMatcher without any notion of a possibly preexisting segmentation. The algorithm is really quite naive, but works for me. It shouldn't be too hard to wrap other algorithms to return their results in a TaggedText class, but I'd really like to extend the TaggedText/TaggedString data model first to include some more information (especially the IDs of the TextNodes) before merging.
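
To illustrate the idea (this is not the actual TaggedText API from the diff-view branch, which is more elaborate), SequenceMatcher opcodes map quite directly onto a list of tagged chunks that a view can then style:

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class TaggedChunk:
    """One aligned chunk; a stand-in for the TaggedText/TaggedString model."""
    tag: str    # 'equal', 'replace', 'delete' or 'insert'
    left: str   # text from the first fileGrp
    right: str  # text from the second fileGrp

def tag_diff(left: str, right: str) -> list:
    """Turn a pair of line texts into display-ready tagged chunks."""
    matcher = SequenceMatcher(None, left, right, autojunk=False)
    return [TaggedChunk(op, left[i1:i2], right[j1:j2])
            for op, i1, i2, j1, j2 in matcher.get_opcodes()]
```

A renderer (dinglehopper's HTML, Gtk.TextBuffer tags, or a GtkSource syntax) then only has to iterate over the chunks and color the non-'equal' ones.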

kba commented 3 years ago

Very nice, here's how that looks, comparing calamari/tesseract output from ocrd-galley:

[screenshot: side-by-side diff view of calamari vs. tesseract output]

hnesk commented 3 years ago

Closed by #29