hnesk / browse-ocrd

An extensible viewer for OCR-D mets.xml files
MIT License

add OCR alignment and difference view #13

Closed · bertsky closed this 3 years ago

bertsky commented 4 years ago

This is clearly a desideratum here, but how do we approach it?

Considerations:

  1. The additional view would need 2 FileGroupSelectors instead of 1
  2. There are 2 cases:
    • A: equal segmentation but different recognition results: character alignment and difference highlighting within lines only
    • B: different segmentation and recognition results: textline alignment and difference highlighting within larger chunks
  3. The actual alignment code needs to be fast and reliable. The underlying problem of global sequence alignment (Needleman-Wunsch algorithm) is O(n²) (or O(n³) under arbitrary weights). There are many packages for this on PyPI, with various levels of features (including cost functions or weights) and efficiency (including C library backends). But not all of them are
    • suited for Unicode (or arbitrary lists of objects),
    • robust (both against crashes and glitches on strange input, and against heap/stack restrictions),
    • actually efficient (in terms of average-case or best-case complexity),
    • well maintained and packaged.
  4. For historical text specifically, one must treat grapheme clusters as single objects to compare, and probably normalize certain sequences (or at least reduce their distance/cost to that of the normalized equivalent), e.g. a base character plus combining diaeresis vs precomposed ä, the ligature ﬅ vs ſt, or even ſ vs s. (A minimal sketch of grapheme-level alignment follows after this list.)
  5. It would therefore seem natural to delegate to one of the existing OCR-D processors for OCR evaluation (or their backend library modules), i.e. ocrd-dinglehopper and ocrd-cor-asv-ann-evaluate, which have quite a few differences:
    | ocrd-dinglehopper | ocrd-cor-asv-ann-evaluate |
    | --- | --- |
    | CER and WER, plus visualization | only CER (currently) |
    | only single pages | aggregates over all pages |
    | result is HTML with visual diff + JSON report | result is logging |
    | alignment written in Python (slow) | difflib.SequenceMatcher (fast; I tried many libraries on lots of data for robustness and speed, and consequently decided to revert to that) |
    | uniseg.graphemeclusters to get alignment + distances on graphemes (lists of objects) | calculates alignment on codepoints (faster), but then post-processes to join combining sequences with their base character, so distances are almost always on graphemes as well |
    | a set of normalizations that (roughly) target OCR-D GT transcription guidelines level 3 to level 2 (which is laudable) | offers plain Levenshtein for GT level 3, NFC/NFKC/NFKD/NFD for GT level 2, and a custom normalization (called historic_latin) that targets GT level 1 (because NFKC is both quite incomplete and too much already) |
    | text alignment of complete page text, concatenated (suitable for A or B) | text alignment on identical textlines (suitable for B only) |
    | compares 1:1 | compares 1:N |
  6. Whatever module we choose, and whatever method we use to integrate its core functionality (without the actual OCR-D processor), we need to visualise the difference with Gtk facilities. For GtkSource.LanguageManager, an off-the-shelf highlighter that lends itself to this is diff (coloring diff -u line output). But this does not colorize within the lines (like git diff --word-diff, wdiff, dwdiff etc. do), which is the most important contribution IMHO. So perhaps we need to use some existing word-diff syntax and write our own highlighter after all. Or we integrate dinglehopper's HTML and display it via WebKit directly.
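
For item 4, a minimal sketch of what grapheme-level alignment could look like, assuming the uniseg package for grapheme segmentation and the standard-library difflib.SequenceMatcher as the (fast, but approximate) aligner; the normalization table here is only an illustrative placeholder, not the actual rules of either processor:

```python
import unicodedata
from difflib import SequenceMatcher

# grapheme_clusters() comes from the third-party uniseg package
from uniseg.graphemecluster import grapheme_clusters

# Illustrative normalization only; real GT-level rules are far more extensive.
NORMALIZATIONS = str.maketrans({'ſ': 's'})

def graphemes(text: str) -> list:
    """Split text into grapheme clusters after canonical (NFC) normalization."""
    return list(grapheme_clusters(unicodedata.normalize('NFC', text)))

def align(gt: str, ocr: str):
    """Yield (op, gt_chunk, ocr_chunk) tuples, aligned on grapheme clusters.

    Comparison happens on normalized graphemes, but the original text is
    returned, so a viewer can display it unchanged.
    """
    a, b = graphemes(gt), graphemes(ocr)
    a_norm = [g.translate(NORMALIZATIONS) for g in a]
    b_norm = [g.translate(NORMALIZATIONS) for g in b]
    matcher = SequenceMatcher(None, a_norm, b_norm, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        yield op, ''.join(a[i1:i2]), ''.join(b[j1:j2])

# e.g. align('Straſſe', 'Strasse') yields a single ('equal', 'Straſſe', 'Strasse')
# chunk, because the long s is normalized away before comparison.
```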
bertsky commented 4 years ago
Or we integrate dinglehopper's HTML and display it via WebKit directly.

…is what #25 brought. Still, creating comparisons on the fly (without the need to run ocrd-dinglehopper on the complete workspace) would be preferable IMHO. And when it is clear that both sides have the same line segmentation, a simple diff highlighter might still be better. So let's keep this open for discussion etc.
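
For the record, an in-line diff highlighter does not necessarily need a GtkSource language definition: plain Gtk.TextBuffer tags already allow colorizing within lines. A rough sketch, assuming GTK 3 via PyGObject and (op, old, new) chunks like those from the alignment sketch above; tag names and colors are arbitrary:

```python
import gi
gi.require_version('Gtk', '3.0')
from gi.repository import Gtk

def ensure_tags(buffer: Gtk.TextBuffer) -> None:
    """Create the highlight tags once (colors are arbitrary placeholders)."""
    if buffer.get_tag_table().lookup('diff-delete') is None:
        buffer.create_tag('diff-delete', background='#ffdddd', strikethrough=True)
        buffer.create_tag('diff-insert', background='#ddffdd')

def render_diff(buffer: Gtk.TextBuffer, segments) -> None:
    """Write (op, old, new) chunks into the buffer with word-diff-like coloring."""
    ensure_tags(buffer)
    buffer.set_text('')
    for op, old, new in segments:
        if op == 'equal':
            buffer.insert(buffer.get_end_iter(), old)
            continue
        if old:  # 'delete' or 'replace': removed text, struck through
            buffer.insert_with_tags_by_name(buffer.get_end_iter(), old, 'diff-delete')
        if new:  # 'insert' or 'replace': added text, highlighted
            buffer.insert_with_tags_by_name(buffer.get_end_iter(), new, 'diff-insert')
```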

mikegerber commented 3 years ago

Still, creating comparisons on the fly (without the need to run ocrd-dinglehopper on the complete workspace) would be preferable IMHO

I haven't tested it, but it should be possible to use -g to process just one page. I also have some speed improvements planned, so I guess those should help too.

bertsky commented 3 years ago

I haven't tested it, but it should be possible to use -g to process just one page.

The problem is that we want to avoid creating new fileGrps just for viewing. We would need to re-load the workspace model (expensive), and the temporary fileGrps would have to be removed afterwards.

So we actually need some API or non-OCR-D CLI integration here – independent of METS, perhaps in-memory altogether. Even if the alignment/diff rendering is expensive, it could be cached (and perhaps calculated asynchronously, so the UI would not stall)...
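
A sketch of how that could look, assuming the align() helper sketched earlier in this thread and GLib's main loop for handing results back to the UI thread; cached_diff and diff_async are made-up names for illustration:

```python
import threading
from functools import lru_cache

from gi.repository import GLib

@lru_cache(maxsize=128)
def cached_diff(gt_text: str, ocr_text: str) -> tuple:
    """Expensive alignment, computed at most once per text pair."""
    # align() is the grapheme-level aligner sketched earlier in this thread.
    return tuple(align(gt_text, ocr_text))

def diff_async(gt_text: str, ocr_text: str, on_done) -> None:
    """Run the alignment in a worker thread, deliver the result on the UI thread."""
    def worker():
        segments = cached_diff(gt_text, ocr_text)
        # GLib.idle_add schedules on_done in the GTK main loop,
        # so the UI never blocks on the alignment itself.
        GLib.idle_add(on_done, segments)
    threading.Thread(target=worker, daemon=True).start()
```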

hnesk commented 3 years ago

There is a proof of concept in the diff-view branch. For now it simply uses the built-in Python difflib.SequenceMatcher without any notion of a possibly preexisting segmentation. The algorithm is really quite naive, but works for me. It shouldn't be too hard to wrap other algorithms to return their results in a TaggedText class, but I'd really like to extend the TaggedText/TaggedString data model first to include some more information (especially the IDs of the TextNodes) before merging.
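
To illustrate the idea (this is not the actual TaggedText API from the diff-view branch, which is more elaborate), SequenceMatcher opcodes map quite directly onto a list of tagged chunks that a view can then style:

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class TaggedChunk:
    """One aligned chunk; a stand-in for the TaggedText/TaggedString model."""
    tag: str    # 'equal', 'replace', 'delete' or 'insert'
    left: str   # text from the first fileGrp
    right: str  # text from the second fileGrp

def tag_diff(left: str, right: str) -> list:
    """Turn a pair of line texts into display-ready tagged chunks."""
    matcher = SequenceMatcher(None, left, right, autojunk=False)
    return [TaggedChunk(op, left[i1:i2], right[j1:j2])
            for op, i1, i2, j1, j2 in matcher.get_opcodes()]
```

A renderer (dinglehopper's HTML, Gtk.TextBuffer tags, or a GtkSource syntax) then only has to iterate over the chunks and color the non-'equal' ones.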

kba commented 3 years ago

Very nice, here's how that looks, comparing calamari/tesseract output from ocrd-galley:

[screenshot: side-by-side diff view of calamari vs. tesseract output]

hnesk commented 3 years ago

Closed by #29