Closed bertsky closed 3 years ago
> - Or we integrate dinglehopper's HTML and display it via WebKit directly.
…is what #25 brought. Still, creating comparisons on the fly (without the need to run ocrd-dinglehopper on the complete workspace) would be preferable IMHO. And when it is clear that both sides have the same line segmentation, a simple diff highlighter might still be better. So let's keep this open for discussion etc.
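For the same-segmentation case, a highlighter could be as small as the following sketch. It pairs lines by position and marks changed words in the bracket notation that wdiff and git diff --word-diff=plain emit ([-removed-] and {+added+}). The names word_diff and diff_lines are made up for illustration, and difflib is only one possible alignment backend.

```python
from difflib import SequenceMatcher


def word_diff(left_line: str, right_line: str) -> str:
    """Mark word-level changes as [-removed-] and {+added+}."""
    left, right = left_line.split(), right_line.split()
    matcher = SequenceMatcher(None, left, right, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(left[i1:i2])
            continue
        if i2 > i1:  # 'delete' or 'replace': words only on the left side
            out.append("[-" + " ".join(left[i1:i2]) + "-]")
        if j2 > j1:  # 'insert' or 'replace': words only on the right side
            out.append("{+" + " ".join(right[j1:j2]) + "+}")
    return " ".join(out)


def diff_lines(left_lines, right_lines):
    """Pair lines by position; assumes both sides share one line segmentation."""
    return [word_diff(a, b) for a, b in zip(left_lines, right_lines)]
```

A custom highlighter would then only have to colorize the two bracket forms within each line.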
> Still, creating comparisons on the fly (without the need to run ocrd-dinglehopper on the complete workspace) would be preferable IMHO
I haven't tested it, but it should be possible to use -g to process just one page. I also have some speed improvements planned, so I guess that should help too.
> I haven't tested it, but it should be possible to use -g to just process one page.
The problem is that we want to avoid creating new fileGrps just for viewing. We would need to re-load the workspace model (expensive), and the temporary fileGrps would have to be removed afterwards.
So we actually need some API or non-OCRD CLI integration here – independent of METS, perhaps in-memory altogether. Even if the alignment/diff-rendering is expensive, it could be cached (and perhaps calculated asynchronously, so the UI would not stall)...
There is a proof of concept in the branch diff-view. For now it simply uses Python's built-in difflib.SequenceMatcher, without any notion of a possibly preexisting segmentation. The algorithm is really quite naive, but works for me. It shouldn't be too hard to wrap other algorithms so that they return their results in a TaggedText class, but I'd really like to extend the TaggedText/TaggedString data model first to include some more information (especially the id of the TextNodes) before merging.
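To illustrate the general idea (not the actual code in the branch), here is a minimal sketch of turning SequenceMatcher opcodes into tagged segments. TaggedSegment and its fields are assumptions made for this example; the real TaggedText/TaggedString classes may look quite different.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List


@dataclass(frozen=True)
class TaggedSegment:
    """Hypothetical stand-in for a tagged string; fields are assumed."""
    tag: str    # 'equal', 'delete', 'insert' or 'replace'
    left: str   # text from the first file group
    right: str  # text from the second file group


def tagged_diff(left: str, right: str) -> List[TaggedSegment]:
    """Character-level diff of two strings as a list of tagged segments."""
    matcher = SequenceMatcher(None, left, right, autojunk=False)
    return [
        TaggedSegment(tag, left[i1:i2], right[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]
```

Any other alignment algorithm could be wrapped the same way, as long as it yields spans over both inputs; extra fields such as a node id would go into the dataclass.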
Very nice, here's how that looks, comparing calamari/tesseract output from ocrd-galley:
Closed by #29
This is clearly a desideratum here, but how do we approach it?
Considerations:
- […] FileGroupSelectors instead of 1
- […] O(n²) (or O(n³) under arbitrary weights). There are many different packages for this on PyPI with various levels of features (including cost functions or weights) and efficiency (including C library backends). But not all of them are […] aͤ vs ä or ſt vs ſt or even ſ vs s.
- […] GtkSource.LanguageManager, an off-the-shelf highlighter that would lend itself is diff (coloring diff -u line output). But this does not colorize within the lines (like git diff --word-diff, wdiff, dwdiff etc), which is the most important contribution IMHO. So perhaps we need to use some existing word-diff syntax and write our own highlighter after all.
- Or we integrate dinglehopper's HTML and display it via WebKit directly.
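As a point of reference for the quadratic cost and the Unicode concern mentioned in the considerations, here is a small sketch of a unit-cost edit distance preceded by a normalization step. The EQUIVALENCES table is purely illustrative (an assumption, not a recommendation); which variants should count as equal depends on the transcription conventions of the texts being compared.

```python
import unicodedata

# Illustrative equivalences only; a real table is a policy decision.
EQUIVALENCES = {
    "a\u0364": "\u00e4",  # a + combining small e above -> a with diaeresis
    "\u017f": "s",        # long s -> round s
}


def normalize(text: str) -> str:
    """Apply NFC, then the (hypothetical) equivalence table."""
    text = unicodedata.normalize("NFC", text)
    for source, target in EQUIVALENCES.items():
        text = text.replace(source, target)
    return text


def edit_distance(left: str, right: str) -> int:
    """Levenshtein distance with unit costs, O(len(left) * len(right)) time."""
    previous = list(range(len(right) + 1))
    for i, a in enumerate(left, start=1):
        current = [i]
        for j, b in enumerate(right, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (a != b),  # substitution or match
            ))
        previous = current
    return previous[-1]
```

Without normalization, code points that a reader would consider the same character are counted as errors; comparing normalize(a) with normalize(b) removes exactly the differences listed in the table and nothing else.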