Closed Shreeshrii closed 4 years ago
Questions:
- Which is the correct command to use for segmentation?
- If I want to train for segmentation for Devanagari images, what will be the process to follow?
- Is it possible to support tesseract-generated ALTO files as input for segmentation training? I am attaching a sample file.
- Which is the correct command to use for segmentation?
There was an auto-switch regression where the new segmenter wasn't selected despite the warning to the contrary. I fixed it a couple of days ago in the trainable segmenter branch with region support, but that isn't ready for merging yet. Just using the `-bl` switch to explicitly select the new segmenter is correct.
The blla.mlmodel you used was trained on an older state of the code and isn't compatible anymore. There will be another deprecation of existing models with the merging of the regions branch, necessitated by the explicit coding of line direction in there.
I've got some models trained on a larger dataset with augmentation somewhere. Let me dig them up.
- If I want to train for segmentation for Devanagari images, what will be the process to follow?
The simplest way is to have a bunch of PageXML files or ALTOs with baseline information (scheduled for standard inclusion with the next revision; kraken/escriptorium output is compatible already, everything else most likely not). Then you just point `ketos segtrain` at them and wait for a while (a long one without a GPU):
```
ketos segtrain -f page -N 100 -q dumb --augment -o seg_model *.xml
```
or for ALTO:
```
ketos segtrain -f alto -N 100 -q dumb --augment -o seg_model *.xml
```
There's also a legacy path format which is just a JSON file with a list of polylines.
MCC values of around 0.7+ on the validation set are decent. They are not directly correlated to the actual segmentation accuracy, which is quite a bit lower.
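For reference, MCC (the Matthews correlation coefficient) is computed from confusion-matrix counts; a quick standalone illustration (not kraken's internal code):

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal is empty (the usual convention).
    Ranges from -1 (total disagreement) to +1 (perfect prediction)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# A classifier with a moderate number of errors lands around 0.75:
print(round(mcc(tp=90, tn=85, fp=15, fn=10), 3))
```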
You use the output models as you've used the previous ones.
For your tesseract output: The ALTO is not compatible (no baselines) but I did a short test with semi-manually harvested training data of random archive.org documents using tesseract's hocr output (its segmenter is overall worse than the old kraken segmenter but its noise level is lower so there are more 'perfect' pages). There's a script converting hocr to the path format at [0]. You might want to use it to quickly produce training data. Be aware though that tesseract's baseline estimations can be quite a bit off; they are frequently placed 3-5 pixels below the actual one which can cause polygonization errors.
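For context on where those baselines come from: hOCR encodes each line's baseline in the `title` attribute of its `ocr_line` element as `baseline <slope> <offset>`, measured from the bottom-left corner of the line's bounding box. A minimal sketch of turning that into two-point baselines (this is not the script at [0], and a real converter should use a proper HTML parser rather than a regex):

```python
import re

# Assumes title follows class in the tag, which holds for tesseract output.
LINE_RE = re.compile(r"class=['\"]ocr_line['\"][^>]*title=['\"]([^'\"]*)['\"]")

def hocr_baselines(hocr: str):
    """Extract (start, end) baseline points from hOCR ocr_line titles.

    hOCR stores 'bbox x0 y0 x1 y1; baseline slope offset', where the
    baseline is y = slope * (x - x0) + y1 + offset in image coordinates
    (y grows downwards, so tesseract's offsets are usually negative)."""
    baselines = []
    for title in LINE_RE.findall(hocr):
        fields = {}
        for part in title.split(';'):
            if not part.strip():
                continue
            key, *vals = part.split()
            fields[key] = vals
        x0, y0, x1, y1 = map(int, fields['bbox'])
        slope, offset = map(float, fields.get('baseline', ['0', '0']))
        start = (x0, y1 + offset)
        end = (x1, y1 + slope * (x1 - x0) + offset)
        baselines.append((start, end))
    return baselines

sample = '<span class="ocr_line" title="bbox 10 20 210 60; baseline 0.01 -5">'
print(hocr_baselines(sample))  # [((10, 55.0), (210, 57.0))]
```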
I've got some models trained on a larger dataset with augmentation somewhere. Let me dig them up.
Thanks, that will be great.
There's a script converting hocr to the path format at [0]. You might want to use it to quickly produce training data. Be aware though that tesseract's baseline estimations can be quite a bit off; they are frequently placed 3-5 pixels below the actual one which can cause polygonization errors.
This will be helpful in trying to build a segmenter for Devanagari. Is there a way to change the script to handle the error in tesseract's baseline calculation?
Questions:
- How many page images should I use for segtrain?
- Can I use offsplit.mlmodel as a base to continue from?
This will be helpful in trying to build a segmenter for Devanagari. Is there a way to change the script to handle the error in tesseract's baseline calculation?
I haven't looked into how that error happens or if it is systematic and therefore fixable. That whole thing was an experiment of a few hours to see if I could bootstrap good enough training data by just rejecting all erroneous segmentation output of an existing OCR engine. The easiest way would be to simply adjust the baseline upwards/downwards by a few pixels, depending on whether tesseract places it on the actual top Devanagari baseline or at the bottom. From our experiments with Hebrew both work, and kraken doesn't really care where exactly the 'baseline' is; being somewhere between the actual base- and mean line is sufficient to get robust polygonization.
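That pixel adjustment is trivial to script; a sketch, assuming baselines are lists of (x, y) points in image coordinates (y grows downwards) and the delta is tuned per dataset:

```python
def shift_baseline(points, delta_y):
    """Shift a baseline (a list of (x, y) points, image coordinates
    with y growing downwards) vertically. Negative delta_y moves the
    baseline up, towards the mean line."""
    return [(x, y + delta_y) for x, y in points]

# Tesseract's estimate sits a few pixels too low; move it up by 4:
tess_baseline = [(12, 104), (480, 107)]
print(shift_baseline(tess_baseline, -4))  # [(12, 100), (480, 103)]
```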
How many page images should I use for segtrain?
Anywhere between 50 and 400 seems to produce state of the art or slightly better results.
Can I use offsplit.mlmodel as a base to continue from?
Yes. Same syntax as with `ketos train`, just input an existing model.
There will be another deprecation of existing models with the merging of the regions branch, necessitated by the explicit coding of line direction in there.
Does that mean that the Devanagari model I trained on line images cannot be used with the new development version of kraken?
The new segmenter fixes all that by being trainable, but it works fundamentally differently and the `ketos transcribe`/`ketos extract` workflow won't be adapted for it.
Will there be another way to review/correct the page level ground truth?
Does that mean that the Devanagari model I trained on line images cannot be used with the new development version of kraken?
Recognition models trained on bounding box data do not work with baseline segmenter output. They are still supported but you're stuck with the old segmenter (and if you try anyway you get a scary warning and crap output). I've been looking into semi-supervised transfer learning and there might be an avenue to use that for adaptation without having to create new training data but it isn't a particularly high priority.
Segmentation models trained on `blla` won't work anymore once I merge the region detection code in `blla_regions` back. As we mostly know everybody who trained one up to now and the number is small, there isn't a particular need to preserve backward compatibility with a development branch.
Will there be another way to review/correct the page level ground truth?
escriptorium is what we use for that, but it is much more than that (a digitization platform with annotation support). There probably won't be a replacement in kraken directly, as nobody here wants to build one when a working alternative exists. If somebody else sends a pull request it can get merged, though.
Thanks! I will wait for the new release.
The simplest way is to have a bunch of PageXML files or ALTOs with baseline information
Many available GT sets do not contain baseline information but rather polygonal line shapes. Is it possible to somehow generate baseline information from polygons? I.e. the reversal of the baseline to polygon transformation which is performed after the baseline detection.
Many available GT sets do not contain baseline information but rather polygonal line shapes. Is it possible to somehow generate baseline information from polygons? I.e. the reversal of the baseline to polygon transformation which is performed after the baseline detection.
Unfortunately not yet. On relatively clean writing or print you could probably do a fairly reliable estimation with some filtering along the lines of what the `CenterNormalizer` in the old box-processing pipeline does, but nobody has tried that as far as I know.
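One naive version of such an estimation, purely as a sketch (the fraction is an arbitrary assumption and real line polygons are rarely this regular): place a horizontal baseline a fixed fraction of the way down the polygon's vertical extent.

```python
def polygon_to_baseline(polygon, frac=0.8):
    """Estimate a straight two-point baseline from a polygonal line
    outline (list of (x, y) points, y growing downwards).

    Places the baseline frac of the way down the polygon's vertical
    extent, i.e. roughly between the mean line and the descender line
    for Latin-style scripts. Ignores skew and curvature entirely."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    y = min(ys) + frac * (max(ys) - min(ys))
    return [(min(xs), y), (max(xs), y)]

poly = [(0, 10), (200, 10), (200, 50), (0, 50)]
print(polygon_to_baseline(poly))  # [(0, 42.0), (200, 42.0)]
```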
I did the following: Gaussian filter, find the best rotation angle (the one with the greatest horizontal projection maximum between two limits, e.g. -3 and +3 degrees), then take the coordinates of the maximum profile at both ends as the two baseline points. Such baselines would always be straight, of course.
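That procedure can be sketched in numpy alone by approximating small rotations with a vertical shear of each column (the Gaussian smoothing step is omitted here, and the wrap-around from `np.roll` is ignored, which is acceptable for small angles on padded line images):

```python
import numpy as np

def deskew_baseline(img, angles=np.arange(-3.0, 3.25, 0.25)):
    """Sketch of the procedure above: approximate small rotations by a
    vertical shear, pick the angle whose horizontal projection has the
    sharpest peak, then return the peak row's ink extent as a straight
    two-point baseline (coordinates are in the sheared frame).

    img is a binarized line image (ink > 0), angles are in degrees."""
    h, w = img.shape
    xs = np.arange(w)

    def shear(a):
        # Shift each column vertically by tan(angle) * x.
        shifts = np.round(np.tan(np.radians(a)) * xs).astype(int)
        out = np.zeros_like(img)
        for x, s in zip(xs, shifts):
            out[:, x] = np.roll(img[:, x], s)
        return out

    best = max(angles, key=lambda a: shear(a).sum(axis=1).max())
    sheared = shear(best)
    row = int(sheared.sum(axis=1).argmax())
    cols = np.nonzero(sheared[row])[0]
    return float(best), [(int(cols[0]), row), (int(cols[-1]), row)]
```

For a line whose densest ink row is horizontal, this recovers an angle near zero and the row's left/right extent; note that for Latin scripts the projection peak sits near the x-height zone rather than the typographic baseline, matching the "maximum profile" wording above.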
kraken, version 3.0.0.0b4.dev9
Since blla.model is not available, I tested using `offsplit.mlmodel` and `cbad.mlmodel`, which have been referred to in other posts, with an image with Devanagari script. Both models create the same output with the above command.
However, when I add `--bl` to the above commands, `offsplit.mlmodel` gives a different output while `cbad.mlmodel` gives an error.