PonteIneptique closed this issue 4 years ago
Sorry, I don't really understand what you're trying to achieve. Are you providing multiple external segmentations to the ocr subcommand when defining more than one input file? Or are you running multiple models on the same data? The first is definitely something we should implement, although I'd prefer a format string similar to how ketos extract output can be defined.
The first, so that I can do the second later :)
Ah, to do the second, use the ketos test command if you just want to compare the accuracy of multiple models. It will be fairly slow, though, because it re-encodes/re-normalizes the complete dataset for each model.
If you want to create a pull request of your code, I can merge it. Ideally it would follow a format similar to extract; we've got way too much slightly different behavior anyway.
I'll PR ASAP. For my precise use case, it's not testing per se: I actually have a manuscript with a free interpretation of spaces, which makes it difficult even for a human reader to decide. Sometimes spaces were inserted in the transcription data where they should not have been, sometimes they were not. For further quantitative research, we created two models: one with the space character, one without; hence the need to run both models later (and not go through segmentation twice).
I assume it's something in scripta continua or similar?
For further quantitative research, we created two models: one with the space character, one without; hence the need to run both models later (and not go through segmentation twice).
The segmentation is deterministic, so running it twice doesn't really hurt, except for being an additional expenditure of CPU cycles.
Being an additional expenditure of CPU cycles.
That's what I am trying to avoid. :) Better not run it every time I update my model. :)
I assume it's something in scripta continua or similar?
More or less, yes. There are spaces, but not always.
@mittagessen Kindly close the topic
With the region code of the new segmenter we've got the ability to input multiple ALTO/PageXML files to supply existing segmentations to the ocr submodule.
The idea would be to run segmentation once and then run different models using a glob-like input with a suffix for the segmentation files.
Right now, I deal with this like that
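
The snippet that followed is not included here. Purely as an illustration of the suffix idea above, and not as the author's script or as kraken's API, here is a minimal Python sketch that pairs each image from a glob with a previously saved segmentation file. The helper name `pair_with_segmentation`, the `.seg.xml` suffix and the `pages/` directory are all assumptions made for the example.

```python
from pathlib import Path


def pair_with_segmentation(images, seg_suffix=".seg.xml"):
    """Pair each image path with a sibling segmentation file.

    seg_suffix is a hypothetical naming convention, not something kraken
    defines: page1.png is expected to sit next to page1.seg.xml.
    Returns (pairs, missing), where pairs is a list of (image, segmentation)
    tuples and missing lists the images that have no segmentation file yet.
    """
    pairs, missing = [], []
    for image in map(Path, images):
        # Replace the image extension with the segmentation suffix.
        seg = image.with_name(image.stem + seg_suffix)
        if seg.is_file():
            pairs.append((image, seg))
        else:
            missing.append(image)
    return pairs, missing


# Example usage (the "pages" directory is an assumption):
# pairs, missing = pair_with_segmentation(sorted(Path("pages").glob("*.png")))
# for image, seg in pairs:
#     print(f"{image} -> {seg}")
```

With such pairs in hand, segmentation would only need to be rerun for the images listed in `missing`, and each recognition model could then be run over the same stored segmentations.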