internetarchive / archive-hocr-tools

Efficient hOCR tooling

Make tools more usable with pipes #1

Open MerlijnWajer opened 3 years ago

MerlijnWajer commented 3 years ago

Many of the tools currently cannot read from special files such as /dev/stdin in bash, or accept input from stdin in general, because of some unnecessary seeks.
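To illustrate what seek-free reading could look like, here is a minimal sketch (not the project's actual code) that walks an hOCR stream forward only, using the standard library's `xml.etree.ElementTree.iterparse`. The function name `iter_word_texts` is hypothetical, and the sketch assumes the input is well-formed XHTML-style hOCR with words marked as `span` elements of class `ocrx_word`.

```python
import xml.etree.ElementTree as ET


def iter_word_texts(stream):
    """Yield the text of every ocrx_word element, reading forward only.

    `stream` is any binary file-like object with a read() method; it is
    never asked to seek, so a pipe or /dev/stdin should work as well.
    (Hypothetical helper, for illustration only.)
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        # Tags may carry the XHTML namespace, so compare the local name only.
        local_name = elem.tag.rsplit("}", 1)[-1]
        classes = elem.get("class", "").split()
        if local_name == "span" and "ocrx_word" in classes:
            yield "".join(elem.itertext())
        elif local_name == "div" and "ocr_page" in classes:
            # Drop the finished page's children to keep memory use bounded.
            elem.clear()
```

A script could then call `iter_word_texts(sys.stdin.buffer)` and never touch `seek()` or `tell()`.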

Additionally, it would be nice to add features such as filtering by word confidence. This could be done in hocr-text, but we could also have a streaming hOCR filter tool that takes hOCR as input and outputs hOCR, letting through only words with at least a certain confidence. It would need to be streaming, which makes it a little tricky, but it would be cool to, for example, pipe Tesseract output directly into such a tool.
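One possible shape for such a filter, as a rough sketch only: a SAX handler built on the standard library's `xml.sax.saxutils.XMLGenerator` that re-emits events as they arrive and suppresses any word element below a threshold. The names `ConfidenceFilter`, `filter_hocr`, and `word_confidence` are hypothetical. The sketch assumes word confidence appears as `x_wconf N` in the `title` attribute of `ocrx_word` spans (as in Tesseract's hOCR output), and it does not preserve the DOCTYPE or comments.

```python
import re
import xml.sax
from xml.sax.saxutils import XMLGenerator

# Assumption: confidence is stored as "x_wconf <number>" in the title attribute.
_WCONF = re.compile(r"\bx_wconf\s+(\d+(?:\.\d+)?)")


def word_confidence(title):
    """Return the x_wconf value from a title string, or None if absent."""
    match = _WCONF.search(title or "")
    return float(match.group(1)) if match else None


class ConfidenceFilter(XMLGenerator):
    """Pass SAX events through, dropping low-confidence ocrx_word elements."""

    def __init__(self, out, min_conf):
        super().__init__(out, encoding="utf-8", short_empty_elements=False)
        self._min_conf = min_conf
        self._skip_depth = 0  # > 0 while inside a suppressed word element

    def startElement(self, name, attrs):
        if self._skip_depth:
            self._skip_depth += 1  # nested element inside a dropped word
            return
        if name == "span" and "ocrx_word" in attrs.get("class", "").split():
            conf = word_confidence(attrs.get("title"))
            if conf is not None and conf < self._min_conf:
                self._skip_depth = 1
                return
        super().startElement(name, attrs)

    def endElement(self, name):
        if self._skip_depth:
            self._skip_depth -= 1
            return
        super().endElement(name)

    def characters(self, content):
        if not self._skip_depth:
            super().characters(content)

    def ignorableWhitespace(self, content):
        if not self._skip_depth:
            super().ignorableWhitespace(content)


def filter_hocr(source, out, min_conf):
    """Read hOCR from `source` and write the filtered document to `out`."""
    xml.sax.parse(source, ConfidenceFilter(out, min_conf))
```

Wired up as `filter_hocr(sys.stdin.buffer, sys.stdout, 50)`, this would sit in the middle of a pipeline. Words without an `x_wconf` value are passed through unchanged here; whether that is the right default is an open design choice.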