naptha / tesseract.js

Pure Javascript OCR for more than 100 Languages 📖🎉🖥
http://tesseract.projectnaptha.com/
Apache License 2.0

Disable non-text output formats by default #916

Open Balearica opened 5 months ago

Balearica commented 5 months ago

By default, 4 different output formats are produced: text, blocks, hocr, and tsv. It's safe to say that few if any users make use of more than one format. However, producing all 4 formats can significantly inflate runtime. This is especially true for blocks, which iterates over every individual symbol (and every symbol choice) in the data and retrieves information about each one.

I recently encountered an image where creating the blocks output took 12 seconds, whereas running recognition took just 10 seconds. While this is uncharacteristically long, it is unacceptable for a default option that few users benefit from to more than double runtime for any image. Even outside of this fringe case, testing on other documents shows that creating blocks often adds 0.25-0.50 seconds to runtime when scanning documents, which is a non-trivial increase.

I think it makes sense to leave text on by default, as presumably this is the most used and quickest to render, and some output format needs to be enabled by default. However, other formats should not be enabled unless the user actually wants them.

This is a breaking change, so it would need to wait until Tesseract.js v6. Users who want the previous behavior could restore it simply by specifying the formats manually in the output argument to worker.recognize.
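
For illustration, here is a minimal sketch of what specifying the formats explicitly might look like. The helper name recognizeWithFormats and the LEGACY_OUTPUT constant are hypothetical and not part of Tesseract.js. The sketch assumes worker.recognize takes the image, an options object, and an output object keyed by format name, which matches the "output argument" described above.

```javascript
// The four formats the issue says are produced by default today.
const LEGACY_OUTPUT = { text: true, blocks: true, hocr: true, tsv: true };

// Hypothetical helper (not part of Tesseract.js): passes an explicit output
// argument to worker.recognize so the result does not depend on the defaults.
async function recognizeWithFormats(worker, image, formats = LEGACY_OUTPUT) {
  // Second argument is the recognition options object; left empty here.
  const { data } = await worker.recognize(image, {}, { ...formats });
  return data;
}

// Usage sketch, assuming tesseract.js is installed and the v5-style
// createWorker('eng') call is available:
//   const { createWorker } = require('tesseract.js');
//   const worker = await createWorker('eng');
//   const data = await recognizeWithFormats(worker, 'page.png');
//   await worker.terminate();
```

Code written this way would keep receiving all four formats after the proposed change. Under the current defaults, passing something like { text: true, blocks: false, hocr: false, tsv: false } as the formats should skip the slower outputs, although that has not been verified here.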