naptha / tesseract.js

Pure Javascript OCR for more than 100 Languages 📖🎉🖥
http://tesseract.projectnaptha.com/
Apache License 2.0

Add line size metrics (ascender, descender, size) to `line` objects in `blocks` output #906

Closed Balearica closed 6 months ago

Balearica commented 6 months ago

There is currently no easy way to retrieve accurate line size metrics using Tesseract.js. Several sub-optimal ways of accomplishing this are listed below.

The ascender, descender, and row height metrics from the hOCR output should be added to each `line` object in the `blocks` output format. This would make accurate line size data available directly, without parsing a second output format.
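For illustration, here is a minimal sketch of one workaround in the meantime: reading the metrics out of the hOCR string. It assumes the line-level `title` attribute carries `x_size`, `x_descenders`, and `x_ascenders` properties, as Tesseract's hOCR renderer normally emits them. The function name `parseHocrLineMetrics` and the returned field names are hypothetical, chosen only to mirror the names in this issue's title.

```javascript
// Sketch only: extract per-line size metrics from an hOCR string.
// `parseHocrLineMetrics` is a hypothetical helper, not part of Tesseract.js.
function parseHocrLineMetrics(hocr) {
  // Match the opening tag of every element with class "ocr_line".
  const tagRe = /<span\b[^>]*\bclass=(["'])ocr_line\1[^>]*>/g;
  const metrics = [];

  for (const [tag] of hocr.matchAll(tagRe)) {
    // The title attribute holds ";"-separated properties, e.g.
    // "bbox 36 92 580 122; baseline 0 -6; x_size 30; ..."
    const title = /\btitle=(["'])(.*?)\1/.exec(tag);
    if (!title) continue;

    const props = {};
    for (const part of title[2].split(';')) {
      const [name, ...values] = part.trim().split(/\s+/);
      if (name) props[name] = values.map(Number);
    }

    // Return the first numeric value of a property, or null if absent.
    const first = (key) =>
      props[key] && props[key].length > 0 ? props[key][0] : null;

    metrics.push({
      bbox: props.bbox || null,
      size: first('x_size'),
      ascender: first('x_ascenders'),
      descender: first('x_descenders'),
    });
  }
  return metrics;
}
```

The input would be the hOCR string from a recognition result (for example `data.hocr`, assuming hOCR output is enabled). Matching these entries back to the `line` objects in `blocks` still has to be done by position or bounding box, which is part of why this approach is sub-optimal.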

Balearica commented 6 months ago

Tesseract has a `RowAttributes` getter; however, it was added so recently that it is not yet accessible through Tesseract.js-core. Implementing this will therefore require a new minor version of Tesseract.js-core. The implementation must not break code for users on older versions of Tesseract.js-core, where the getter is unavailable.
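One way to keep older cores working is feature detection: only call the getter when the binding exists, and otherwise leave the metrics unset. The sketch below is an assumption about shape, not the actual Tesseract.js-core API. The method name `getRowAttributes`, its return fields, and the helper `buildLine` are all hypothetical stand-ins.

```javascript
// Sketch only: guard a newly added binding behind feature detection.
// `iterator.getRowAttributes` is a HYPOTHETICAL method name standing in for
// however a new Tesseract.js-core release might expose RowAttributes.
function readRowAttributes(iterator) {
  if (!iterator || typeof iterator.getRowAttributes !== 'function') {
    // Older core build: the getter does not exist, so report "unavailable".
    return null;
  }
  // Assumed (hypothetical) return shape: { rowHeight, descenders, ascenders }.
  const { rowHeight, descenders, ascenders } = iterator.getRowAttributes();
  return { size: rowHeight, ascender: ascenders, descender: descenders };
}

// Build a line object, leaving `rowAttributes` as null when the metrics
// cannot be read, so existing consumers see no behavior change.
function buildLine(iterator, text) {
  return { text, rowAttributes: readRowAttributes(iterator) };
}
```

With this pattern, a line built against an old core simply has `rowAttributes: null`, and callers that need the metrics can check for that before using them.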