naptha / tesseract.js

Pure Javascript OCR for more than 100 Languages 📖🎉🖥
http://tesseract.projectnaptha.com/
Apache License 2.0

Parameters set using `createWorker` `config` argument overwritten by default arguments #975

Open Balearica opened 1 day ago

Balearica commented 1 day ago

The createWorker config argument allows for setting parameters prior to initialization. While this argument was originally added to support a handful of init-only parameters (notably load_system_dawg, load_number_dawg, and load_punc_dawg), it should be able to support all parameters, and nothing in the documentation indicates that it supports only specific parameters.

However, at present, any settings provided in this config argument that conflict with the default parameters defined in defaultParams.js are overwritten by the defaults. It looks like this only impacts tessedit_pageseg_mode and tessedit_char_whitelist, as these are the only Tesseract parameters in the defaults file.

https://github.com/naptha/tesseract.js/blob/a936162d92b03bc04f51c4bfb5db14e588209838/src/worker-script/index.js#L308-L309
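The ordering problem described above can be modeled without Tesseract at all. The sketch below is a toy model, not the code at the linked lines: a plain object stands in for the Tesseract parameter store, `initializeWithBug` is a hypothetical name, and the default values shown are illustrative placeholders.

```javascript
// Toy model only: a plain object stands in for the Tesseract parameter store.
// Names and values here are illustrative assumptions, not the library's code.
const illustrativeDefaults = {
  tessedit_pageseg_mode: '6',
  tessedit_char_whitelist: '',
};

function initializeWithBug(config) {
  const params = {};
  // Step 1: the user-supplied config is applied at initialization.
  Object.assign(params, config);
  // Step 2: the defaults are applied afterwards, so any key that appears
  // in both places ends up with the default value, not the user's value.
  Object.assign(params, illustrativeDefaults);
  return params;
}

const result = initializeWithBug({
  tessedit_pageseg_mode: '7', // conflicts with a default, so it is lost
  load_system_dawg: '0',      // not in the defaults, so it survives
});
console.log(result);
```

In this model, tessedit_pageseg_mode comes back as the default while load_system_dawg keeps the user's value, which matches the observation that only parameters present in the defaults file are affected.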

I will investigate the commit history before making a change; however, I currently believe the code that sets the default Tesseract parameters can be cut entirely. Both values we are setting are already the defaults for the Tesseract API, so it's unclear why we are setting them manually.

Balearica commented 1 day ago

Upon a brief review, it looks like setting the default parameters here may have served a couple of different purposes in the past.

  1. At specific points in this repo's history, our default arguments have been different from those of Tesseract
    1. E.g. this version of the file sets user_defined_dpi to 300, which is not a default behavior.
  2. A previous version of the repo combined the defaults with user-defined parameters, which makes much more sense than what happens now.
    1. https://github.com/naptha/tesseract.js/blob/de4b98ae23202929471ae8483939c009e5f421b0/src/common/workerUtils.js#L63-L84
    2. I don't think it's necessary to implement this, however, as I do not believe our defaults are any different from the Tesseract defaults.
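For comparison, combining defaults with user-defined parameters could look roughly like the sketch below. This is not the code from the linked revision; `mergeParams` is a hypothetical helper and the default values are illustrative placeholders. The only point is the ordering: defaults are spread first and user values last, so the user's values win on conflict.

```javascript
// Hypothetical helper, not taken from the repository.
// Defaults go first and user values last, so user values take precedence.
function mergeParams(defaults, userParams = {}) {
  return { ...defaults, ...userParams };
}

// Illustrative placeholder defaults (assumed values).
const exampleDefaults = {
  tessedit_pageseg_mode: '6',
  tessedit_char_whitelist: '',
};

const merged = mergeParams(exampleDefaults, { tessedit_pageseg_mode: '7' });
console.log(merged);
```

Here the user's tessedit_pageseg_mode is kept and the untouched default for tessedit_char_whitelist is still present. If the library's defaults really are identical to Tesseract's own, this merge adds nothing over simply dropping the defaults.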

I am now fairly confident that this can be cut without consequence, so I will do so.

Balearica commented 1 day ago

If we cut the settings discussed above, the only things left in the defaultParams.js file are the tessjs_create_hocr/tessjs_create_tsv/etc. settings, which were deprecated multiple major releases ago. Therefore, we should be able to cut that entire file. The only thing to confirm is that the default output formats stay the same before and after; otherwise this would be a breaking change.