naptha / tesseract.js

Pure Javascript OCR for more than 100 Languages 📖🎉🖥
http://tesseract.projectnaptha.com/
Apache License 2.0

Regression since 2.1.5 on img tag #723

Closed Jimmy-Z closed 1 year ago

Jimmy-Z commented 1 year ago

I was trying to revive this user script: https://greasyfork.org/en/scripts/416140-rarbg-threat-defence-bypasser. It was using 2.1.4, and if I switch to any later version other than 2.1.5, it complains about a truncated file.

The script tries to recognize text in an img tag. I found that the src URL can only be retrieved once; after that you only get an empty file. My guess is that 2.x read the image directly from the DOM, while later versions try to retrieve the image file from that src again.

If you want to test this (you might not want to, since it involves accessing a shady website), here are the steps:

  1. Go to - (referred to as "the link" from now on); you will be redirected to a "threat_defence" page.
    • If you refresh that threat defence page, the image will no longer show because of the quirk I mentioned earlier, so you'll have to go back to the link again.
    • If you passed that check and want to test again, clear cookies.
  2. Load the script; I use Violentmonkey.
  3. Access the link and you'll see it works.
  4. Then change @require to use a later version, clear cookies, and test again; you'll see an error logged in the browser console.

Balearica commented 1 year ago

Using an example image in this repo, I tried recognizing the same <img> tag multiple times and did not have any issues.

If you would like this followed up on, please provide a standalone reproducible example of the bug you describe (using only Tesseract.js and an example image). Any of our examples could be used as starting points. As a general rule, I do not attempt to troubleshoot other projects as there are too many moving parts unrelated to Tesseract.js.

Jimmy-Z commented 1 year ago

It's not reproducible in a static test. As I said in the OP, I think the problem is that the src URL becomes inaccessible after the browser loads it.

It's reliably reproducible in that specific test. I've gone back and forth more than 10 times, and the only difference is the version of Tesseract I used.

I suspect 2.1.5 worked because it didn't read the image by URL; maybe it read the image data from the img tag directly, something similar to: https://stackoverflow.com/a/10755011/266741

Or there's some difference between 2.1.5 and later versions that causes 2.1.5's fetch to hit the browser cache, while later versions' fetches don't.
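
One way to check the "URL can only be retrieved once" guess without Tesseract in the loop is to request the image URL again and look at how many bytes come back. This is only a minimal sketch: `refetchByteLength` is a hypothetical helper name (not part of Tesseract.js), and it assumes the page is allowed to `fetch` that URL.

```javascript
// Hypothetical diagnostic helper (not part of Tesseract.js):
// request an image URL again and report how many bytes the server returns.
// fetchImpl is injectable so the check can also be exercised outside a browser.
async function refetchByteLength(url, fetchImpl = fetch) {
  // cache: 'no-store' skips the HTTP cache, so the server is really asked again.
  const response = await fetchImpl(url, { cache: 'no-store' });
  const buffer = await response.arrayBuffer();
  return buffer.byteLength;
}
```

In the browser console this could be called as `refetchByteLength(document.querySelector('img').src).then(console.log)`. A result of 0 for an image that is visibly rendered would be consistent with the one-time URL guess; a normal size would point toward the cache explanation instead.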

Balearica commented 1 year ago

In both versions 2.1.5 and 4.0.3, running recognition on an <img> element simply results in the src attribute being recognized, so (in this case) the input ultimately ends up being a URL string in both versions. While changes have been made to how image URLs are processed since version 2.1.5, those changes were made to resolve a major bug in Tesseract.js (see #604), so reverting to what that code was before is not an option.

I would be open to merging changes to the loadImage function if you investigate and come up with an improvement. However, I don't think the current behavior constitutes a bug (given that it is simply fetching an input URL), and I don't personally have time to investigate exactly how the changes caused the behavior you describe.

Jimmy-Z commented 1 year ago

Thanks for the explanation.

After reading the API docs, I found that Tesseract supports canvas elements too, so I was able to write a workaround: create a canvas, draw the img onto it (basically the Stack Overflow post I referenced above), then call Tesseract on that canvas. This solved my issue with newer versions of Tesseract.
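
For anyone hitting the same thing, the workaround described above could look roughly like the sketch below. It is not the exact code from the user script: `imgToCanvas` and the `#captcha` selector are hypothetical names, and it assumes the image has finished loading and is same-origin (a cross-origin image without CORS headers would taint the canvas, and reading it back would fail).

```javascript
// Hypothetical helper: copy an already-decoded <img> onto a canvas so that OCR
// reads pixels from the DOM instead of requesting img.src a second time.
// The doc parameter defaults to the page's document and is injectable for testing.
function imgToCanvas(img, doc = document) {
  const canvas = doc.createElement('canvas');
  // naturalWidth/naturalHeight are the intrinsic size, independent of CSS scaling.
  canvas.width = img.naturalWidth || img.width;
  canvas.height = img.naturalHeight || img.height;
  canvas.getContext('2d').drawImage(img, 0, 0);
  return canvas;
}

// Usage in the browser, assuming Tesseract.js is already loaded (e.g. via @require)
// and '#captcha' is a placeholder selector for the target image:
//
//   const img = document.querySelector('#captcha');
//   const { data: { text } } = await Tesseract.recognize(imgToCanvas(img), 'eng');
```

Because the canvas is filled from the image the browser already decoded, no second request for the src URL is needed.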