tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Error during processing of HEIC input files #2930

Open robskrob opened 4 years ago

robskrob commented 4 years ago

Environment

Current Behavior:

Whenever I execute $ tesseract images/IMG_3958.HEIC output/grocery_bill I get this error:

$ tesseract images/IMG_3958.HEIC output/grocery_bill
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Error during processing.

Expected Behavior:

I would expect tesseract to output the text from the grocery bill into the output file output/grocery_bill.

Is there something wrong with processing HEIC images? Also, is there a location where I can tail the logs to see if I can get a richer description of the error?

Here's more information about the tesseract program that I installed with Homebrew:

$ tesseract -v
tesseract 4.1.1
 leptonica-1.79.0
  libgif 5.2.1 : libjpeg 9d : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found FMA
 Found SSE

Also attached please find the image I had tesseract process. IMG_3958.HEIC.zip

stweil commented 4 years ago

You closed that, so what was the solution?

robskrob commented 4 years ago

@stweil I apologize for closing without an explanation.

The problem is that the image is not a JPG; it is a HEIC. When I take a photo with my iPhone, upload it to Google Drive and then download it onto my machine, the image is saved as a HEIC file, which to be honest is a new file extension to me.

So the "solution" for me was to convert the file on my machine from HEIC to JPG. Once I did this tesseract had no problem processing the image.

I suppose then my original issue is still a bug -- except there's nothing wrong with the JPG file. It's the HEIC file:

$ tesseract images/IMG_3958.HEIC output/grocery_bill
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Error during processing.

Attached please find the HEIC file

IMG_3958.HEIC.zip

stweil commented 4 years ago

Thanks. So the error message should be improved and report that the input file could not be read. That is not specific to macOS, therefore I changed the title.

robskrob commented 4 years ago

Yeah I wanted to tail the logs of my tesseract process so I could potentially learn more about what was going on. I definitely think the error message could be improved. And yes, there's something about the input file that appears to not be readable. And yes, changing the title of the issue makes sense.

stweil commented 4 years ago

See the Wikipedia article on HEIC or HEIF. Maybe Leptonica can be extended to support that new format (so Tesseract would support it, too). There exists a C library libheif which might be used. @DanBloomberg, would that be interesting?

stweil commented 4 years ago

It looks like the format is covered by patents. Debian provides libheif nevertheless, and GIMP supports it. I don't know whether its use in Leptonica and Tesseract would be a problem.

2021-12-04: libheif uses the GNU Lesser General Public License which should be compatible with Leptonica. See https://github.com/strukturag/libheif/.

DanBloomberg commented 4 years ago

I've asked an expert on coding about this format, and will report here when I hear back.

For leptonica to support it is a high bar, because support is a serious commitment and invariably a lot of work. That work will include passing very intensive fuzzing, both internally at Google and externally by oss-fuzz on github.

amitdo commented 3 years ago

Maybe Leptonica/Tesseract can support gdk-pixbuf, which will bring indirect support for heic and avif.

DanBloomberg commented 3 years ago

Each new format supported demands a big cost in development and support effort.

As mentioned on leptonica#546, jpeg-xl seems more interesting: both easier to support and having compatibility with jpeg.

amitdo commented 3 years ago

Dan, I understand your view.

The thing is, these two formats (HEIC and AVIF) are becoming quite popular. HEIC is the default file format for photos in iOS. Recently, Firefox followed Chrome's lead, and it now supports AVIF by default (previously users had to enable it manually).

So, since you do not want to support these formats in your software, maybe we, Tesseract devs, should discuss whether we want to support them ourselves.

If we decide to support them, we will convert them to Leptonica's pix.

To be honest, this was mainly directed toward @stweil, hoping that he would want this enough to implement it... :-)

DanBloomberg commented 3 years ago

Supporting AVIF and HEIC within tesseract is certainly an option. As I said, my experience with the older I/O libraries is that it's a lot of work. webp has been easier, because there are not a lot of options and the implementation is more "modern", with the basic encoder and decoder going between memory and not file streams (or, worse, with tiff, unix file descriptors). With the fuzzers that have been made for leptonica, both internally in Google and now externally with the oss-fuzz project, maintenance work has increased considerably.

But overall, maintenance on the I/O libraries has been a significant time-sink. I'm glad to have been able to relieve Ray Smith and tesseract of that burden.

I can't promise anything about jpeg-xl, but it does seem to be something I should look into.

amitdo commented 3 years ago

I'm glad to have been able to relieve Ray Smith and tesseract of that burden.

Last time we saw Ray here was 2 years ago. I don't know if he plans to return to contribute to this project.

amitdo commented 2 years ago

So, since you do not want to support these formats in your software, maybe we, Tesseract devs, should discuss whether we want to support them ourselves.

Answering myself: We should not do it.

amitdo commented 2 years ago

Still, a proper error message for unsupported image formats is desired.

DanBloomberg commented 2 years ago

pixReadStream() emits the message "Unknown format: no pix returned" if the format is not supported for reading.

I believe that tesseract should not rely on leptonica error messages -- you might not even emit them by default. Instead, tesseract should have its own error message if the image file can't be read to a pix.
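As a rough illustration of what such a dedicated message could look like from the caller's side, here is a minimal Python sketch of a pre-flight check that a wrapper script might run before invoking tesseract. It is not code from Tesseract or Leptonica. The function names (`sniff_unsupported`, `preflight`) are hypothetical, and the brand lists are an assumed, non-exhaustive selection of `ftyp` major brands used by HEIF/HEIC and AVIF files.

```python
# Assumed, non-exhaustive major brands found in the "ftyp" box of
# ISO base media files that carry HEIF/HEIC or AVIF images.
HEIF_BRANDS = {b"heic", b"heix", b"hevc", b"hevx", b"mif1", b"msf1"}
AVIF_BRANDS = {b"avif", b"avis"}


def sniff_unsupported(header: bytes):
    """Return 'HEIC/HEIF' or 'AVIF' if the header looks like one, else None."""
    # Bytes 0-3 hold the box size, bytes 4-7 the box type, bytes 8-11 the major brand.
    if len(header) < 12 or header[4:8] != b"ftyp":
        return None
    brand = header[8:12]
    if brand in HEIF_BRANDS:
        return "HEIC/HEIF"
    if brand in AVIF_BRANDS:
        return "AVIF"
    return None


def preflight(path: str):
    """Return a readable error string for known-unsupported input, else None."""
    with open(path, "rb") as f:
        kind = sniff_unsupported(f.read(12))
    if kind is None:
        return None
    return f"{path}: {kind} input is not supported; convert it to JPEG, PNG or TIFF first"
```

A wrapper could call `preflight(path)` and print the returned string instead of passing the file on and getting only "Error during processing." The check looks at file content rather than the extension, so a HEIC file renamed to .jpg would still be caught.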

zdenop commented 2 years ago

The funny part is that the OCR process for this image fails here: https://github.com/tesseract-ocr/tesseract/blob/b5878c23a70a6709e722aea5a3304a4c5c87313b/src/api/baseapi.cpp#L974-L976

Because when the file format is unknown (IFF_UNKNOWN), the tesseract API(???) assumes the input is a file list:

https://github.com/tesseract-ocr/tesseract/blob/b5878c23a70a6709e722aea5a3304a4c5c87313b/src/api/baseapi.cpp#L1201-L1210

I think we should remove this "guess what is the input" out of OCR API...

vid-bin commented 10 months ago

Is there any update on this? My entire photo library is heic (including screenshots). I’ve been trying to get OCR working with heic on fscrawler and Nextcloud with no success.

OCR works on my Mac because Apple has support for it in their system but I want to get this working on my NAS.

The only solution would be to convert my library to jpeg which I do not want to do.

amitdo commented 10 months ago

We don't plan to support the HEIC format.

stweil commented 10 months ago

No, there isn't any update, neither for Leptonica nor for Tesseract. The only solution is currently to convert from HEIC to JPEG, so Tesseract can process the JPEG file.

vid-bin commented 10 months ago

Does tesseract support AVIF or jpeg-xl then? I don’t want to convert to jpeg because I’ll lose HDR on my photos. The storage space is also significantly higher when converting my images.

Ideally the applications indexing the files would convert to jpeg on the fly for tesseract and then delete the temporary file when done but the ones I’m trying to use do not do this.
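The convert-on-the-fly idea described above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not something fscrawler, Nextcloud or Tesseract provide. It assumes libheif's `heif-convert` command-line tool is installed and on PATH (in some libheif versions the tool may be named `heif-dec` instead), and the helper names `convert_with_heif_convert`, `run_tesseract` and `ocr_via_temp_jpeg` are hypothetical.

```python
import os
import subprocess
import tempfile


def convert_with_heif_convert(src, dst):
    # Assumption: libheif's `heif-convert` tool is available on PATH.
    subprocess.run(["heif-convert", src, dst], check=True)


def run_tesseract(image_path):
    # Using "stdout" as the output base makes tesseract print the recognised text.
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def ocr_via_temp_jpeg(src, convert=convert_with_heif_convert, ocr=run_tesseract):
    """Convert src to a temporary JPEG, OCR it, and always delete the temp file."""
    fd, tmp = tempfile.mkstemp(suffix=".jpg")
    os.close(fd)  # the converter reopens the path itself
    try:
        convert(src, tmp)
        return ocr(tmp)
    finally:
        # Clean up even if conversion or OCR raised an exception.
        if os.path.exists(tmp):
            os.remove(tmp)
```

The original HEIC file is left untouched, so HDR data and storage savings are preserved in the library; only the short-lived JPEG copy is lossy. The converter and OCR steps are passed in as callables so a different converter can be swapped in.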

DanBloomberg commented 10 months ago

We've looked a couple of times at jpeg-xl. There is a very high bar to cross before deciding to support a new format with leptonica, and the reasons for this have been described several times. AVIF doesn't meet it. Initially, jpeg-xl was an interesting possibility because it was supported by Google, it is a significantly more efficient encoder than jpeg, and it is designed to have very good compatibility with the jpeg library. All these things meant that there was some likelihood of widespread adoption within a few years, essentially as a replacement for jpeg, and the work of supporting it in leptonica would be less than that for a completely new format.

Just a reminder: part of the work to support any new format is to build and run fuzzers for months, and to harden it not to crash or hang for any possible input.

As of 2022, however, it was evident that Google was not supporting it. For details why it is no longer of interest, see: https://github.com/DanBloomberg/leptonica/issues/692#issuecomment-1501259293

amitdo commented 10 months ago

https://tesseract-ocr.github.io/tessdoc/InputFormats

DanBloomberg commented 10 months ago

For the tesseract input formats page, you might clarify that reading animated webp (a-webp) is not supported -- only writing.

I lost interest in fully supporting a-webp when I found that Google had so little interest in supporting their own format (which is far superior in compression to animated gif) that even with several billion gmail accounts, they didn't let you view an a-webp attachment! Only the first frame is displayed. Nevertheless, a-webp is supported in browsers; see the Google webp faq: https://developers.google.com/speed/webp/faq

amitdo commented 10 months ago

Hi Dan,

Thanks for the info. Is the description on that page in the 'Animated GIF' section also correct for animated WebP?

DanBloomberg commented 10 months ago

No, leptonica cannot read an animated webp file. It does not return the first image.

Here are the error messages:

Error in pixReadMemWebP: WebP decode failed
Error in pixReadStream: webp: no pix returned