sirfz / tesserocr

A Python wrapper for the tesseract-ocr API
MIT License

What are the `page_index` and `filename` arguments in ProcessPage() ? #167

Closed by munikarmanish 5 years ago

munikarmanish commented 5 years ago

I'm trying to convert a PIL Image into a searchable PDF. For image files, ProcessPages(outbase, image_filename) works perfectly. For a PIL Image, ProcessPage() seems to be the equivalent method, but it takes two additional arguments, page_index and filename. I tried setting:

It generated a corrupt PDF file. Can anyone please help me with the proper usage of the ProcessPage() method?

Some info that might be helpful:

sirfz commented 5 years ago

According to the docs, filename and page_index are metadata used by side-effect processes, such as reading a box file or formatting as hOCR.

I haven't worked with multi-page TIFF images myself, but I believe what you need to do is use PIL's seek method to iterate over the pages in the TIFF. Something like:

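# Assumes `img` is an open multi-page PIL image and `api` is an initialized PyTessBaseAPI.
pages = 3  # number of pages in the TIFF
for page in range(pages):
    img.seek(page)  # make this page the current frame
    # page_index and filename (last two arguments) are metadata only
    api.ProcessPage('output{}'.format(page), img, page, 'page{}'.format(page))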
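
The loop above could be wrapped in a small helper so the page count is not hard-coded. This is only a sketch: the name process_pages and its outbase/source_name parameters are made up for illustration. It assumes `api` is an initialized tesserocr.PyTessBaseAPI and `img` is an open PIL image, and it reads the page count from PIL's n_frames attribute. It only shows how page_index and filename are passed, and does not address why the PDF came out corrupt.

```python
def process_pages(api, img, outbase, source_name="page"):
    """Call api.ProcessPage once per frame of a (possibly multi-page) image.

    Hypothetical helper: `api` is assumed to be an initialized
    tesserocr.PyTessBaseAPI and `img` an open PIL image. Both are passed
    in by the caller, so nothing needs to be imported here.
    """
    # PIL sets n_frames on multi-frame images; fall back to a single page.
    n_pages = getattr(img, "n_frames", 1)
    results = []
    for page in range(n_pages):
        img.seek(page)  # make `page` the current frame
        # page_index and filename are metadata only, per the docs quoted above.
        ok = api.ProcessPage(
            "{}{}".format(outbase, page),      # output file name without extension
            img,
            page,                               # page_index
            "{}{}".format(source_name, page),  # filename (metadata)
        )
        results.append(ok)
    return results
```

Called as process_pages(api, img, 'output'), this makes the same calls as the loop above, one output base per page, and returns whatever each ProcessPage call returned.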