tberg12 / ocular

Ocular is a state-of-the-art historical OCR system.
GNU General Public License v3.0

Can't handle JBIG2 pdf images #19

Open bholtdwyer opened 1 year ago

bholtdwyer commented 1 year ago

When I run the training step on some .pdfs of historical Indian census files, I get the following error:

Extracting text line images from ../data/district_reports/raw_pdfs/1981/27582_1981_MAI.pdf, page 3
Error reading image
com.sun.pdfview.PDFParseException: Unknown coding method:JBIG2Decode

and then

java.lang.NullPointerException
    at com.sun.pdfview.font.TTFFont.getOutline(TTFFont.java:170)
    at com.sun.pdfview.font.CIDFontType2.getOutline(CIDFontType2.java:270)
    at com.sun.pdfview.font.OutlineFont.getGlyph(OutlineFont.java:130)
    at com.sun.pdfview.font.PDFFont.getCachedGlyph(PDFFont.java:308)
    at com.sun.pdfview.font.PDFFontEncoding.getGlyphFromCMap(PDFFontEncoding.java:155)
    at com.sun.pdfview.font.PDFFontEncoding.getGlyphs(PDFFontEncoding.java:115)
    at com.sun.pdfview.font.PDFFont.getGlyphs(PDFFont.java:274)
    at com.sun.pdfview.PDFTextFormat.doText(PDFTextFormat.java:269)
    at com.sun.pdfview.PDFParser.iterate(PDFParser.java:752)
    at com.sun.pdfview.BaseWatchable.run(BaseWatchable.java:101)
    at java.base/java.lang.Thread.run(Thread.java:834)

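I think what's going on here is that the PDF contains JBIG2-compressed images (the `JBIG2Decode` filter named in the first error), and the PDF library in the stack trace (`com.sun.pdfview`) doesn't know how to decode them.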
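One way to check that guess before a long training run is to look for the filter name in the raw PDF bytes. The following is a minimal Python sketch, not part of Ocular: the helper names `uses_jbig2` and `find_jbig2_pdfs` are made up for illustration, and the byte search is only a heuristic (it assumes the filter name appears literally in the file, which is typical for image stream dictionaries but not guaranteed).

```python
from pathlib import Path

# Name of the PDF stream filter reported in the PDFParseException above.
JBIG2_FILTER = b"/JBIG2Decode"


def uses_jbig2(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the raw PDF bytes mention the JBIG2Decode filter."""
    return JBIG2_FILTER in pdf_bytes


def find_jbig2_pdfs(directory):
    """Return the .pdf files under `directory` whose bytes mention JBIG2Decode."""
    return sorted(
        path
        for path in Path(directory).rglob("*.pdf")
        if uses_jbig2(path.read_bytes())
    )
```

Calling `find_jbig2_pdfs("../data/district_reports/raw_pdfs")` would then list the files likely to trigger the error, so they can be handled separately.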

taineleau commented 1 year ago

Maybe try to convert the images first?
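A rough sketch of what that conversion could look like, assuming the `pdftoppm` tool from Poppler is installed and that it can decode the JBIG2 images in these files (not verified against this dataset). The function names and the 300 dpi default are illustrative choices, not anything Ocular prescribes:

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, out_dir, dpi=300):
    """Build the pdftoppm call that renders each page of a PDF to a PNG.

    Output files are named <out_dir>/<pdf stem>-<page number>.png.
    The 300 dpi default is an assumption, not an Ocular requirement.
    """
    pdf_path = Path(pdf_path)
    prefix = Path(out_dir) / pdf_path.stem
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(prefix)]


def convert_pdf_to_pngs(pdf_path, out_dir, dpi=300):
    """Render the PDF to page images and return the PNG paths that were written."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if pdftoppm exits with an error.
    subprocess.run(pdftoppm_command(pdf_path, out_dir, dpi), check=True)
    return sorted(Path(out_dir).glob(f"{Path(pdf_path).stem}-*.png"))
```

The resulting directory of PNGs could then be used as the input instead of the original PDFs, which sidesteps the PDF decoding step entirely.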

bholtdwyer commented 1 year ago

Fair enough; since JBIG2 seems to be a rarely used image encoding within PDFs, it may not be worth the time to add support for it to the package. Thanks for making it available!

On Tue, Jun 6, 2023 at 1:18 PM Danlu Chen wrote:

> Maybe try to convert the images first?
