jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

no root object error #419

Closed shm007g closed 3 years ago

shm007g commented 3 years ago

Describe the bug

Opening a regular PDF file fails with a "No /Root object" error.

Code to reproduce the problem

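    import os
    import pdfplumber

    pdf_dir = "."  # placeholder: directory containing the PDF files

    for p in os.listdir(pdf_dir):
        # Skip anything that is not a PDF
        if not p.endswith('.pdf'):
            continue
        with open(os.path.join(pdf_dir, p), 'rb') as f:
            with pdfplumber.open(f) as pdf:
                first_page = pdf.pages[0]
                print(first_page.extract_text())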

PDF file

20000101.pdf

Expected behavior

The text of the first page is extracted from the PDF file.

Actual behavior

A PDFSyntaxError is raised instead.

Screenshots

Traceback (most recent call last):
  File "/home/xxx/anaconda3/bin/pdfplumber", line 8, in <module>
    sys.exit(main())
  File "/home/xxx/anaconda3/lib/python3.6/site-packages/pdfplumber/cli.py", line 49, in main
    with PDF.open(args.infile, pages=args.pages) as pdf:
  File "/home/xxx/anaconda3/lib/python3.6/site-packages/pdfplumber/pdf.py", line 60, in open
    return cls(path_or_fp, **kwargs)
  File "/home/xxx/anaconda3/lib/python3.6/site-packages/pdfplumber/pdf.py", line 33, in __init__
    self.doc = PDFDocument(PDFParser(stream), password=password)
  File "/home/xxx/anaconda3/lib/python3.6/site-packages/pdfminer/pdfdocument.py", line 572, in __init__
    raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
pdfminer.pdfparser.PDFSyntaxError: No /Root object! - Is this really a PDF?

Environment


pdfminer.six==20200517
pdfplumber==0.5.27
Python: 3.6.4
OS: CentOS

samkit-jain commented 3 years ago

Hi @shm007g, thanks for your interest in the library. This is not a bug in the library; the error occurs because the PDF is not correctly formed (as per the PDF specification). You can try repairing the PDF with Ghostscript and then opening the repaired file. The following command does the repair:

gs -o output.pdf -sDEVICE=pdfwrite input.pdf

I repaired the PDF and you can download it from here.
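
For a whole directory of files, the same Ghostscript repair can be driven from Python. The following is a minimal sketch, assuming the gs executable is installed and on PATH. The helper names build_repair_command and repair_pdf are hypothetical and not part of pdfplumber.

```python
import shutil
import subprocess


def build_repair_command(src, dst):
    """Return the Ghostscript argument list matching the command above."""
    return ["gs", "-o", dst, "-sDEVICE=pdfwrite", src]


def repair_pdf(src, dst):
    """Rewrite src into dst with Ghostscript's pdfwrite device.

    Assumption: the `gs` executable is installed and on PATH.
    """
    if shutil.which("gs") is None:
        raise RuntimeError("Ghostscript (gs) was not found on PATH")
    # check=True raises CalledProcessError if gs exits with a non-zero status
    subprocess.run(build_repair_command(src, dst), check=True)
    return dst
```

Calling repair_pdf("input.pdf", "output.pdf") and then passing output.pdf to pdfplumber.open mirrors the manual steps.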

shm007g commented 3 years ago

Thanks very much! Can you tell me how you repaired it? I get a "Can't find CID font" error and have not been able to fix it for a day.

gs -o output.pdf -sDEVICE=pdfwrite input.pdf

Page 1
Can't find CID font "����".
Attempting to substitute CID font /Adobe-GB1 for /����, see doc/Use.htm#CIDFontSubstitution.
The substitute CID font "Adobe-GB1" is not provided either. attempting to use fallback CIDFont.See doc/Use.htm#CIDFontSubstitution.
Loading a TT font from /usr/local/Cellar/ghostscript/9.53.3_1/share/ghostscript/9.53.3/Resource/CIDFSubst/DroidSansFallback.ttf to emulate a CID font Adobe-GB1 ... Done.
Can't find CID font "SimSun".
Attempting to substitute CID font /Adobe-GB1 for /SimSun, see doc/Use.htm#CIDFontSubstitution.

samkit-jain commented 3 years ago

Yes, I received the same message, but I was still able to extract all the text, so I think the output file should be good to go. Did you find any issues with it?

shm007g commented 3 years ago

Some tokens are lost or changed to invalid ones. I also tried a merge-style command:

gs -q -sDEVICE=pdfwrite -dBATCH -sOUTPUTFILE=${line%.*}_mod.pdf -dNOPAUSE "${line}"

It has the same problem, but does not print these font errors.
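
One rough way to quantify lost tokens after a repair is to scan the extracted text for two common markers: the Unicode replacement character (U+FFFD) and the (cid:N) placeholders that pdfminer emits for glyphs it cannot map to Unicode. The following is a minimal sketch with a hypothetical helper name, count_suspect_tokens. It only flags these two patterns and will not detect characters that font substitution silently swapped for other valid characters.

```python
import re

# pdfminer writes "(cid:N)" for glyphs it cannot map to Unicode
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def count_suspect_tokens(text):
    """Count markers that usually indicate lost or unmapped characters."""
    return {
        "cid_placeholders": len(CID_PATTERN.findall(text)),
        "replacement_chars": text.count("\ufffd"),
    }
```

Running it on the result of first_page.extract_text() for each repaired file gives a quick per-file signal of which outputs need a closer look.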