pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Error when loading pdf using python BytesIO: object has no attribute 'page_count' #38

Closed danmb1979 closed 5 months ago

danmb1979 commented 5 months ago

Python 3.10.14, pymupdf4llm version 0.5

I am trying to read a PDF from an S3 bucket (file_content in the code below) and then run pymupdf4llm on it, but I get an error when I pass the content as a BytesIO object. The same PDF works fine when loaded from local disk (i.e. without BytesIO).

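import boto3
from io import BytesIO
import pymupdf4llm

s3 = boto3.client("s3")  # assumption: the original post does not show how s3 is created

try:
    file_content = s3.get_object(Bucket=XXXXX, Key=XXXX)['Body'].read()
except Exception as e:
    print(e)
    print(f"""Error getting object {XXXX} from bucket {XXXX}. Make sure they exist and your bucket is in the same region as this function.""")
    raise e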

md_file = pymupdf4llm.to_markdown(BytesIO(file_content)) 

# AttributeError: '_io.BytesIO' object has no attribute 'page_count'

JorjMcKie commented 5 months ago

For the time being, input documents must be given either as a pathname (in string format) or as a PyMuPDF Document. If you have a document in some binary format (bytes / io.BytesIO), you must open it as a Document first and pass that Document to pymupdf4llm.
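The workaround described above could look roughly like the sketch below. It assumes a PyMuPDF release that can be imported as `pymupdf` (older releases use the name `fitz`), and the helper names `as_bytes` and `markdown_from_memory` are hypothetical, introduced only for illustration. The in-memory data is opened with `pymupdf.open(stream=..., filetype="pdf")`, and the resulting Document is handed to `pymupdf4llm.to_markdown`.

```python
from io import BytesIO


def as_bytes(source):
    """Return raw bytes from bytes, bytearray, or an io.BytesIO object."""
    if isinstance(source, BytesIO):
        return source.getvalue()
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    raise TypeError(f"unsupported source type: {type(source).__name__}")


def markdown_from_memory(source, filetype="pdf"):
    """Open in-memory PDF data as a PyMuPDF Document and convert it to Markdown."""
    # Imported here so that as_bytes stays usable without PyMuPDF installed.
    import pymupdf  # assumption: older PyMuPDF releases expose this module as `fitz`
    import pymupdf4llm

    # A Document opened from memory needs the filetype, since there is no file name.
    doc = pymupdf.open(stream=as_bytes(source), filetype=filetype)
    try:
        return pymupdf4llm.to_markdown(doc)
    finally:
        doc.close()
```

With the snippet from the question, the failing call would then become `md_file = markdown_from_memory(file_content)`, where `file_content` is the bytes object read from S3.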