PDF Failures - Githubissues

brimwats1 commented 2 years ago

Hello!

I tried two different PDFs (which I can share in a DM or an email). One was an older, pre-computer PDF that had been professionally OCRed and highlighted. The other was a modern PDF of a book published last year, also highlighted. Zotero was able to extract from both using pdfjs. When I use https://huggingface.co/spaces/paulbricman/decontextualizer I get:

File "/home/user/.local/lib/python3.8/site-packages/streamlit/script_runner.py", line 354, in _run_script
    exec(code, module.__dict__)
File "/home/user/app/main.py", line 17, in <module>
    components.add_section()
File "/home/user/app/components.py", line 46, in add_section
    excerpts = pdf_to_excerpts(filename)
File "/home/user/app/processing.py", line 48, in pdf_to_excerpts
    excerpt = extract_annot(annot, words)
File "/home/user/app/processing.py", line 24, in extract_annot
    quad_count = int(len(quad_points) / 4)
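
Editor's note: the trace stops at the `quad_count` line without showing the exception itself, so the actual cause is unknown. One plausible guess is that the annotation carried no quad points (`None` or an empty list), or a point count that is not a multiple of four. The sketch below shows a defensive way to group points into quads under that assumption. The name `split_quads` and the flat list-of-points input shape are hypothetical, not taken from the project's `processing.py`.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def split_quads(quad_points: Optional[Sequence[Point]]) -> List[Tuple[Point, ...]]:
    """Group a flat sequence of points into 4-point quads.

    Hypothetical helper: assumes each highlighted region is described by
    four consecutive (x, y) points. Missing input yields an empty list,
    and any trailing points that do not fill a whole quad are ignored.
    """
    if not quad_points:
        # Covers both None and an empty sequence.
        return []
    quad_count = len(quad_points) // 4  # integer division, no float round-trip
    return [tuple(quad_points[i * 4:(i + 1) * 4]) for i in range(quad_count)]
```

With this shape, an annotation without usable quad points simply contributes no excerpt instead of aborting the whole run.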

paulbricman commented 2 years ago

Hi @brimwats, could you please (1) send me the PDFs either here or by email (listed on my GitHub profile), and (2) also include what I suspect is a missing line from the error's trace, namely the error itself?

brimwats1 commented 2 years ago

Re 1) Sure, sending in a moment.

Re 2) I don't get any error beyond the trace. I used the online version you've linked to: https://huggingface.co/spaces/paulbricman/decontextualizer

paulbricman commented 2 years ago

Got the email, thanks for the prompt reply! I'll try processing the PDFs a few days into January. Happy holidays till then!

paulbricman commented 2 years ago

Hi, @brimwats! So I pushed a version which handles the extraction part a bit better than before. However, from what I've seen, the model has a hard time working with 3+ sentence highlights, and works best with 1-2 sentence highlights. I'm afraid the current version of the tool won't be of much use in your situation :disappointed:
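
Editor's note: for anyone who wants to check in advance which highlights fall in the 1-2 sentence range mentioned above, here is a rough sketch. It is not part of the tool. `count_sentences` and `is_short_highlight` are hypothetical helpers, and the punctuation-based splitting is deliberately naive, so abbreviations such as "e.g." will be miscounted.

```python
import re

# A sentence end is terminal punctuation followed by whitespace or end of text.
_SENTENCE_END = re.compile(r"[.!?]+(?:\s+|$)")
_ENDS_WITH_PUNCT = re.compile(r"[.!?]+$")


def count_sentences(text: str) -> int:
    """Naively count sentences in a highlight by terminal punctuation."""
    stripped = text.strip()
    if not stripped:
        return 0
    count = len(_SENTENCE_END.findall(stripped))
    # A trailing fragment without terminal punctuation still counts as one.
    if not _ENDS_WITH_PUNCT.search(stripped):
        count += 1
    return count


def is_short_highlight(text: str, max_sentences: int = 2) -> bool:
    """Return True if the highlight has between 1 and max_sentences sentences."""
    return 0 < count_sentences(text) <= max_sentences
```

Highlights that fail this check could be trimmed or split in the PDF reader before the file is uploaded.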

paulbricman / decontextualizer

PDF Failures #2