shmublu / pdf2txt


Handling Nested Lists #1

Open YoshikiTakashima opened 1 month ago

YoshikiTakashima commented 1 month ago

Hi Shmuel.

I'm Yoshiki, Scott's postdoc.

Is it possible for your tool to handle nested lists? Here's an example: example.pdf

python pdf2text.py example.pdf  ./output.txt --max_pages 50 --merge_headers False
Ignoring wrong pointing object 7 0 (offset 0)
Ignoring wrong pointing object 9 0 (offset 0)
Ignoring wrong pointing object 12 0 (offset 0)
Ignoring wrong pointing object 7 0 (offset 0)
Ignoring wrong pointing object 9 0 (offset 0)
Ignoring wrong pointing object 12 0 (offset 0)
Traceback (most recent call last):
  File "/Users/yoshikitakashima/pdf2txt/pdf2text.py", line 157, in <module>
    main()
  File "/Users/yoshikitakashima/pdf2txt/pdf2text.py", line 154, in main
    parse_pdf(args.pdf_path, args.output_path, args.max_pages, args.merge_headers)
  File "/Users/yoshikitakashima/pdf2txt/pdf2text.py", line 107, in parse_pdf
    median_length = sorted(line_lengths)[len(line_lengths) // 2]
                    ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: list index out of range

I can't find the example file that worked, but even when the tool runs successfully, it drops the nesting relationships between list items. We need those preserved for our use case.

Thanks ~Yoshiki

shmublu commented 1 month ago

Have you tried using the OCR script? It's more expensive because it requires LLM API access, but you should get much better results. I can take a look at this later, but I had difficulty getting this type of nesting relationship to show up in the final txt.

shmublu commented 1 month ago

I took a look and unfortunately it seems to be a problem with the underlying parser I am using. It doesn't detect any text in the PDF at all... I added an error message. Did you have any luck with the OCR version?
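For anyone hitting the same traceback: the `IndexError` comes from taking the median of an empty `line_lengths` list, which happens when the parser extracts no text. Below is a minimal sketch of the kind of guard that turns this into a readable error. The helper name `median_line_length` and the message wording are hypothetical and not necessarily what was committed to pdf2text.py.

```python
def median_line_length(line_lengths):
    """Return the upper median of line_lengths, or raise a clear error if empty.

    Hypothetical helper for illustration; not the actual pdf2text.py code.
    """
    if not line_lengths:
        # An empty list means the parser found no text lines at all.
        raise ValueError(
            "No text was extracted from the PDF; it may be scanned or "
            "image-only. Try the OCR script instead."
        )
    # Same upper-median expression that appears in the traceback.
    return sorted(line_lengths)[len(line_lengths) // 2]
```

With a guard like this, an input with no detectable text fails with an explanatory message instead of `list index out of range`.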

YoshikiTakashima commented 1 month ago

OCR works.