jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Fetch non tabular data from PDF #276

Closed by coder0028 4 years ago

coder0028 commented 4 years ago

I am trying to extract both tabular and non-tabular text from a PDF. Fetching the tabular information is straightforward with page.extract_tables(), but I couldn't find a way to get the rest of the text from the PDF. Is there something I am missing?

samkit-jain commented 4 years ago

Hi @coder0028, thanks for your interest in the library. What type of non-tabular data do you want to extract? You can extract the text with page.extract_text() and the individual words with page.extract_words(). If you want only the non-tabular text, you can post-process the result to drop the words that fall inside a table's bounding box (the .filter() method is also worth a look). If you want to extract form values, have a look at https://github.com/jsvine/pdfplumber#extracting-form-values.
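For illustration, the post-processing idea could look roughly like the sketch below. It is not code from this thread: the helper names (center_in_bbox, words_outside_tables, non_table_words) are made up for the example, and it assumes that a word belongs to a table when the word's center point lies inside the table's bounding box. It relies on page.find_tables(), the .bbox attribute of each table, and page.extract_words().

```python
def center_in_bbox(obj, bbox):
    """Return True if the center point of obj lies inside bbox.

    obj is a dict with "x0", "x1", "top", "bottom" keys (the shape of the
    dicts returned by page.extract_words()); bbox is (x0, top, x1, bottom).
    """
    x0, top, x1, bottom = bbox
    h_mid = (obj["x0"] + obj["x1"]) / 2
    v_mid = (obj["top"] + obj["bottom"]) / 2
    return x0 <= h_mid <= x1 and top <= v_mid <= bottom


def words_outside_tables(words, table_bboxes):
    """Keep only the words whose center is outside every table bounding box."""
    return [
        word
        for word in words
        if not any(center_in_bbox(word, bbox) for bbox in table_bboxes)
    ]


def non_table_words(path, page_number=0):
    """Hypothetical helper: open a PDF and return the non-table words of one page."""
    # Imported here so the pure helpers above can be used without pdfplumber.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number]
        table_bboxes = [table.bbox for table in page.find_tables()]
        return words_outside_tables(page.extract_words(), table_bboxes)
```

Joining the "text" values of the returned words gives the non-tabular text in reading order, although the original line breaks are lost; whether the center-point rule is strict enough depends on how tightly the detected table boxes fit.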

coder0028 commented 4 years ago

Thanks @samkit-jain for the response.

The PDF consists of tables (tabular) plus headings and paragraphs (non-tabular). I am able to fetch the table data correctly, but how can I get the text of the remaining paragraphs and headings? Also, could you please help with the .filter() method?

samkit-jain commented 4 years ago

Have a look at https://github.com/jsvine/pdfplumber/issues/242#issuecomment-668448246. You can refer to it as an example of using .filter() and of getting the data outside a table.
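As a rough idea of what a .filter()-based approach can look like (a sketch under assumptions, not a copy of the linked comment): page.filter() takes a function that receives each page object as a dict and keeps the objects for which it returns True. The names make_outside_test and text_outside_tables are hypothetical, and the sketch assumes every object passed to the test carries x0, x1, top and bottom coordinates.

```python
def make_outside_test(bboxes):
    """Build a test function for page.filter() that rejects objects in any bbox.

    bboxes is a list of (x0, top, x1, bottom) tuples, e.g. table bounding boxes.
    """

    def outside_all(obj):
        # Use the center of the object so that characters touching a table
        # border are assigned to one side only.
        h_mid = (obj["x0"] + obj["x1"]) / 2
        v_mid = (obj["top"] + obj["bottom"]) / 2
        return not any(
            x0 <= h_mid < x1 and top <= v_mid < bottom
            for (x0, top, x1, bottom) in bboxes
        )

    return outside_all


def text_outside_tables(path, page_number=0):
    """Hypothetical helper: text of one page with the table regions filtered out."""
    import pdfplumber  # imported lazily; make_outside_test itself has no dependency

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number]
        bboxes = [table.bbox for table in page.find_tables()]
        return page.filter(make_outside_test(bboxes)).extract_text()
```

Because the filter runs on individual characters before the text is assembled, extract_text() on the filtered page keeps the line structure of the remaining paragraphs and headings.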

If you could share the PDF (after redacting any sensitive information) along with your expected output, I could help more precisely. Without the PDF, I can only suggest filtering on properties such as the font name or the character height to get the data you need.
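A sketch of that kind of filtering, for illustration only: character objects in pdfplumber carry "fontname" and "size" keys, so a test function for page.filter() can select, say, heading-like characters. The threshold of 14 points and the "Bold" keyword are placeholder assumptions that would have to be tuned to the actual PDF, and make_heading_test and extract_headings are hypothetical names.

```python
def make_heading_test(min_size=14.0, font_keyword="Bold"):
    """Build a page.filter() test that keeps large or bold-looking characters.

    min_size and font_keyword are illustrative defaults, not values from
    any particular PDF.
    """

    def is_heading_char(obj):
        # Drop everything that is not a character (lines, rects, images, ...).
        if obj.get("object_type") != "char":
            return False
        is_large = obj.get("size", 0) >= min_size
        is_bold = font_keyword in obj.get("fontname", "")
        return is_large or is_bold

    return is_heading_char


def extract_headings(path, page_number=0, **options):
    """Hypothetical helper: text made up of heading-like characters on one page."""
    import pdfplumber  # imported lazily; the test function above is pure Python

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number]
        return page.filter(make_heading_test(**options)).extract_text()
```

Inspecting a few entries of page.chars first shows which font names and sizes the document actually uses, which makes picking the criteria much easier.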

coder0028 commented 4 years ago

Thanks @samkit-jain. The approach in https://github.com/jsvine/pdfplumber/issues/242#issuecomment-668448246 works!