Closed coder0028 closed 4 years ago
Hi @coder0028, thanks for your interest in the library. What type of non-tabular data do you want to extract? You can extract the text using page.extract_text() and the words using page.extract_words(). If you want only the non-tabular text, you can apply postprocessing logic to filter out the words that fall inside a table's bounding box (you can also look at the .filter() method). If you want to extract form values, have a look at https://github.com/jsvine/pdfplumber#extracting-form-values.
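For illustration, the postprocessing idea could be sketched roughly like this. It assumes the word dicts returned by page.extract_words() (with x0, x1, top, bottom and text keys) and table bounding boxes taken from page.find_tables(). The helper names words_outside_tables and non_table_text are made up for this example, and the "center point inside the box" rule is just one reasonable choice.

```python
def words_outside_tables(words, table_bboxes):
    """Keep the words whose center point lies outside every table bounding box.

    words: list of dicts with "x0", "x1", "top", "bottom" keys
    table_bboxes: list of (x0, top, x1, bottom) tuples
    """
    def inside(word, bbox):
        x0, top, x1, bottom = bbox
        h_mid = (word["x0"] + word["x1"]) / 2
        v_mid = (word["top"] + word["bottom"]) / 2
        return x0 <= h_mid <= x1 and top <= v_mid <= bottom

    return [w for w in words if not any(inside(w, b) for b in table_bboxes)]


def non_table_text(pdf_path, page_number=0):
    """Hypothetical wrapper: join the non-table words of one page into a string."""
    import pdfplumber  # imported here so the helper above stays dependency-free

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        bboxes = [table.bbox for table in page.find_tables()]
        words = words_outside_tables(page.extract_words(), bboxes)
    return " ".join(w["text"] for w in words)
```

Joining with spaces loses the line structure, so this is mostly useful when you want to work with the words themselves rather than with laid-out text.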
Thanks @samkit-jain for the response.
The PDF here consists of tables (tabular) plus headings and paragraphs (non-tabular). I am able to fetch the table data correctly, but how can I get the text of the remaining paragraphs and headings? Also, can you please help with the .filter() method?
Have a look at https://github.com/jsvine/pdfplumber/issues/242#issuecomment-668448246. You can use it as an example of .filter() and also of how to get the data outside of a table.
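As a rough sketch of what a .filter()-based approach can look like (this is not copied from the linked comment, and the names outside_bboxes and text_outside_tables are hypothetical): page.filter() takes a function that receives each page object as a dict and returns True for the objects to keep, so the test below keeps everything whose midpoint is outside all table bounding boxes.

```python
def outside_bboxes(bboxes):
    """Build a predicate for page.filter() that drops objects inside any bbox.

    bboxes: list of (x0, top, x1, bottom) tuples, e.g. from table.bbox
    """
    def keep(obj):
        h_mid = (obj["x0"] + obj["x1"]) / 2
        v_mid = (obj["top"] + obj["bottom"]) / 2
        return not any(
            x0 <= h_mid < x1 and top <= v_mid < bottom
            for (x0, top, x1, bottom) in bboxes
        )

    return keep


def text_outside_tables(pdf_path, page_number=0):
    """Hypothetical wrapper: text of one page with the table regions filtered out."""
    import pdfplumber  # imported here so the predicate can be used without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        bboxes = [table.bbox for table in page.find_tables()]
        return page.filter(outside_bboxes(bboxes)).extract_text()
```

Because the filtering happens before extract_text(), the remaining headings and paragraphs keep their normal line breaks.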
If you could share the PDF (after redacting any sensitive information) and your expected output, I could help more effectively. Without the PDF, I can only suggest filtering based on the font name, character height, etc. to get the data you need.
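To give an idea of the font-based filtering, here is a minimal sketch. It assumes the char objects carry fontname and size keys, as pdfplumber chars normally do. The "Bold" substring and the size threshold of 12 are placeholder values that would have to be adapted to the actual PDF, and make_char_filter and extract_headings are hypothetical names.

```python
def make_char_filter(font_substring=None, min_size=None):
    """Build a predicate for page.filter() that keeps only matching chars.

    Non-char objects (lines, rects, ...) are kept unchanged.
    """
    def keep(obj):
        if obj.get("object_type") != "char":
            return True
        if font_substring is not None and font_substring not in obj.get("fontname", ""):
            return False
        if min_size is not None and obj.get("size", 0) < min_size:
            return False
        return True

    return keep


def extract_headings(pdf_path, page_number=0):
    """Hypothetical wrapper: text of the chars that look like headings."""
    import pdfplumber  # imported here so the predicate can be used without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        # Placeholder criteria: bold font, at least 12 pt. Adjust for your PDF.
        heading_filter = make_char_filter(font_substring="Bold", min_size=12)
        return page.filter(heading_filter).extract_text()
```

Inspecting a few entries of page.chars first is a quick way to see which font names and sizes the document actually uses.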
Thanks @samkit-jain. This https://github.com/jsvine/pdfplumber/issues/242#issuecomment-668448246 works!
I am trying to extract both tabular and non-tabular text from a PDF. Fetching the tabular information is pretty straightforward using pdf.extract_tables(), but I couldn't find any way to get the rest of the text from the PDF. Is there something I might be missing?