Extract header and Subheader

ibrahimshuail commented 3 years ago

I need to extract the header and sub-headers of PDF files. I tried using regex to extract the sub-header before the colon, but a few sub-headers don't have a colon. I then tried Beautiful Soup, converting the PDF to HTML and pulling the sub-headers from that, but this also fails. Is there any way in pdfplumber to extract only the header and sub-headers? I have attached a sample PDF, for which I need output in the format below:

{ "Employee Details", "Employee Name", "Employee ID ", "DOB ", " Gender", " Designation", " Reporting Manager", " Mobile ", "City " }

I don't know whether it's a simple one, but I'm stuck on this.

samkit-jain commented 3 years ago

Hi @ibrahimshuail Appreciate your interest in the library. Could you please provide the PDF (with sensitive information redacted)?

ibrahimshuail commented 3 years ago

employee_details.pdf please find the attached pdf @samkit-jain

samkit-jain commented 3 years ago

Thanks for sharing the PDF @ibrahimshuail You would have to use regex-based searching here. An example for extracting the employee name, DOB and gender:

import re
import pdfplumber

pdf = pdfplumber.open("file.pdf")
p = pdf.pages[0]
text = p.extract_text(y_tolerance=10, x_tolerance=3)

employee_name = re.search(r"Employee\s*Name\s*:\s*([A-Za-z ]+)\s*Employee", text).group(1)
dob = re.search(r"DOB\s*(\d{2}-\d{2}-\d{4})", text).group(1)
gender = re.search(r"Gender\s*:\s*([a-z]+)", text).group(1)

print(employee_name)
print(dob)
print(gender)

The code I have shared is very minimal, and I am pretty sure you would need to add some processing and error handling. But I do believe it gives you a direction to work in. For example, I had to specify a higher y_tolerance when extracting text so that the extracted text matches the visual layout. With the default value, "Employee Name" and its value end up on separate lines, as the two outputs below show.

Extracted text with .extract_text(y_tolerance=10, x_tolerance=3)

Employee  Details 

Employee  Name : Ibrahim                       Employee   ID : hecemp102285                        
DOB     29-09-1993                                               Gender : male                        
Designation :    Software developer                                                 Reporting Manager : Mike                        
Mobile : 9851539                       City  Bangalore

Extracted text with .extract_text()

Employee  Details 

Employee  Name
 : Ibrahim                       Employee   ID : hecemp102285                        
DOB     29-09-1993                                               Gender : male                        
Designation :    Software developer                                                 Reporting Manager : Mike                        
Mobile : 9851539                       City  Bangalore
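
As a rough sketch of the error handling mentioned above: calling .group(1) directly raises an AttributeError when a pattern does not match, so a small wrapper can return None instead. The helper name find_first is hypothetical and not part of pdfplumber.

```python
import re


def find_first(pattern, text):
    """Return the first capture group of `pattern` in `text`, or None if absent.

    `find_first` is a hypothetical helper for illustration; the pattern is
    expected to contain at least one capture group.
    """
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None
```

With it, a lookup such as find_first(r"DOB\s*(\d{2}-\d{2}-\d{4})", text) yields None for PDFs that lack the field, rather than crashing.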

I would recommend that you also have a look at https://github.com/invoice-x/invoice2data

ibrahimshuail commented 3 years ago

@samkit-jain I don't want to extract the values. I need to extract only the sub-headers, not their values. These are not standardized PDFs: in a few of them the field is "Employee Name", in others it is just "Name". The code you provided requires defining the sub-headers in advance. Can you suggest a way to capture the sub-headers based on bold letters, the colon, or something like that?

samkit-jain commented 3 years ago

Oh ok. Thanks for the clarification @ibrahimshuail You can use the following code to keep only the bold characters.

def keep_bold_chars(obj):
    if obj['object_type'] == 'char':
        return 'Bold' in obj['fontname']
    return True

page = page.filter(keep_bold_chars)

In the PDF that you shared, the bold characters have the font name as "ABCDEE+Calibri,Bold".
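
For other PDFs the bold font may be named differently, so it can help to list the font names first. A minimal sketch, assuming the character dictionaries carry a 'fontname' key as in page.chars; the helper name fontname_counts is hypothetical.

```python
from collections import Counter


def fontname_counts(chars):
    """Count how many characters use each font name.

    `chars` is a list of character dicts such as pdfplumber's `page.chars`;
    entries without a 'fontname' key are skipped.
    """
    return Counter(c["fontname"] for c in chars if "fontname" in c)
```

Calling fontname_counts(page.chars) and looking at the result should show which name to match inside keep_bold_chars.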

Then, to get the individual subheader keys like "DOB" and "Gender", you can perform word clustering. Get a list of all the words from extract_words(). Iterate over them top to bottom and left to right using the word coordinate values. Going left to right, when the horizontal gap between two consecutive words exceeds a certain threshold, close the current cluster and start a new one. Each cluster gives you one subheader key.
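
The clustering described above could be sketched roughly as follows. It works on word dicts with the 'text', 'x0', 'x1' and 'top' keys that extract_words() returns; the function name cluster_words and the gap and line_tol thresholds are assumptions that would need tuning per document.

```python
def cluster_words(words, gap=15, line_tol=3):
    """Group words into clusters, splitting on large horizontal gaps.

    `gap` and `line_tol` are illustrative thresholds in PDF points,
    not values taken from pdfplumber.
    """
    # Group words into lines: top to bottom, a new line starts when
    # the vertical position jumps by more than line_tol.
    lines = []
    for word in sorted(words, key=lambda w: w["top"]):
        if lines and abs(word["top"] - lines[-1][-1]["top"]) <= line_tol:
            lines[-1].append(word)
        else:
            lines.append([word])

    clusters = []
    for line in lines:
        # Within a line, walk left to right and split on wide gaps.
        line.sort(key=lambda w: w["x0"])
        current = [line[0]]
        for word in line[1:]:
            if word["x0"] - current[-1]["x1"] > gap:
                clusters.append(" ".join(w["text"] for w in current))
                current = [word]
            else:
                current.append(word)
        clusters.append(" ".join(w["text"] for w in current))
    return clusters
```

Applied to the bold-only page, for example cluster_words(page.extract_words()) after the filter above, each returned string should correspond to one subheader.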

Assuming that there would always be non-bold text to the right of the bold text, an easier implementation would be the following

import pdfplumber

pdf = pdfplumber.open("file.pdf")
p = pdf.pages[0]
subheaders = []
val = ''

for word in p.extract_words(x_tolerance=3, y_tolerance=10, extra_attrs=['fontname'])[2:]:  # Using [2:] to skip the main title 
    if 'Bold' in word['fontname']:
        val = ' '.join([val, word['text']]).strip()
    else:
        if val != '':
            subheaders.append(val)
        val = ''

print(subheaders)
# ['Employee Name :', 'Employee ID :', 'DOB', 'Gender :', 'Designation :', 'Reporting Manager :', 'Mobile :', 'City']

It makes use of the extra_attrs argument added in v0.5.24

ibrahimshuail commented 3 years ago

Thanks a lot, @samkit-jain. Really the solution helped me a lot ...

iaditij commented 9 months ago

What if I want to extract headers together with the content under them, i.e. chunk the whole PDF into sections? Can anyone help me with this?

Anurag-38 commented 7 months ago

What if I want to extract headers together with the content under them, i.e. chunk the whole PDF into sections? Can anyone help me with this? @iaditij

Hi,

Did you find a method for the task you described? I have a similar use case and am looking for any resources that could help. I am aware of a library named llmsherpa, which might be useful for you, but it makes an external API call to an unknown endpoint. Since I have confidentiality concerns, I cannot use it. Please suggest any alternative you know of.

Thanks in advance.

jsvine / pdfplumber

Extract header and Subheader #299