Closed ibrahimshuail closed 3 years ago
Hi @ibrahimshuail, I appreciate your interest in the library. Could you please provide the PDF (with sensitive information redacted)?
employee_details.pdf please find the attached pdf @samkit-jain
Thanks for sharing the PDF, @ibrahimshuail. You would have to use regex-based searching here. An example for extracting the employee name, DOB and gender:
import re
import pdfplumber
pdf = pdfplumber.open("file.pdf")
p = pdf.pages[0]
text = p.extract_text(y_tolerance=10, x_tolerance=3)
employee_name = re.search(r"Employee\s*Name\s*:\s*([A-Za-z ]+)\s*Employee", text).group(1)
dob = re.search(r"DOB\s*(\d{2}-\d{2}-\d{4})", text).group(1)
gender = re.search(r"Gender\s*:\s*([a-z]+)", text).group(1)
print(employee_name)
print(dob)
print(gender)
The code I have shared is very minimal, and I am pretty sure you would need to add some processing and error handling. But I do believe it gives you a direction to work in. For example, I had to specify a higher y_tolerance when extracting text so that the extracted text matches the visual layout.
Extracted text with .extract_text(y_tolerance=10, x_tolerance=3)
Employee Details
Employee Name : Ibrahim Employee ID : hecemp102285
DOB 29-09-1993 Gender : male
Designation : Software developer Reporting Manager : Mike
Mobile : 9851539 City Bangalore
Extracted text with .extract_text()
Employee Details
Employee Name
: Ibrahim Employee ID : hecemp102285
DOB 29-09-1993 Gender : male
Designation : Software developer Reporting Manager : Mike
Mobile : 9851539 City Bangalore
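The error handling mentioned above could look roughly like the following. This is only a sketch: `find_field` is a hypothetical helper (not part of pdfplumber), the "Salary" pattern is made up to show a missing field, and the sample text is copied from the y_tolerance=10 output above so the snippet runs without the PDF.

```python
import re


def find_field(pattern, text, default=None):
    """Return the first capture group of `pattern` in `text`, or `default` if nothing matches."""
    match = re.search(pattern, text)
    # re.search returns None on no match, so guard before calling .group()
    return match.group(1).strip() if match else default


# Sample text copied from the extract_text(y_tolerance=10, x_tolerance=3) output above.
text = (
    "Employee Details\n"
    "Employee Name : Ibrahim Employee ID : hecemp102285\n"
    "DOB 29-09-1993 Gender : male\n"
    "Designation : Software developer Reporting Manager : Mike\n"
    "Mobile : 9851539 City Bangalore\n"
)

employee_name = find_field(r"Employee\s*Name\s*:\s*([A-Za-z ]+)\s*Employee", text)
dob = find_field(r"DOB\s*(\d{2}-\d{2}-\d{4})", text)
gender = find_field(r"Gender\s*:\s*([a-z]+)", text)
salary = find_field(r"Salary\s*:\s*(\d+)", text)  # hypothetical field, absent from this PDF

print(employee_name, dob, gender, salary)
```

Returning a default instead of raising keeps one missing field from aborting the whole extraction, which matters when the PDFs are not uniform.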
I would recommend that you also have a look at https://github.com/invoice-x/invoice2data
@samkit-jain I don't want to extract the values. I need to extract only the subheaders, not their values. These are not standard PDFs: in a few PDFs we have "Employee Name", in others the column is just "Name". The code you provided hard-codes the subheaders. Can you suggest a way to capture the subheaders based on bold letters, a colon, or something like that?
Oh, OK. Thanks for the clarification, @ibrahimshuail. You can use the following code to keep only the bold characters.
def keep_bold_chars(obj):
    # Keep a char only if its font name marks it as bold; keep all non-char objects
    if obj['object_type'] == 'char':
        return 'Bold' in obj['fontname']
    return True

page = page.filter(keep_bold_chars)
In the PDF that you shared, the bold characters have the font name as "ABCDEE+Calibri,Bold".
Then, to get the individual subheader keys like "DOB" and "Gender", you can perform word clustering. Get a list of all the words from extract_words(). Iterate over them top to bottom and left to right using the word coordinate values. Going left to right, when the gap between two consecutive words exceeds a certain threshold, close the current cluster and start a new one. Each cluster would give you the subheader key you need.
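The clustering idea could be sketched as below. This is an illustration under assumptions, not tested against the shared PDF: `cluster_words`, `max_gap` and `line_tolerance` are hypothetical names and values, the coordinates are made up, and words on the same line are assumed to report the same `top`. The dict keys (`text`, `x0`, `x1`, `top`) are the ones extract_words() returns.

```python
def cluster_words(words, max_gap=10, line_tolerance=3):
    """Group words into clusters: a new cluster starts on a new line
    or when the horizontal gap to the previous word exceeds max_gap."""
    # Order top to bottom, then left to right (assumes same-line words share `top`).
    ordered = sorted(words, key=lambda w: (w["top"], w["x0"]))
    clusters, current = [], []
    for word in ordered:
        if current:
            prev = current[-1]
            same_line = abs(word["top"] - prev["top"]) <= line_tolerance
            gap = word["x0"] - prev["x1"]
            if not same_line or gap > max_gap:
                clusters.append(" ".join(w["text"] for w in current))
                current = []
        current.append(word)
    if current:  # flush the last cluster
        clusters.append(" ".join(w["text"] for w in current))
    return clusters


# Made-up coordinates standing in for the bold words left after page.filter(...)
bold_words = [
    {"text": "Employee", "x0": 50, "x1": 95, "top": 100},
    {"text": "Name", "x0": 98, "x1": 125, "top": 100},
    {"text": ":", "x0": 128, "x1": 131, "top": 100},
    {"text": "Employee", "x0": 300, "x1": 345, "top": 100},
    {"text": "ID", "x0": 348, "x1": 360, "top": 100},
    {"text": ":", "x0": 363, "x1": 366, "top": 100},
    {"text": "DOB", "x0": 50, "x1": 72, "top": 120},
    {"text": "Gender", "x0": 300, "x1": 335, "top": 120},
    {"text": ":", "x0": 338, "x1": 341, "top": 120},
]

clusters = cluster_words(bold_words)
print(clusters)
# ['Employee Name :', 'Employee ID :', 'DOB', 'Gender :']
```

The threshold would need tuning per document, since the gap between words depends on font size.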
Assuming that there is always non-bold text to the right of the bold text, an easier implementation would be the following:
import re
import pdfplumber
pdf = pdfplumber.open("file.pdf")
p = pdf.pages[0]
subheaders = []
val = ''
for word in p.extract_words(x_tolerance=3, y_tolerance=10, extra_attrs=['fontname'])[2:]:  # Using [2:] to skip the main title
    if 'Bold' in word['fontname']:
        val = ' '.join([val, word['text']]).strip()
    else:
        if val != '':
            subheaders.append(val)
        val = ''
print(subheaders)
# ['Employee Name :', 'Employee ID :', 'DOB', 'Gender :', 'Designation :', 'Reporting Manager :', 'Mobile :', 'City']
It makes use of the extra_attrs argument added in v0.5.24.
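If the trailing colons are not wanted, or if a bold label can be the last thing on the page, the loop could be adapted along these lines. This is a sketch only: `bold_runs` and `strip_colon` are hypothetical helpers, and the non-bold font name "ABCDEE+Calibri" is an assumption. The function takes word dicts shaped like those from extract_words(extra_attrs=['fontname']).

```python
def bold_runs(words):
    """Join consecutive bold words into labels; also keep a bold run that ends the input."""
    runs, current = [], []
    for word in words:
        if "Bold" in word["fontname"]:
            current.append(word["text"])
        elif current:
            runs.append(" ".join(current))
            current = []
    if current:  # flush a bold run with no non-bold text after it
        runs.append(" ".join(current))
    return runs


def strip_colon(label):
    """Drop trailing colons and spaces, e.g. 'Gender :' becomes 'Gender'."""
    return label.rstrip(" :")


# Stand-in word dicts; the regular font name is assumed for illustration.
BOLD, REGULAR = "ABCDEE+Calibri,Bold", "ABCDEE+Calibri"
words = [
    {"text": "DOB", "fontname": BOLD},
    {"text": "29-09-1993", "fontname": REGULAR},
    {"text": "Gender", "fontname": BOLD},
    {"text": ":", "fontname": BOLD},
    {"text": "male", "fontname": REGULAR},
    {"text": "City", "fontname": BOLD},
]

labels = [strip_colon(run) for run in bold_runs(words)]
print(labels)
# ['DOB', 'Gender', 'City']
```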
Thanks a lot, @samkit-jain. The solution really helped me a lot.
What if I want to extract headers together with the content under them, i.e. chunk the whole PDF into sections? Can anyone help me with this?
@iaditij
Hi,
Did you find any method for the task you described? I have a similar use case and am looking for resources that could be used. I am aware of a library named llmsherpa which might be useful for you, but it makes an external API call to an unknown endpoint, and because of confidentiality concerns I cannot use it. Please suggest any alternative that you know of.
Thanks in advance.
I need to extract the header and subheaders of PDF files. I tried using regular expressions to extract the subheader before the colon, but a few subheaders don't have a colon. I then tried using Beautiful Soup after converting the PDF to HTML to get the subheaders, but that also fails. Is there any way in pdfplumber to extract only the header and subheaders? I have attached a sample PDF, for which I need output in the format below:
{ "Employee Details", "Employee Name", "Employee ID ", "DOB ", " Gender", " Designation", " Reporting Manager", " Mobile ", "City " }
I don't know whether it's a simple one, but I got stuck on this.