Open krambox opened 4 years ago
If you can share your PDF form, or a redacted/modified version of it, I can write a yaml and add it to the examples.
Can you help write a YAML example for the PDF below? Sample_Certificate.pdf. I want to extract 3 fields: 1) Name (i.e. Alvin Lam) 2) Course Name (i.e. Get Ready: Club Public Image Committee) 3) Completion Date (i.e. 5/25/20)
@stephenbyrne-mfj I've added some stuff in issue #19. Maybe you can use that, if we can get it working.
---- UPDATED ---
It has now been fixed, so feel free to use it and include those docs and images into your example folders.
For Sample_Certificate.pdf:
---
extractor: "pdf.itext5"
header:
  default: 200 # Ignore the "Learning and Development" portion
footer:
  default: 500
maxRowDistance: 2
rootRecordType: certificate
recordTypes:
  certificate:
    label: "Certificate"
    valueTypes:
      - name
      - course
      - date
valueTypes:
  name:
    label: "Name"
  course:
    label: "Course Name"
  date:
    label: "Completion Date"
    # strip out the "on " before the date
    replacements:
      -
        pattern: "on\ *(.*)"
        replacement: "$1"
initialState: "INIT"
states:
  INIT:
    transitions:
      -
        condition: certifiesThat
        nextState: certifiesThat
  certifiesThat:
    include: false
    transitions:
      -
        condition: any
        nextState: name
  name:
    transitions:
      -
        condition: hasSuccessfullyCompleted
        nextState: hasSuccessfullyCompleted
      -
        condition: any
        nextState: name
  hasSuccessfullyCompleted:
    include: false
    transitions:
      -
        condition: any
        nextState: course
  course:
    transitions:
      -
        condition: date
        nextState: date
      -
        condition: any
        nextState: course
  date:
    transitions:
      -
        condition: certifiesThat
        nextState: certifiesThat
conditions:
  any: '1 = 1'
  certifiesThat: 'text = "Certifies that"'
  hasSuccessfullyCompleted: 'text = "has successfully completed"'
  date: 'text =~ /on .*/ and fontSize = 14.0'
Generates:
page,Name,Course Name,Completion Date
1,Alvin Lam,Get Ready: Club Public Image Committee,5/25/20
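To make the state machine in that YAML easier to follow, here is a rough Python sketch that walks the same states and conditions over a list of text fragments. It is only an illustration, not Textricator's implementation: the conditions are hand-translated into Python, the font sizes in the sample data are invented (only the 14.0 on the date line comes from the config), and the choices that the first matching transition wins, that unmatched text is skipped, and that `=~` matches from the start of the text are assumptions.

```python
import re

# Conditions from the YAML, hand-translated into Python predicates.
# Assumption: "=~" is treated here as a match anchored at the start of the text.
CONDITIONS = {
    "any": lambda t: True,
    "certifiesThat": lambda t: t["text"] == "Certifies that",
    "hasSuccessfullyCompleted": lambda t: t["text"] == "has successfully completed",
    "date": lambda t: re.match(r"on .*", t["text"]) is not None and t["fontSize"] == 14.0,
}

# state -> ordered list of (condition, nextState).
# Assumption: transitions are tried in order and the first match wins.
TRANSITIONS = {
    "INIT": [("certifiesThat", "certifiesThat")],
    "certifiesThat": [("any", "name")],
    "name": [("hasSuccessfullyCompleted", "hasSuccessfullyCompleted"), ("any", "name")],
    "hasSuccessfullyCompleted": [("any", "course")],
    "course": [("date", "date"), ("any", "course")],
    "date": [("certifiesThat", "certifiesThat")],
}

# States marked "include: false" in the YAML: their text is not kept.
EXCLUDED = {"certifiesThat", "hasSuccessfullyCompleted"}


def run(texts):
    """Walk the state machine over the texts and collect one record."""
    state = "INIT"
    record = {}
    for t in texts:
        for condition, next_state in TRANSITIONS[state]:
            if CONDITIONS[condition](t):
                state = next_state
                break
        else:
            continue  # no transition matched: skip this text (assumed behavior)
        if state not in EXCLUDED:
            record.setdefault(state, []).append(t["text"])
    joined = {key: " ".join(parts) for key, parts in record.items()}
    if "date" in joined:
        # Same idea as the "replacements" entry: drop the leading "on ".
        joined["date"] = re.sub(r"on *(.*)", r"\1", joined["date"])
    return joined


# Invented sample input: the text fragments in reading order.
texts = [
    {"text": "Certifies that", "fontSize": 12.0},
    {"text": "Alvin Lam", "fontSize": 24.0},
    {"text": "has successfully completed", "fontSize": 12.0},
    {"text": "Get Ready: Club Public Image Committee", "fontSize": 18.0},
    {"text": "on 5/25/20", "fontSize": 14.0},
]
result = run(texts)
print(result)
```

The `any` transitions on `name` and `course` loop back to the same state, which is how a value that spans several text fragments would keep accumulating until the next marker text shows up.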
@krambox This is a good simple example. If you create a pull request to add the PDF to src/test/resources/io/mfj/textricator/examples/, we will include it.
I do not want to assume that we can distribute your PDF; you submitting the PR makes the permission more explicit.
Speaking of tutorials.
One of the most bewildering things about using this tool is understanding the FSM code. (I still don't TBH.) Can you provide some links on how we can learn the YAML for the FSM? Is it general enough for its use here?
The second most challenging part is making the precise and correct measurements on the original PDF files.
I strongly suggest that each example PDF file be supplemented with an image of the very same PDF that includes the measurements drawn in, as I have done in issue #19. (See the instructions below.)
This is very easy to do in Acrobat Reader DC if you open the additional tools. Then just select Measuring Tool and right-click somewhere on the page. There you first have to select Change Scale Ratio and Precision and use 1 pt = 1 pt in the pop-up box. There are some bugs in Adobe that (a) make it forget these settings, and (b) get it really confused when measuring close to the edge of the document. It takes a few tries until you get the hang of it.
Maybe you should add the above note and picture to your Wiki, or to the readme, or wherever it can be easily found....
I usually run textricator text input.pdf input-text.csv and then open input-text.csv in LibreOffice Calc (or whatever your favorite spreadsheet tool is). I just added a section to the readme about this: https://github.com/measuresforjustice/textricator/commit/97b7b331b611c4fe140e63a43cd415de91c6afb0
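If a spreadsheet is inconvenient, the same CSV can also be inspected with a short script, for example to list whatever text sits inside one rectangle of the page. The sketch below is only an illustration: the column names `page`, `ulx`, `uly` and `content` are assumptions about the CSV header and should be checked against the first line of your own input-text.csv, and the sample rows and coordinates are invented.

```python
import csv
import io


def texts_in_box(csv_file, page, left, top, right, bottom):
    """Return the content of every text whose upper-left corner is inside the box.

    Coordinates are in points. Column names are assumed, not confirmed.
    """
    hits = []
    for row in csv.DictReader(csv_file):
        if int(row["page"]) != page:
            continue
        x, y = float(row["ulx"]), float(row["uly"])
        if left <= x <= right and top <= y <= bottom:
            hits.append(row["content"])
    return hits


# Stand-in for input-text.csv; these rows are invented for the example.
sample = io.StringIO(
    "page,ulx,uly,content\n"
    "1,72.0,100.0,Invoice\n"
    "1,400.0,150.5,12345\n"
    "2,400.0,150.5,67890\n"
)
hits = texts_in_box(sample, page=1, left=390, top=140, right=420, bottom=160)
print(hits)
```

With a real file, the `io.StringIO` stand-in would be replaced by `open("input-text.csv", newline="")`.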
Writing the FSM code is definitely the hardest part. I agree there should be a tutorial (or maybe a video?) that explains how it works and walks through developing one for a simple document.
Yeah, usually when people do FSM, they include a diagram with a "loopy" chart, known as a State Diagram. It would be great if we could find a tool to generate this for us...
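As a stopgap for the state diagram idea, a few lines of Python can turn a state table into Graphviz DOT text, which the `dot` command can then render. This is a hypothetical helper, not part of Textricator: the `states` dictionary is typed in by hand from the certificate YAML above rather than parsed from the file, and `to_dot` is a name made up for this sketch.

```python
def to_dot(states, initial):
    """Build Graphviz DOT text for a state table.

    states maps each state name to a list of (condition, next_state) pairs.
    """
    lines = [
        "digraph fsm {",
        "  rankdir=LR;",
        f'  "{initial}" [shape=doublecircle];',
    ]
    for state, transitions in states.items():
        for condition, next_state in transitions:
            lines.append(f'  "{state}" -> "{next_state}" [label="{condition}"];')
    lines.append("}")
    return "\n".join(lines)


# Copied by hand from the certificate example's "states" section.
states = {
    "INIT": [("certifiesThat", "certifiesThat")],
    "certifiesThat": [("any", "name")],
    "name": [("hasSuccessfullyCompleted", "hasSuccessfullyCompleted"), ("any", "name")],
    "hasSuccessfullyCompleted": [("any", "course")],
    "course": [("date", "date"), ("any", "course")],
    "date": [("certifiesThat", "certifiesThat")],
}
dot = to_dot(states, "INIT")
print(dot)
```

Saving the printed text as fsm.dot and running `dot -Tpng fsm.dot -o fsm.png` (with Graphviz installed) should draw the loops on `name` and `course` as self-edges.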
The links from Wikipedia:
From the old SMC tutorial:
Does anyone know of a tutorial to get started, or a few more (but simpler) examples of YAML files? I'm having a hard time right now. For example, I want to extract a number from a PDF form at exactly one position.