measuresforjustice / textricator

Textricator is a tool to extract text from documents and generate structured data.
https://textricator.mfj.io
GNU Affero General Public License v3.0

Tutorial or Quickstart? #11

Open krambox opened 4 years ago

krambox commented 4 years ago

Does anyone know of a tutorial for getting started, or a few more (but simpler) examples of YAML files? I'm having a hard time right now. For example, I want to extract a number from a PDF form at exactly one position.

stephenbmfj commented 4 years ago

If you can share your PDF form, or a redacted/modified version of it, I can write a yaml and add it to the examples.

alvinets commented 3 years ago

Could you help write a YAML example for the PDF below?
Sample_Certificate.pdf
I want to extract 3 fields:
1) Name (e.g. Alvin Lam)
2) Course Name (e.g. Get Ready: Club Public Image Committee)
3) Completion Date (e.g. 5/25/20)

eabase commented 3 years ago

@stephenbyrne-mfj I've added some stuff in issue #19. Maybe you can use that, if we can get it working.

---- UPDATED ---

It has now been fixed, so feel free to use it and include those docs and images in your example folders.

stephenbmfj commented 3 years ago

For Sample_Certificate.pdf:

---
extractor: "pdf.itext5"

header:
  default: 200 # Ignore the "Learning and Development" portion
footer:
  default: 500

maxRowDistance: 2

rootRecordType: certificate
recordTypes:
  certificate:
    label: "Certificate"
    valueTypes:
      - name
      - course
      - date

valueTypes:
  name:
    label: "Name"
  course:
    label: "Course Name"
  date:
    label: "Completion Date"
    # strip out the "on " before the date
    replacements:
      -
        pattern: "on\ *(.*)"
        replacement: "$1"

initialState: "INIT"

states:
  INIT:
    transitions:
      -
        condition: certifiesThat
        nextState: certifiesThat

  certifiesThat:
    include: false
    transitions:
      -
        condition: any
        nextState: name

  name:
    transitions:
      -
        condition: hasSuccessfullyCompleted
        nextState: hasSuccessfullyCompleted
      -
        condition: any
        nextState: name

  hasSuccessfullyCompleted:
    include: false
    transitions:
      -
        condition: any
        nextState: course

  course:
    transitions:
      -
        condition: date
        nextState: date
      -
        condition: any
        nextState: course

  date:
    transitions:
      -
        condition: certifiesThat
        nextState: certifiesThat

conditions:

  any: '1 = 1'

  certifiesThat: 'text = "Certifies that"'

  hasSuccessfullyCompleted: 'text = "has successfully completed"'

  date: 'text =~ /on .*/ and fontSize = 14.0'

Generates:

page,Name,Course Name,Completion Date
1,Alvin Lam,Get Ready: Club Public Image Committee,5/25/20

stephenbmfj commented 3 years ago

@krambox This is a good simple example. If you create a pull request to add the PDF to src/test/resources/io/mfj/textricator/examples/, we will include it.

I do not want to assume that we can distribute your PDF; you submitting the PR makes the permission more explicit.

eabase commented 3 years ago

Speaking of tutorials:

  1. One of the most bewildering things about using this tool is understanding the FSM code. (I still don't, TBH.) Can you provide some links on how we can learn the YAML for the FSM? Is it general enough for its use here?

  2. The second most challenging part is making precise and correct measurements on the original PDF files.
    I strongly suggest supplementing each example PDF file with an image of the very same PDF that has the measurements drawn in, as I have done in issue #19. (See the instructions below.)

Measuring up your PDF

This is very easy to do in Acrobat Reader DC if you open the additional tools. Just select Measuring Tool and right-click somewhere on the page. There you first have to select Change Scale Ratio and Precision and enter 1 pt = 1 pt in the pop-up box. There are some bugs in Adobe that (a) make it forget these settings, and (b) make it really confused when measuring close to the edge of the document. It takes a few tries until you get the hang of it.

AcroRd32_2021-05-17_23-23-45


Maybe you should add the above note and picture to your Wiki, or to the readme, or wherever it can be easily found....

stephenbmfj commented 3 years ago

I usually run textricator text input.pdf input-text.csv and then open input-text.csv in LibreOffice Calc (or whatever your favorite spreadsheet tool is). I just added a section to the readme about this: https://github.com/measuresforjustice/textricator/commit/97b7b331b611c4fe140e63a43cd415de91c6afb0
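
If a spreadsheet is not handy, the same inspection can be sketched in Python with only the standard library. The helper name `summarize_columns` and the sample rows and column names (`content`, `fontSize`, and all values) are assumptions for illustration, so check the header row of your own input-text.csv before relying on any column name.

```python
import csv
import io
from collections import Counter


def summarize_columns(csv_file, columns):
    """Count how often each distinct value appears in the given columns."""
    reader = csv.DictReader(csv_file)
    counts = {name: Counter() for name in columns}
    for row in reader:
        for name in columns:
            # Skip columns that are absent from this file's header row.
            if name in row and row[name] is not None:
                counts[name][row[name]] += 1
    return counts


# Hypothetical sample standing in for input-text.csv (made-up rows and columns).
SAMPLE = io.StringIO(
    "page,content,fontSize\n"
    "1,Certifies that,10.0\n"
    "1,Alvin Lam,24.0\n"
    "1,on 5/25/20,14.0\n"
)

if __name__ == "__main__":
    for column, counter in summarize_columns(SAMPLE, ["fontSize"]).items():
        print(column, dict(counter))
```

For a real file, pass `open("input-text.csv", newline="")` instead of `SAMPLE`. Seeing which values occur in a column is a quick way to find something a condition can test for.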

stephenbmfj commented 3 years ago

Writing the FSM code is definitely the hardest part. I agree there should be a tutorial (or maybe a video?) that explains how it works and walks through developing one for a simple document.
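
As a possible starting point for such a walk-through, the sketch below simulates the state machine from the Sample_Certificate.pdf config above in plain Python. It is not how Textricator is implemented. The `run` function, the dict layout, and the text segments are assumptions for illustration, including every font size except the 14.0 that the `date` condition checks for. It only shows the core idea as I read the config: each text segment is tested against the current state's transitions in order, the first matching condition picks the next state, and the segment is recorded under that state unless the state has `include: false`.

```python
import re

# Conditions mirror the config above; each takes one text segment (a dict).
# Assumption: the regex in "date" has to match the whole text.
CONDITIONS = {
    "any": lambda seg: True,
    "certifiesThat": lambda seg: seg["text"] == "Certifies that",
    "hasSuccessfullyCompleted": lambda seg: seg["text"] == "has successfully completed",
    "date": lambda seg: re.fullmatch(r"on .*", seg["text"]) is not None
    and seg["fontSize"] == 14.0,
}

# (condition, nextState) pairs are tried in order; the first match wins.
STATES = {
    "INIT": {"include": True, "transitions": [("certifiesThat", "certifiesThat")]},
    "certifiesThat": {"include": False, "transitions": [("any", "name")]},
    "name": {"include": True, "transitions": [
        ("hasSuccessfullyCompleted", "hasSuccessfullyCompleted"), ("any", "name")]},
    "hasSuccessfullyCompleted": {"include": False, "transitions": [("any", "course")]},
    "course": {"include": True, "transitions": [("date", "date"), ("any", "course")]},
    "date": {"include": True, "transitions": [("certifiesThat", "certifiesThat")]},
}


def run(segments, initial_state="INIT"):
    """Feed segments through the state machine and collect text per state."""
    state = initial_state
    collected = {}
    for seg in segments:
        for condition, next_state in STATES[state]["transitions"]:
            if CONDITIONS[condition](seg):
                state = next_state
                if STATES[state]["include"]:
                    collected.setdefault(state, []).append(seg["text"])
                break
        # If no transition matches, the segment is skipped in this sketch.
    return {name: " ".join(parts) for name, parts in collected.items()}


# Hypothetical segments in reading order; font sizes other than 14.0 are made up.
SEGMENTS = [
    {"text": "Certifies that", "fontSize": 12.0},
    {"text": "Alvin Lam", "fontSize": 24.0},
    {"text": "has successfully completed", "fontSize": 12.0},
    {"text": "Get Ready: Club Public Image Committee", "fontSize": 18.0},
    {"text": "on 5/25/20", "fontSize": 14.0},
]

if __name__ == "__main__":
    print(run(SEGMENTS))
```

The sketch leaves out the `replacements` step, so the collected date still carries the leading "on ". The `any` transition back to the same state (as in `name` and `course`) is what lets a value span several text segments.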

eabase commented 3 years ago

Yeah, usually when people do FSM, they include a diagram with a "loopy" chart, known as a State Diagram. It would be great if we could find a tool to generate this for us...
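
One low-tech option might be to emit Graphviz DOT text from the states and transitions and let the `dot` command draw the diagram. The sketch below does that in plain Python for transitions copied by hand from the certificate config earlier in this thread. The `to_dot` function and the dict layout are assumptions for illustration, not part of Textricator, and a real tool would parse the YAML file instead.

```python
def to_dot(states, initial_state):
    """Return Graphviz DOT source with one edge per (condition, nextState)."""
    lines = ["digraph fsm {", "  rankdir=LR;"]
    # Mark the initial state with a double circle so it stands out.
    lines.append(f'  "{initial_state}" [shape=doublecircle];')
    for state, transitions in states.items():
        for condition, next_state in transitions:
            lines.append(f'  "{state}" -> "{next_state}" [label="{condition}"];')
    lines.append("}")
    return "\n".join(lines)


# Transitions copied by hand from the certificate config (state -> list of edges).
TRANSITIONS = {
    "INIT": [("certifiesThat", "certifiesThat")],
    "certifiesThat": [("any", "name")],
    "name": [("hasSuccessfullyCompleted", "hasSuccessfullyCompleted"), ("any", "name")],
    "hasSuccessfullyCompleted": [("any", "course")],
    "course": [("date", "date"), ("any", "course")],
    "date": [("certifiesThat", "certifiesThat")],
}

if __name__ == "__main__":
    print(to_dot(TRANSITIONS, "INIT"))
```

Saving the printed text as fsm.dot and running `dot -Tpng fsm.dot -o fsm.png` (with Graphviz installed) should give the "loopy" chart, with the self-transitions on `name` and `course` drawn as loops.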

The links from Wikipedia:

From the old SMC tutorial:

firefox_2021-05-18_10-35-02