unidoc / unipdf

Golang PDF library for creating and processing PDF files (pure go)
https://unidoc.io

Vectorized PDF text and object extraction #35

Open gunnsth opened 5 years ago

gunnsth commented 5 years ago

This issue is a master issue/epic and may spawn sub-issues that will be referenced from here.

Proposal

The extractor package will have the capability to extract vectorized text and objects (with position and dimensions).

Goal: Extract a list of graphics objects from each PDF page.

There are three types of graphics objects: text, paths (shapes), and images.

Each of these objects has a position and dimensions (a bounding box) on the page.

This is not a rendering system, but we hope to design it in a way that allows it to be extended into a renderer. Initial versions of the renderer could convert the lists of graphics objects to PDF or PostScript pages, which would provide closed-loop tests.

Definitions

There are at least three levels of text objects, all of which are composed of lower-level (lower-numbered in the following list) objects; a rough type sketch follows the list.

  1. Text elements emitted by the renderer as a result of PDF text operators like Tj.
     a. A text element’s properties include the text content, location and size in device coordinates, font, etc.
     b. Text elements can be used to recreate the text as it appears on the page.
  2. Paragraph fragments are created from the text elements on a page. Each paragraph fragment occupies a contiguous region on a single page.
     a. Paragraph fragments include the start of a paragraph that is completed on the following page/column, captions, form field labels, footnotes, etc.
     b. The paragraph fragments on a page can be used to make inferences about the page.
  3. Paragraphs are created from the paragraph fragments.
     a. Paragraphs can be used to extract the text of a PDF in plain-text format.
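
As a rough illustration of the hierarchy, the three levels could map to types like the following. This is a sketch only; TextElement, ParagraphFragment, and Paragraph are placeholder names, not an agreed API.

// Sketch only: placeholder types for the three text-object levels.

// TextElement corresponds to a single PDF text-showing operator such as Tj.
type TextElement struct {
    Text string  // Decoded text content.
    X, Y float64 // Location in device coordinates.
    W, H float64 // Size in device coordinates.
    Font string  // Font name; a real implementation would carry full font info.
}

// ParagraphFragment groups text elements into a contiguous region on a single page.
type ParagraphFragment struct {
    Elements []TextElement
}

// Paragraph joins paragraph fragments, possibly across pages or columns.
type Paragraph struct {
    Fragments []ParagraphFragment
}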

Initially we will only concern ourselves with stroked and filled paths, ignoring clipping paths.

// Path can define shapes, trajectories and regions of all sorts. Used to draw lines and define shapes of filled areas.
type Path struct {
    segments []lineSegment
}

// Only export if deemed necessary for outside access.
// For connected subpaths (segments), the x1, y1 coordinate will start at the x2, y2 coordinate of the previous segment.
type lineSegment struct {
    isCurved bool // Bezier curve if true, otherwise line.
    x1, y1   float64
    x2, y2   float64
    cx, cy   float64 // Control point (if curved).

    isNoop      bool // Path ended without filling/stroking.
    isStroked   bool
    strokeColor model.PdfColor
    isFilled    bool
    fillColor   model.PdfColor
    fillRule    windingNumberRule
}

type windingNumberRule int

const (
    nonZeroWindingNumberRule windingNumberRule = iota
    evenOddWindingNumberRule
)
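
As a usage example, a stroked unit square could be represented as four connected segments. This is a hypothetical construction over the types above, using unipdf's model.NewPdfColorDeviceGray for the stroke color.

// A stroked unit square starting at the origin. Each segment's (x1, y1)
// continues from the previous segment's (x2, y2).
black := model.NewPdfColorDeviceGray(0)
square := Path{segments: []lineSegment{
    {x1: 0, y1: 0, x2: 1, y2: 0, isStroked: true, strokeColor: black},
    {x1: 1, y1: 0, x2: 1, y2: 1, isStroked: true, strokeColor: black},
    {x1: 1, y1: 1, x2: 0, y2: 1, isStroked: true, strokeColor: black},
    {x1: 0, y1: 1, x2: 0, y2: 0, isStroked: true, strokeColor: black},
}}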

For images, this should include inline images, XObject images, and possibly some shadings. UniDoc already has a pretty good framework for this.
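
A corresponding image object might look something like this sketch; the field names are assumptions, with unipdf's model.Image standing in for the decoded image data.

// ImageObject captures an image placed on the page, whether it came from
// an inline image (BI/ID/EI) or an image XObject drawn with Do.
type ImageObject struct {
    X, Y   float64      // Lower-left corner in device coordinates.
    W, H   float64      // Displayed width and height.
    Inline bool         // True for inline images, false for image XObjects.
    Image  *model.Image // Decoded image data.
}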

API ideas

func (e *Extractor) GraphicsObjects() []GraphicsObject

type GraphicsObject interface {
    // What do graphics objects have in common, or what common operations can be applied to them?
    // Possibly make this a struct rather than an interface, and convert to an interface if we think it makes sense.
}
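
One obvious commonality is a bounding box. Taking the struct route suggested in the comment above, a sketch (reusing the placeholder types from earlier) could be:

// BBox is an axis-aligned bounding box in device coordinates.
type BBox struct {
    Llx, Lly, Urx, Ury float64
}

// GraphicsObject as a struct: a tagged union over the three kinds,
// where exactly one of Text, Path, Image is non-nil.
type GraphicsObject struct {
    BBox  BBox
    Text  *TextElement
    Path  *Path
    Image *ImageObject
}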

The rendering would be over all graphics objects on a page in the order they occur. This would be driven by a single processor.AddHandler() that could be configured to emit any combination of text, shape, and image objects.

func renderCore(doText, doShapes, doImages bool, render Renderer)

Alternatively, pass a rendering context/state rather than the doX... flags, as sketched below.
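
A sketch of the context-driven variant, built on the struct sketch above; RenderConfig and Renderer are assumed names, not the current processor API.

// RenderConfig replaces the doText/doShapes/doImages flags.
type RenderConfig struct {
    Text, Shapes, Images bool
}

// Renderer receives graphics objects in the order they occur on the page.
type Renderer interface {
    Render(obj GraphicsObject) error
}

func renderCore(cfg RenderConfig, objects []GraphicsObject, render Renderer) error {
    for _, obj := range objects {
        // Skip object kinds the config excludes.
        if (obj.Text != nil && !cfg.Text) ||
            (obj.Path != nil && !cfg.Shapes) ||
            (obj.Image != nil && !cfg.Images) {
            continue
        }
        if err := render.Render(obj); err != nil {
            return err
        }
    }
    return nil
}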

Use cases

Potential use cases that should be possible to base on this implementation:

  1. Find text/shapes/images within a specified area (see the sketch after this list).
  2. Remove/redact text/shapes/images within a specified area.
  3. Characterize headings and normal text.
  4. Detect tables and their inner contents.
  5. Detect mathematical formulas.
  6. PDF to markdown conversion: requires basic heading detection, text style, tables.
  7. PDF to Word/Excel: requires advanced detection of detailed features to reproduce in OOXML.
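
For example, use case 1 reduces to a bounding-box intersection test over the extracted objects. A sketch, using the BBox type from above:

// Intersects reports whether two bounding boxes overlap.
func (b BBox) Intersects(o BBox) bool {
    return b.Llx <= o.Urx && o.Llx <= b.Urx &&
        b.Lly <= o.Ury && o.Lly <= b.Ury
}

// ObjectsInArea returns the graphics objects overlapping area.
func ObjectsInArea(objects []GraphicsObject, area BBox) []GraphicsObject {
    var hits []GraphicsObject
    for _, obj := range objects {
        if obj.BBox.Intersects(area) {
            hits = append(hits, obj)
        }
    }
    return hits
}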

Going from the primitive content stream operands to a higher-level representation, there needs to be a connection from the higher-level representation back down to the lower level. For example, when removing content, we may need to filter on a higher-level basis but follow the connection down to the primitive operands to actually filter those out.

There may be a cascade/sequence of processing operations, initially on the primitive operands, for example grouping.

It should be clear whether those processes are lossy or lossless. Lossless means they could reproduce exactly the same operands as the original, and hence the same appearance; lossy means some aspect is lost, for example when grouping text together, character spacing/kerning information could be lost.

Preferably all processing would have the capability to be lossless, but it remains to be seen whether that is practical.
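
One way to keep that connection lossless is for each higher-level object to record the range of content stream operands it was built from; removal then filters operand indices and re-emits the remaining operands unchanged. A sketch, with assumed names:

// OperandRange ties a higher-level object back to the content stream:
// the object was built from operands [First, Last] of the page's operand list.
type OperandRange struct {
    First, Last int
}

// FilterOperands returns the indices of operands to keep after removing
// the given ranges. Re-emitting the kept operands byte-for-byte unchanged
// keeps the operation lossless for everything that remains.
func FilterOperands(numOps int, removed []OperandRange) []int {
    drop := make(map[int]bool)
    for _, r := range removed {
        for i := r.First; i <= r.Last; i++ {
            drop[i] = true
        }
    }
    var kept []int
    for i := 0; i < numOps; i++ {
        if !drop[i] {
            kept = append(kept, i)
        }
    }
    return kept
}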

gunnsth commented 5 years ago

@peterwilliams97 PR #256 is expected to get us through the first type: vectorized text. Correct? Does that fully include the paragraph fragments discussed?

gunnsth commented 5 years ago

Related #287 - Extractor api prepared for v3

peterwilliams97 commented 5 years ago

We have some PDF full-text search code that is unfortunately in a private repo. The idea behind it is simple. We keep track of the textMarks that were used to build each page of text. When we break up the page text in any way, we find the textMarks that correspond to the selected text and compute the enclosing bounding box. The selected text is defined by its start and end offsets in the page text: selected := text[start:end+1]. We sort the textMarks by offset, so computing the enclosing box is just two binary searches to find the interval of bounding boxes, then taking the minimum of the bounding boxes' lower-lefts and the maximum of their upper-rights over that interval.
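
A sketch of how that could look in code; the TextMark fields here are assumptions based on the description above, and BBox is the placeholder type from the earlier sketch.

import (
    "math"
    "sort"
)

// TextMark ties a run of extracted page text back to its position on the page.
type TextMark struct {
    Offset int  // Offset of this mark's text within the page text.
    End    int  // Offset just past this mark's text.
    BBox   BBox // Bounding box of the mark in device coordinates.
}

// SelectionBBox computes the box enclosing text[start:end+1], given marks
// sorted by Offset. It assumes the selection is covered by at least one mark.
func SelectionBBox(marks []TextMark, start, end int) BBox {
    // Two binary searches locate the interval of marks covering the selection.
    lo := sort.Search(len(marks), func(i int) bool { return marks[i].End > start })
    hi := sort.Search(len(marks), func(i int) bool { return marks[i].Offset > end })

    // Union the interval's boxes: min of lower-lefts, max of upper-rights.
    box := marks[lo].BBox
    for _, m := range marks[lo+1 : hi] {
        box.Llx = math.Min(box.Llx, m.BBox.Llx)
        box.Lly = math.Min(box.Lly, m.BBox.Lly)
        box.Urx = math.Max(box.Urx, m.BBox.Urx)
        box.Ury = math.Max(box.Ury, m.BBox.Ury)
    }
    return box
}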