unidoc / unipdf

Golang PDF library for creating and processing PDF files (pure go)
https://unidoc.io

Vectorized PDF text and object extraction #35

Open gunnsth opened 5 years ago

gunnsth commented 5 years ago

This issue is a master issue/epic and may spawn sub-issues that will be referenced from here.

Proposal

The extractor package will have the capability to extract vectorized text and objects (with position and dimensions).

Goal: Extract a list of graphics objects from each PDF page.

There are three types of graphics objects: text, paths (shapes), and images.

Each of these objects has a position and dimensions (a bounding box) on the page.

This is not a rendering system, but we hope to design it in a way that allows it to be extended into a renderer. Initial versions of the renderer could convert the lists of graphics objects to PDF or PostScript pages, which would provide closed-loop tests.

Definitions

There are at least three levels of text objects, all of which are composed of lower-level (lower-numbered in the following list) objects; a rough type sketch follows the list.

  1. Text elements emitted by the renderer as a result of PDF text operators like Tj.
     a. A text element’s properties include the text content, location and size in device coordinates, font, etc.
     b. Text elements can be used to recreate the text as it appears on the page.
  2. Paragraph fragments are created from the text elements on a page. Each paragraph fragment occupies a contiguous region on a single page.
     a. Paragraph fragments include the start of a paragraph that is completed on the following page/column, captions, form field labels, footnotes, etc.
     b. The paragraph fragments on a page can be used to make inferences about the page.
  3. Paragraphs are created from the paragraph fragments.
     a. Paragraphs can be used to extract the text of a PDF in plain-text format.
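
As a rough illustration of the hierarchy, the three levels could map to types like the following. This is a sketch only; TextElement, ParagraphFragment, and Paragraph are placeholder names, not an agreed API.

// Sketch only: placeholder types for the three text-object levels.

// TextElement corresponds to a single PDF text-showing operator such as Tj.
type TextElement struct {
    Text string  // Decoded text content.
    X, Y float64 // Location in device coordinates.
    W, H float64 // Size in device coordinates.
    Font string  // Font name; a real implementation would carry full font info.
}

// ParagraphFragment groups text elements into a contiguous region on a single page.
type ParagraphFragment struct {
    Elements []TextElement
}

// Paragraph joins paragraph fragments, possibly across pages or columns.
type Paragraph struct {
    Fragments []ParagraphFragment
}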

Initially we will only concern ourselves with stroked and filled paths, ignoring clipping paths.

// Path can define shapes, trajectories and regions of all sorts. Used to draw lines and define shapes of filled areas.
type Path struct {
    segments []lineSegment
}

// Only export if deemed necessary for outside access.
// For connected subpaths (segments), the x1, y1 coordinate will start at the x2, y2 coordinate of the previous segment.
type lineSegment struct {
    isCurved bool // Bezier curve if true, otherwise line.
    x1, y1   float64
    x2, y2   float64
    cx, cy   float64 // Control point (if curved).

    isNoop      bool // Path ended without filling/stroking.
    isStroked   bool
    strokeColor model.PdfColor
    isFilled    bool
    fillColor   model.PdfColor
    fillRule    windingNumberRule
}

type windingNumberRule int

const (
    nonZeroWindingNumberRule windingNumberRule = iota
    evenOddWindingNumberRule
)
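
As a usage example, a stroked unit square could be represented as four connected segments. This is a hypothetical construction over the types above, using unipdf's model.NewPdfColorDeviceGray for the stroke color.

// A stroked unit square starting at the origin. Each segment's (x1, y1)
// continues from the previous segment's (x2, y2).
black := model.NewPdfColorDeviceGray(0)
square := Path{segments: []lineSegment{
    {x1: 0, y1: 0, x2: 1, y2: 0, isStroked: true, strokeColor: black},
    {x1: 1, y1: 0, x2: 1, y2: 1, isStroked: true, strokeColor: black},
    {x1: 1, y1: 1, x2: 0, y2: 1, isStroked: true, strokeColor: black},
    {x1: 0, y1: 1, x2: 0, y2: 0, isStroked: true, strokeColor: black},
}}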

For images, this should include inline images, XObject images, and possibly some shadings. UniDoc already has a pretty good framework for this.
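
A corresponding image object might look something like this sketch; the field names are assumptions, with unipdf's model.Image standing in for the decoded image data.

// ImageObject captures an image placed on the page, whether it came from
// an inline image (BI/ID/EI) or an image XObject drawn with Do.
type ImageObject struct {
    X, Y   float64      // Lower-left corner in device coordinates.
    W, H   float64      // Displayed width and height.
    Inline bool         // True for inline images, false for image XObjects.
    Image  *model.Image // Decoded image data.
}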

API ideas

func (e *Extractor) GraphicsObjects() []GraphicsObject

type GraphicsObject interface {
    // What do graphics objects have in common, or what common operations can be applied to them?
    // Possibly make this a struct rather than an interface, and convert to an interface if we think it makes sense.
}
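
One obvious commonality is a bounding box. Taking the struct route suggested in the comment above, a sketch (reusing the placeholder types from earlier) could be:

// BBox is an axis-aligned bounding box in device coordinates.
type BBox struct {
    Llx, Lly, Urx, Ury float64
}

// GraphicsObject as a struct: a tagged union over the three kinds,
// where exactly one of Text, Path, Image is non-nil.
type GraphicsObject struct {
    BBox  BBox
    Text  *TextElement
    Path  *Path
    Image *ImageObject
}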

The rendering would be over all graphics objects on a page in the order they occur. This would be driven by a single processor.AddHandler() that could be configured to emit any combination of text, shape, and image objects.

func renderCore(doText, doShapes, doImages bool, render Renderer)

Alternatively, pass a rendering context/state rather than the doX... flags, as sketched below.
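
A sketch of the context-driven variant, built on the struct sketch above; RenderConfig and Renderer are assumed names, not the current processor API.

// RenderConfig replaces the doText/doShapes/doImages flags.
type RenderConfig struct {
    Text, Shapes, Images bool
}

// Renderer receives graphics objects in the order they occur on the page.
type Renderer interface {
    Render(obj GraphicsObject) error
}

func renderCore(cfg RenderConfig, objects []GraphicsObject, render Renderer) error {
    for _, obj := range objects {
        // Skip object kinds the config excludes.
        if (obj.Text != nil && !cfg.Text) ||
            (obj.Path != nil && !cfg.Shapes) ||
            (obj.Image != nil && !cfg.Images) {
            continue
        }
        if err := render.Render(obj); err != nil {
            return err
        }
    }
    return nil
}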

Use cases

Potential use cases that should be possible to base on this implementation:

  1. Find text/shapes/images within a specified area (see the sketch after this list).
  2. Remove/redact text/shapes/images within a specified area.
  3. Characterize headings and normal text.
  4. Detect tables and their inner contents.
  5. Detect mathematical formulas.
  6. PDF to markdown conversion: requires basic heading detection, text style, tables.
  7. PDF to Word/Excel: requires advanced detection of detailed features to reproduce in OOXML.
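
For example, use case 1 reduces to a bounding-box intersection test over the extracted objects. A sketch, using the BBox type from above:

// Intersects reports whether two bounding boxes overlap.
func (b BBox) Intersects(o BBox) bool {
    return b.Llx <= o.Urx && o.Llx <= b.Urx &&
        b.Lly <= o.Ury && o.Lly <= b.Ury
}

// ObjectsInArea returns the graphics objects overlapping area.
func ObjectsInArea(objects []GraphicsObject, area BBox) []GraphicsObject {
    var hits []GraphicsObject
    for _, obj := range objects {
        if obj.BBox.Intersects(area) {
            hits = append(hits, obj)
        }
    }
    return hits
}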

Going from the primitive content stream operands to a higher-level representation, there needs to be a connection from the higher-level representation back down to the lower level. For example, when removing content, we may need to filter on a higher-level basis but follow the connection down to the primitive operands to actually filter those out.

There may be a cascade/sequence of processing operations, initially on the primitive operands, for example grouping.

It should be clear whether those processes are lossy or lossless. Lossless means they could reproduce exactly the same operands as the original, and hence the same appearance; lossy means some aspect is lost, for example when grouping text together, character spacing/kerning information could be lost.

Preferably all processing would have the capability to be lossless, but it remains to be seen whether that is practical.
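
One way to keep that connection lossless is for each higher-level object to record the range of content stream operands it was built from; removal then filters operand indices and re-emits the remaining operands unchanged. A sketch, with assumed names:

// OperandRange ties a higher-level object back to the content stream:
// the object was built from operands [First, Last] of the page's operand list.
type OperandRange struct {
    First, Last int
}

// FilterOperands returns the indices of operands to keep after removing
// the given ranges. Re-emitting the kept operands byte-for-byte unchanged
// keeps the operation lossless for everything that remains.
func FilterOperands(numOps int, removed []OperandRange) []int {
    drop := make(map[int]bool)
    for _, r := range removed {
        for i := r.First; i <= r.Last; i++ {
            drop[i] = true
        }
    }
    var kept []int
    for i := 0; i < numOps; i++ {
        if !drop[i] {
            kept = append(kept, i)
        }
    }
    return kept
}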

gunnsth commented 5 years ago

@peterwilliams97 PR #256 is expected to get us through the first type: vectorized text. Correct? Does that fully include the paragraph fragments discussed?

gunnsth commented 5 years ago

Related #287 - Extractor api prepared for v3

peterwilliams97 commented 5 years ago

We have some PDF full-text search code that is unfortunately in a private repo. The idea behind it is simple. We keep track of the textMarks that were used to build each page of text. When we break up the page text in any way, we find the textMarks that correspond to the selected text and compute the enclosing bounding box. The selected text is defined by its start and end offsets in the page text: selected := text[start:end+1]. We sort the textMarks by offset, so computing the enclosing box is just two binary searches to find the interval of bounding boxes, then taking the minimum of the bounding boxes' lower-lefts and the maximum of their upper-rights over that interval.
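
A sketch of how that could look in code; the TextMark fields here are assumptions based on the description above, and BBox is the placeholder type from the earlier sketch.

import (
    "math"
    "sort"
)

// TextMark ties a run of extracted page text back to its position on the page.
type TextMark struct {
    Offset int  // Offset of this mark's text within the page text.
    End    int  // Offset just past this mark's text.
    BBox   BBox // Bounding box of the mark in device coordinates.
}

// SelectionBBox computes the box enclosing text[start:end+1], given marks
// sorted by Offset. It assumes the selection is covered by at least one mark.
func SelectionBBox(marks []TextMark, start, end int) BBox {
    // Two binary searches locate the interval of marks covering the selection.
    lo := sort.Search(len(marks), func(i int) bool { return marks[i].End > start })
    hi := sort.Search(len(marks), func(i int) bool { return marks[i].Offset > end })

    // Union the interval's boxes: min of lower-lefts, max of upper-rights.
    box := marks[lo].BBox
    for _, m := range marks[lo+1 : hi] {
        box.Llx = math.Min(box.Llx, m.BBox.Llx)
        box.Lly = math.Min(box.Lly, m.BBox.Lly)
        box.Urx = math.Max(box.Urx, m.BBox.Urx)
        box.Ury = math.Max(box.Ury, m.BBox.Ury)
    }
    return box
}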