UniPDF v4 - Proposals - Githubissues

gunnsth commented 4 years ago

Introduction

The idea of this ticket is to collect all ideas that would make sense for the next major version (v4). This is a chance to consider major updates.

Ideas related to performance improvements should include benchmarks that clearly show potential advantages.

Ideas for refactoring need to clearly state the advantages and also what it means for the user. If there is a breaking change, there should be an easy way to update code.

Ideas for reducing binary sizes are also welcome, although it's important that the API remains easy to use.

Related issues

There are already a few relevant issues:

49
55
19 - For this one it might make sense to create a generalized temp storage interface that the user can provide and we can have default implementations (memory default/on disk secondary or hybrid). Recently did a similar thing in UniOffice.
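
A minimal sketch of what such a temp storage interface could look like, with an in-memory and an on-disk implementation. The names `TempStore`, `MemStore`, `DiskStore` and the `Put`/`Get`/`Close` methods are assumptions made for illustration; they are not taken from UniPDF or UniOffice.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// TempStore is a hypothetical interface for user-provided temporary storage.
type TempStore interface {
	Put(key string, data []byte) error
	Get(key string) ([]byte, error)
	Close() error
}

// ErrNotFound is returned when a key has not been stored.
var ErrNotFound = errors.New("tempstore: key not found")

// MemStore keeps everything in memory (the suggested default).
type MemStore struct{ items map[string][]byte }

func NewMemStore() *MemStore { return &MemStore{items: map[string][]byte{}} }

func (m *MemStore) Put(key string, data []byte) error {
	m.items[key] = append([]byte(nil), data...) // copy so the caller may reuse data
	return nil
}

func (m *MemStore) Get(key string) ([]byte, error) {
	data, ok := m.items[key]
	if !ok {
		return nil, ErrNotFound
	}
	return data, nil
}

// Close drops all stored items.
func (m *MemStore) Close() error {
	m.items = map[string][]byte{}
	return nil
}

// DiskStore writes each item to a file in a private temporary directory.
type DiskStore struct{ dir string }

func NewDiskStore() (*DiskStore, error) {
	dir, err := os.MkdirTemp("", "tempstore-")
	if err != nil {
		return nil, err
	}
	return &DiskStore{dir: dir}, nil
}

// path hex-encodes the key so that it is always a safe file name.
func (d *DiskStore) path(key string) string {
	return filepath.Join(d.dir, hex.EncodeToString([]byte(key)))
}

func (d *DiskStore) Put(key string, data []byte) error {
	return os.WriteFile(d.path(key), data, 0o600)
}

func (d *DiskStore) Get(key string) ([]byte, error) {
	data, err := os.ReadFile(d.path(key))
	if errors.Is(err, os.ErrNotExist) {
		return nil, ErrNotFound
	}
	return data, err
}

// Close removes the temporary directory and everything in it.
func (d *DiskStore) Close() error { return os.RemoveAll(d.dir) }

func main() {
	disk, err := NewDiskStore()
	if err != nil {
		panic(err)
	}
	// Both implementations are used through the same interface.
	for _, store := range []TempStore{NewMemStore(), disk} {
		if err := store.Put("page-1", []byte("stream data")); err != nil {
			panic(err)
		}
		data, err := store.Get("page-1")
		fmt.Printf("%T: %q %v\n", store, data, err)
		store.Close()
	}
}
```

A hybrid store could wrap both and spill to disk above a size threshold; that part is left out of the sketch.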

gunnsth commented 4 years ago

contentstream.ContentStreamOperations should be a struct containing an array, not a typed slice as it is currently:

type ContentStreamOperations []*ContentStreamOperation

It's not fun to work with those typed slices, since iterating through an arbitrary type is kinda messy. Better to have something like cs.Elements() etc., as is already done with core.PdfObjectArray. It also adds the flexibility to attach extra data to the struct that can be useful.
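
A rough sketch of the struct-based shape, assuming a simplified `ContentStreamOperation` with string parameters (a stand-in for the real type). The constructor and the `Append`, `Len` and `Elements` methods are hypothetical names modelled on the cs.Elements() idea above.

```go
package main

import "fmt"

// ContentStreamOperation is a simplified stand-in for the real type.
type ContentStreamOperation struct {
	Operand string
	Params  []string
}

// ContentStreamOperations wraps the slice in a struct, so extra data
// can be added later without changing how callers iterate.
type ContentStreamOperations struct {
	ops []*ContentStreamOperation
}

// NewContentStreamOperations builds the wrapper from zero or more operations.
func NewContentStreamOperations(ops ...*ContentStreamOperation) *ContentStreamOperations {
	return &ContentStreamOperations{ops: ops}
}

// Append adds an operation at the end.
func (c *ContentStreamOperations) Append(op *ContentStreamOperation) {
	c.ops = append(c.ops, op)
}

// Len returns the number of operations.
func (c *ContentStreamOperations) Len() int { return len(c.ops) }

// Elements returns the underlying slice for iteration.
func (c *ContentStreamOperations) Elements() []*ContentStreamOperation {
	return c.ops
}

func main() {
	cs := NewContentStreamOperations(
		&ContentStreamOperation{Operand: "BT"},
	)
	cs.Append(&ContentStreamOperation{Operand: "Tj", Params: []string{"(Hello)"}})

	for _, op := range cs.Elements() {
		fmt.Println(op.Operand, op.Params)
	}
}
```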

gunnsth commented 4 years ago

With support for 1 character code <-> multiple runes (string) in CMaps, it makes sense to update our text encoder interfaces in the future. Currently we have

// TextEncoder defines the common methods that a text encoder implementation must have in UniDoc.
type TextEncoder interface {
    // String returns a string that describes the TextEncoder instance.
    String() string

    // Encode converts the Go unicode string to a PDF encoded string.
    Encode(str string) []byte

    // Decode converts PDF encoded string to a Go unicode string.
    Decode(raw []byte) string

    // RuneToCharcode returns the PDF character code corresponding to rune `r`.
    // The bool return flag is true if there was a match, and false otherwise.
    // This is usually implemented as RuneToGlyph->GlyphToCharcode
    RuneToCharcode(r rune) (CharCode, bool)

    // CharcodeToRune returns the rune corresponding to character code `code`.
    // The bool return flag is true if there was a match, and false otherwise.
    // This is usually implemented as CharcodeToGlyph->GlyphToRune
    CharcodeToRune(code CharCode) (rune, bool)

    // ToPdfObject returns a PDF Object that represents the encoding.
    ToPdfObject() core.PdfObject
}

It would make sense to have charcode -> string and string -> charcode methods, or maybe ones that process multiple codes at once instead of single ones.
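
As an illustration only, string-based methods could look roughly like this. `StringEncoder`, `CharcodeToString`, `StringToCharcode`, `DecodeCodes` and the map-backed implementation are hypothetical names, `CharCode` is a local stand-in type, and the choice to emit U+FFFD for unmapped codes is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// CharCode is a local stand-in for the library's character code type.
type CharCode uint32

// StringEncoder is a hypothetical string-based counterpart to the
// rune-based methods of TextEncoder.
type StringEncoder interface {
	// CharcodeToString returns the string (one or more runes) for a code.
	CharcodeToString(code CharCode) (string, bool)
	// StringToCharcode returns the code that maps to exactly the string s.
	StringToCharcode(s string) (CharCode, bool)
	// DecodeCodes converts a sequence of codes in one call.
	DecodeCodes(codes []CharCode) string
}

type mapEncoder struct {
	toString map[CharCode]string
	toCode   map[string]CharCode
}

// NewMapEncoder builds an encoder from a code -> string table. If two codes
// map to the same string, which one the reverse lookup returns is unspecified.
func NewMapEncoder(table map[CharCode]string) StringEncoder {
	e := &mapEncoder{toString: table, toCode: make(map[string]CharCode, len(table))}
	for code, s := range table {
		e.toCode[s] = code
	}
	return e
}

func (e *mapEncoder) CharcodeToString(code CharCode) (string, bool) {
	s, ok := e.toString[code]
	return s, ok
}

func (e *mapEncoder) StringToCharcode(s string) (CharCode, bool) {
	code, ok := e.toCode[s]
	return code, ok
}

func (e *mapEncoder) DecodeCodes(codes []CharCode) string {
	var sb strings.Builder
	for _, code := range codes {
		if s, ok := e.toString[code]; ok {
			sb.WriteString(s)
		} else {
			sb.WriteRune('\uFFFD') // assumption: mark unmapped codes
		}
	}
	return sb.String()
}

func main() {
	// Code 1 stands for a ligature glyph that expands to three runes.
	enc := NewMapEncoder(map[CharCode]string{1: "ffi", 2: "A"})
	fmt.Println(enc.DecodeCodes([]CharCode{2, 1}))
}
```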

peterwilliams97 commented 4 years ago

Extractor.ExtractPageText() returns two statistics that I don't think anyone uses or will ever use. Can we replace it with a function like Extractor.Extract() (*PageText, error)?

peterwilliams97 commented 4 years ago

Text extraction is now aware of paragraph and line structure. We can therefore write a search function that returns the bounding boxes of the line fragments of the matching text when the match spans multiple lines or multiple paragraphs.
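
A simplified sketch of such a search. It assumes each extracted line carries its text and bounding box, that consecutive lines are joined by a single space, and that characters have a uniform advance (so fragment boxes are interpolated by byte offset). The `Rect`, `Line` and `Search` names are hypothetical, and only the first match is returned.

```go
package main

import (
	"fmt"
	"strings"
)

// Rect is a bounding box in page coordinates.
type Rect struct{ Llx, Lly, Urx, Ury float64 }

// Line is one extracted text line with its bounding box.
type Line struct {
	Text string
	BBox Rect
}

// Search returns one box per line fragment covered by the first match
// of query. A match that spans several lines yields several boxes.
func Search(lines []Line, query string) []Rect {
	if query == "" {
		return nil
	}
	// Join the lines and remember where each one starts.
	var sb strings.Builder
	starts := make([]int, len(lines))
	for i, l := range lines {
		if i > 0 {
			sb.WriteByte(' ')
		}
		starts[i] = sb.Len()
		sb.WriteString(l.Text)
	}
	begin := strings.Index(sb.String(), query)
	if begin < 0 {
		return nil
	}
	end := begin + len(query)

	var boxes []Rect
	for i, l := range lines {
		// Clip the match range to this line's local byte offsets.
		lo, hi := begin-starts[i], end-starts[i]
		if lo < 0 {
			lo = 0
		}
		if hi > len(l.Text) {
			hi = len(l.Text)
		}
		if lo >= hi {
			continue // the match does not touch this line
		}
		width := l.BBox.Urx - l.BBox.Llx
		n := float64(len(l.Text))
		boxes = append(boxes, Rect{
			Llx: l.BBox.Llx + width*float64(lo)/n,
			Lly: l.BBox.Lly,
			Urx: l.BBox.Llx + width*float64(hi)/n,
			Ury: l.BBox.Ury,
		})
	}
	return boxes
}

func main() {
	lines := []Line{
		{Text: "the quick", BBox: Rect{0, 90, 90, 100}},
		{Text: "brown fox", BBox: Rect{0, 70, 90, 80}},
	}
	for _, box := range Search(lines, "quick brown") {
		fmt.Printf("%+v\n", box)
	}
}
```

A real implementation would use per-glyph positions from the extractor rather than interpolation.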

progamer71 commented 4 years ago

Support creating and managing PDF/A3 documents with file attachments.

gunnsth commented 4 years ago

@progamer71 That is on our radar, but it is not what this ticket is about. This is about API compatibility and possible major changes in the upcoming v4. PDF/A3 is not part of our API yet, so it is not a concern here. It would make sense to create a new issue for that, with more details, if there is not one already. See https://github.com/unidoc/unipdf/issues/11

gunnsth commented 4 years ago

NewPdfFontFromTTFFile and NewCompositePdfFontFromTTFFile are a bit confusing. Users often try NewPdfFontFromTTFFile and then use symbols that are not in the simple encoding, which then do not display. It would be nice if NewPdfFontFromTTFFile could handle this, so that the second function would not be needed.

gunnsth commented 4 years ago

In V4: We should change content stream processing. Currently we have

func (p *PdfPage) GetAllContentStreams() (string, error) {

which returns a string. The problem with this is that the content streams can get very big, and working with them as a string leads to copying, which is inefficient and memory intensive.

It may be feasible to create a new type in contentstream called ContentStream to represent the content stream, so that it can be worked with as a byte slice and copying is avoided unless absolutely necessary.
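
One possible shape for such a type, as a sketch. The constructor and the `Bytes`, `Len`, `Reader` and `String` methods are assumptions, not an agreed design; the point is that only `String` copies the data.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// ContentStream holds the content stream data as a byte slice.
type ContentStream struct {
	data []byte
}

// NewContentStream wraps data without copying it.
func NewContentStream(data []byte) *ContentStream {
	return &ContentStream{data: data}
}

// Bytes returns the underlying slice; it shares memory with the stream.
func (cs *ContentStream) Bytes() []byte { return cs.data }

// Len returns the size of the stream in bytes.
func (cs *ContentStream) Len() int { return len(cs.data) }

// Reader returns a reader over the data without copying it.
func (cs *ContentStream) Reader() io.Reader { return bytes.NewReader(cs.data) }

// String copies the data into a new string; use only when a string is needed.
func (cs *ContentStream) String() string { return string(cs.data) }

func main() {
	cs := NewContentStream([]byte("BT /F1 12 Tf (Hello) Tj ET"))
	// Searching works directly on the bytes, with no string conversion.
	fmt.Println(cs.Len(), bytes.Contains(cs.Bytes(), []byte("Tj")))
}
```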

gunnsth commented 3 years ago

Deprecate creator.Paragraph in favor of creator.StyledParagraph

gunnsth commented 3 years ago

Remove model.ImageHandling or make it internal. Alternatively, it could be redesigned so that it is actually usable for providing handlers for loading images. At the moment this functionality is not well maintained and would need more testing.

StreamEncoders could be designed so that they can be registered, allowing an external handler to be plugged in (in particular for image handling). Returning the Decode output for images as a []byte stream (data) may not be ideal, and sometimes we load an image and convert between models multiple times, which is not efficient.
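
A sketch of a registry keyed by filter name. `StreamDecoder`, `DecoderFunc`, `RegisterDecoder` and `Decode` are hypothetical names, and the ASCIIHexDecode handler is deliberately simplified (it only trims surrounding whitespace and the trailing `>`).

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
	"sync"
)

// StreamDecoder is a hypothetical interface for pluggable stream decoders.
type StreamDecoder interface {
	Decode(encoded []byte) ([]byte, error)
}

// DecoderFunc adapts a plain function to the StreamDecoder interface.
type DecoderFunc func(encoded []byte) ([]byte, error)

func (f DecoderFunc) Decode(encoded []byte) ([]byte, error) { return f(encoded) }

var (
	mu       sync.RWMutex
	decoders = map[string]StreamDecoder{}
)

// RegisterDecoder installs (or replaces) the decoder for a filter name.
func RegisterDecoder(filter string, d StreamDecoder) {
	mu.Lock()
	defer mu.Unlock()
	decoders[filter] = d
}

// Decode looks up the decoder registered for filter and applies it.
func Decode(filter string, encoded []byte) ([]byte, error) {
	mu.RLock()
	d, ok := decoders[filter]
	mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("no decoder registered for filter %q", filter)
	}
	return d.Decode(encoded)
}

// decodeASCIIHex is a simplified handler used only to exercise the registry.
func decodeASCIIHex(encoded []byte) ([]byte, error) {
	s := strings.TrimSuffix(strings.TrimSpace(string(encoded)), ">")
	return hex.DecodeString(s)
}

func main() {
	RegisterDecoder("ASCIIHexDecode", DecoderFunc(decodeASCIIHex))
	out, err := Decode("ASCIIHexDecode", []byte("48656C6C6F>"))
	fmt.Printf("%s %v\n", out, err)
}
```

For images, a registered handler could return a decoded image type instead of raw bytes; that would need a different method signature than the one sketched here.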

gunnsth commented 3 years ago

Text extraction should have options. Possible options:

Raw -> Just get the plain (decoded) text from the content streams. Should be very fast, and output very consistent (independent of table detection algorithms). Good for benchmarking against.
Raw sorted -> Processed to sort (top-down, left-right).
Cells/Tabular -> Apply table detection to the text and group text together into cells. Final output is sorted (top-down, left-right by the grouped cells (upper left coordinate of each)).
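
The options listed above could be exposed roughly as follows. The `Mode` constants, `Options`, `Fragment` and `ExtractText` are hypothetical names; fragments are assumed to carry PDF-style coordinates (larger Y is higher on the page), and the tabular mode is left unimplemented in this sketch.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strings"
)

// Mode selects how extracted text is post-processed.
type Mode int

const (
	ModeRaw       Mode = iota // content stream order, no processing
	ModeRawSorted             // sorted top-down, left-right
	ModeTabular               // grouped into table cells
)

// Options configures text extraction.
type Options struct {
	Mode Mode
}

// Fragment is a piece of decoded text with its position on the page.
type Fragment struct {
	Text string
	X, Y float64
}

// ExtractText joins the fragments according to opts.
func ExtractText(frags []Fragment, opts Options) (string, error) {
	switch opts.Mode {
	case ModeRaw:
		// Keep content stream order.
	case ModeRawSorted:
		sorted := append([]Fragment(nil), frags...) // do not reorder the input
		sort.SliceStable(sorted, func(i, j int) bool {
			if sorted[i].Y != sorted[j].Y {
				return sorted[i].Y > sorted[j].Y // higher on the page first
			}
			return sorted[i].X < sorted[j].X
		})
		frags = sorted
	case ModeTabular:
		return "", errors.New("tabular mode is not implemented in this sketch")
	default:
		return "", fmt.Errorf("unknown mode %d", opts.Mode)
	}
	texts := make([]string, len(frags))
	for i, f := range frags {
		texts[i] = f.Text
	}
	return strings.Join(texts, " "), nil
}

func main() {
	frags := []Fragment{{"world", 50, 100}, {"footer", 0, 10}, {"hello", 0, 100}}
	for _, mode := range []Mode{ModeRaw, ModeRawSorted} {
		text, err := ExtractText(frags, Options{Mode: mode})
		fmt.Println(text, err)
	}
}
```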

gunnsth commented 3 years ago

Unify ContentStreamProcessor based on usage in render and extractor packages. Should be able to keep track of graphics and text state there in one place.
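
A small sketch of the idea: a single processor owns the graphics and text state and passes a snapshot to whatever handlers are attached, so a renderer and an extractor could share it. The `Operation`, `State`, `Handler` and `Processor` types are simplified, hypothetical stand-ins, and only the `q`, `Q`, `w` and `Tf` operators are tracked.

```go
package main

import (
	"fmt"
	"strconv"
)

// Operation is a simplified content stream operation.
type Operation struct {
	Operand string
	Params  []string
}

// State is the part of the graphics and text state tracked here.
type State struct {
	LineWidth float64
	FontName  string
	FontSize  float64
}

// Handler is called for every operation with the state after applying it.
type Handler func(op Operation, st State)

// Processor tracks state in one place and dispatches to handlers.
type Processor struct {
	state    State
	stack    []State
	handlers []Handler
}

// NewProcessor starts with the PDF default line width of 1.
func NewProcessor() *Processor { return &Processor{state: State{LineWidth: 1}} }

// AddHandler attaches a consumer such as a renderer or a text extractor.
func (p *Processor) AddHandler(h Handler) { p.handlers = append(p.handlers, h) }

// Process applies the operations in order and notifies the handlers.
func (p *Processor) Process(ops []Operation) error {
	for _, op := range ops {
		switch op.Operand {
		case "q": // save state
			p.stack = append(p.stack, p.state)
		case "Q": // restore state
			if len(p.stack) == 0 {
				return fmt.Errorf("operator %q without matching save", op.Operand)
			}
			p.state = p.stack[len(p.stack)-1]
			p.stack = p.stack[:len(p.stack)-1]
		case "w": // set line width
			if len(op.Params) != 1 {
				return fmt.Errorf("operator %q expects 1 parameter", op.Operand)
			}
			w, err := strconv.ParseFloat(op.Params[0], 64)
			if err != nil {
				return err
			}
			p.state.LineWidth = w
		case "Tf": // set font name and size
			if len(op.Params) != 2 {
				return fmt.Errorf("operator %q expects 2 parameters", op.Operand)
			}
			size, err := strconv.ParseFloat(op.Params[1], 64)
			if err != nil {
				return err
			}
			p.state.FontName, p.state.FontSize = op.Params[0], size
		}
		for _, h := range p.handlers {
			h(op, p.state)
		}
	}
	return nil
}

func main() {
	p := NewProcessor()
	p.AddHandler(func(op Operation, st State) {
		fmt.Printf("%-2s %+v\n", op.Operand, st)
	})
	err := p.Process([]Operation{
		{Operand: "q"},
		{Operand: "w", Params: []string{"5"}},
		{Operand: "Tf", Params: []string{"/F1", "12"}},
		{Operand: "Q"},
	})
	fmt.Println(err)
}
```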

unidoc / unipdf

UniPDF v4 - Proposals #337
