Open gunnsth opened 4 years ago
contentstream.ContentStreamOperations should be a struct containing a slice, not a typed slice as it is currently:
type ContentStreamOperations []*ContentStreamOperation
Typed slices like this are not fun to work with, since iterating through an arbitrary type is kind of messy. It would be better to have an accessor like cs.Elements(), as is already done with core.PdfObjectArray. A struct also adds the flexibility to attach extra data that may be useful.
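To make the proposal concrete, here is a minimal sketch of what the struct-based type could look like. The field name `ops`, the `Append` and `Len` methods, and the simplified `Params []string` are assumptions for illustration only, not the current or planned unipdf API.

```go
package main

import "fmt"

// ContentStreamOperation is simplified for this sketch: Params is a []string
// here so that the example is self-contained.
type ContentStreamOperation struct {
	Operand string
	Params  []string
}

// ContentStreamOperations wraps the slice in a struct (hypothetical layout),
// leaving room for extra data to be added later without breaking callers.
type ContentStreamOperations struct {
	ops []*ContentStreamOperation
}

// Append adds an operation to the end of the list.
func (c *ContentStreamOperations) Append(op *ContentStreamOperation) {
	c.ops = append(c.ops, op)
}

// Elements returns the underlying slice for iteration.
func (c *ContentStreamOperations) Elements() []*ContentStreamOperation {
	return c.ops
}

// Len returns the number of operations.
func (c *ContentStreamOperations) Len() int {
	return len(c.ops)
}

func main() {
	cs := &ContentStreamOperations{}
	cs.Append(&ContentStreamOperation{Operand: "BT"})
	cs.Append(&ContentStreamOperation{Operand: "Tj", Params: []string{"(Hello)"}})
	cs.Append(&ContentStreamOperation{Operand: "ET"})

	for _, op := range cs.Elements() {
		fmt.Println(op.Operand, op.Params)
	}
}
```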
With support for 1 character code <-> multiple runes (string) in CMaps, it makes sense to update our text encoder interfaces in the future. Currently we have
// TextEncoder defines the common methods that a text encoder implementation must have in UniDoc.
type TextEncoder interface {
	// String returns a string that describes the TextEncoder instance.
	String() string

	// Encode converts the Go unicode string to a PDF encoded string.
	Encode(str string) []byte

	// Decode converts PDF encoded string to a Go unicode string.
	Decode(raw []byte) string

	// RuneToCharcode returns the PDF character code corresponding to rune `r`.
	// The bool return flag is true if there was a match, and false otherwise.
	// This is usually implemented as RuneToGlyph->GlyphToCharcode
	RuneToCharcode(r rune) (CharCode, bool)

	// CharcodeToRune returns the rune corresponding to character code `code`.
	// The bool return flag is true if there was a match, and false otherwise.
	// This is usually implemented as CharcodeToGlyph->GlyphToRune
	CharcodeToRune(code CharCode) (rune, bool)

	// ToPdfObject returns a PDF Object that represents the encoding.
	ToPdfObject() core.PdfObject
}
It would make sense to have charcode -> string and string -> charcode methods, or maybe ones that process multiple codes at once instead of single ones.
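As a rough illustration of the string-based mapping, the sketch below adds two methods in a separate interface. The names `StringEncoder`, `CharcodeToString`, `StringToCharcode` and the map-backed implementation are hypothetical, and `CharCode` is assumed to be a 16-bit integer here.

```go
package main

import "fmt"

// CharCode is assumed to be a 16-bit code for the purposes of this sketch.
type CharCode uint16

// StringEncoder is a hypothetical interface with charcode <-> string methods,
// so that one character code can map to several runes (e.g. a ligature).
type StringEncoder interface {
	CharcodeToString(code CharCode) (string, bool)
	StringToCharcode(s string) (CharCode, bool)
}

// mapEncoder is a toy implementation backed by two lookup tables.
type mapEncoder struct {
	toString map[CharCode]string
	toCode   map[string]CharCode
}

// newMapEncoder builds the reverse table from the forward table.
func newMapEncoder(m map[CharCode]string) *mapEncoder {
	e := &mapEncoder{toString: m, toCode: make(map[string]CharCode, len(m))}
	for code, s := range m {
		e.toCode[s] = code
	}
	return e
}

func (e *mapEncoder) CharcodeToString(code CharCode) (string, bool) {
	s, ok := e.toString[code]
	return s, ok
}

func (e *mapEncoder) StringToCharcode(s string) (CharCode, bool) {
	code, ok := e.toCode[s]
	return code, ok
}

// Compile-time check that mapEncoder satisfies the interface.
var _ StringEncoder = (*mapEncoder)(nil)

func main() {
	// 0xFB01 stands in for a ligature code that expands to two runes.
	enc := newMapEncoder(map[CharCode]string{0x41: "A", 0xFB01: "fi"})

	s, ok := enc.CharcodeToString(0xFB01)
	fmt.Println(s, ok)

	code, ok := enc.StringToCharcode("fi")
	fmt.Printf("%#x %v\n", code, ok)
}
```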
Extractor.ExtractPageText() returns two statistics that I don't think anyone uses or ever will use.
Can we replace it with a function like Extractor.Extract() (*PageText, error)?
Text extraction is now aware of paragraph and line structure. We can therefore write a search function that returns the bounding boxes of the line fragments of the matching text, including when the match spans multiple lines or multiple paragraphs.
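A minimal sketch of such a search, assuming the extractor can hand back lines with their text and bounding boxes. The `Rect`, `Line`, `Fragment` types and the `Search` function are hypothetical; the sketch only finds the first match, joins lines with a single space, and reports the whole line's box for each fragment rather than a tight box.

```go
package main

import (
	"fmt"
	"strings"
)

// Rect is a simple bounding box (hypothetical type for this sketch).
type Rect struct{ Llx, Lly, Urx, Ury float64 }

// Line is one extracted line of text with its bounding box.
type Line struct {
	Text string
	BBox Rect
}

// Fragment is the part of one line covered by a match.
type Fragment struct {
	LineIndex  int
	Start, End int  // byte offsets within Line.Text
	BBox       Rect // box of the whole line (a simplification)
}

// Search returns the line fragments covered by the first occurrence of query.
func Search(lines []Line, query string) []Fragment {
	if query == "" {
		return nil
	}
	// Join the lines with single spaces, remembering where each line starts.
	var sb strings.Builder
	starts := make([]int, len(lines))
	for i, l := range lines {
		if i > 0 {
			sb.WriteByte(' ')
		}
		starts[i] = sb.Len()
		sb.WriteString(l.Text)
	}
	pos := strings.Index(sb.String(), query)
	if pos < 0 {
		return nil
	}
	end := pos + len(query)

	// Intersect the match range with each line's range.
	var frags []Fragment
	for i, l := range lines {
		s, e := starts[i], starts[i]+len(l.Text)
		if pos > s {
			s = pos
		}
		if end < e {
			e = end
		}
		if s < e {
			frags = append(frags, Fragment{
				LineIndex: i,
				Start:     s - starts[i],
				End:       e - starts[i],
				BBox:      l.BBox,
			})
		}
	}
	return frags
}

func main() {
	lines := []Line{
		{Text: "The quick brown", BBox: Rect{72, 700, 200, 712}},
		{Text: "fox jumps over", BBox: Rect{72, 686, 190, 698}},
	}
	for _, f := range Search(lines, "brown fox") {
		fmt.Println(f.LineIndex, lines[f.LineIndex].Text[f.Start:f.End], f.BBox)
	}
}
```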
Please support creating and managing PDF/A3 documents with file attachments.
@progamer71 That is on our radar, but that is not what this ticket is about. This is about API compatibility and possible major changes in the upcoming v4. PDF/A3 is not part of our API yet, so it is not a concern here. It would make sense to create a new issue for that, with more details, if there is not one already. See https://github.com/unidoc/unipdf/issues/11
NewPdfFontFromTTFFile and NewCompositePdfFontFromTTFFile are a bit confusing. Users often try NewPdfFontFromTTFFile and then use symbols that are not in the simple encoding, so those symbols do not display.
It would be nice if NewPdfFontFromTTFFile could handle this case, so that the second function would not be needed.
In v4 we should change content stream processing. Currently we have
func (p *PdfPage) GetAllContentStreams() (string, error) {
which returns a string. The problem is that content streams can get very big, and working with them as strings leads to copying, which is inefficient and memory intensive.
Creating a new type called ContentStream in the contentstream package to represent the content stream may be feasible. It could be worked with as a byte slice, avoiding copies unless absolutely necessary.
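A rough sketch of the idea follows. The constructor `NewContentStream` and the methods `Bytes`, `Len` and `String` are assumptions for illustration; the point is only that callers can work on the shared byte slice and pay for a string copy solely when they ask for one.

```go
package main

import (
	"bytes"
	"fmt"
)

// ContentStream holds the page content as bytes (hypothetical type).
type ContentStream struct {
	data []byte
}

// NewContentStream joins the individual streams with newlines in a single
// allocation.
func NewContentStream(parts ...[]byte) *ContentStream {
	return &ContentStream{data: bytes.Join(parts, []byte("\n"))}
}

// Bytes returns the underlying slice without copying. Callers must not
// modify it unless they own the ContentStream.
func (c *ContentStream) Bytes() []byte { return c.data }

// Len returns the size of the content in bytes.
func (c *ContentStream) Len() int { return len(c.data) }

// String copies the content into a string. It exists for convenience and
// debugging and should be avoided on large streams.
func (c *ContentStream) String() string { return string(c.data) }

func main() {
	cs := NewContentStream([]byte("BT /F1 12 Tf"), []byte("(Hello) Tj ET"))

	// Scan the bytes directly; no string conversion is needed.
	fmt.Println(cs.Len(), bytes.Contains(cs.Bytes(), []byte("Tj")))
}
```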
Deprecate creator.Paragraph in favor of creator.StyledParagraph.
Remove model.ImageHandling or make it internal. Alternatively, it could be redesigned so that it is actually usable for providing handlers for loading images. At the moment this functionality is not well maintained and would need more testing.
StreamEncoders could be designed so that they can be registered, allowing an external handler to be plugged in (in particular for image handling). Returning decoded image data as a []byte stream may not be ideal, and sometimes we load an image and convert between models multiple times, which is inefficient.
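One possible shape for such a registry is sketched below. The `StreamDecoder` interface, the `RegisterDecoder` and `LookupDecoder` functions, and the filter name "ExampleFilter" are all hypothetical; the sketch uses a single-method interface only to keep the example short.

```go
package main

import (
	"fmt"
	"sync"
)

// StreamDecoder is a hypothetical single-method interface for this sketch.
type StreamDecoder interface {
	Decode(encoded []byte) ([]byte, error)
}

var (
	mu       sync.RWMutex
	decoders = map[string]StreamDecoder{}
)

// RegisterDecoder installs (or replaces) the handler for a filter name.
func RegisterDecoder(filter string, d StreamDecoder) {
	mu.Lock()
	defer mu.Unlock()
	decoders[filter] = d
}

// LookupDecoder returns the handler registered for a filter name, if any.
func LookupDecoder(filter string) (StreamDecoder, bool) {
	mu.RLock()
	defer mu.RUnlock()
	d, ok := decoders[filter]
	return d, ok
}

// passthroughDecoder stands in for an external handler; it returns a copy
// of its input unchanged.
type passthroughDecoder struct{}

func (passthroughDecoder) Decode(encoded []byte) ([]byte, error) {
	return append([]byte(nil), encoded...), nil
}

func main() {
	RegisterDecoder("ExampleFilter", passthroughDecoder{})

	if d, ok := LookupDecoder("ExampleFilter"); ok {
		out, err := d.Decode([]byte("raw bytes"))
		fmt.Println(string(out), err)
	}
}
```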
Text extraction should have options. Possible options:
Unify ContentStreamProcessor based on usage in render and extractor packages. Should be able to keep track of graphics and text state there in one place.
Introduction
The idea of this ticket is to collect all ideas that would make sense for next major version (v4). This is a chance for considering major updates.
Ideas related to performance improvements should include benchmarks that clearly show potential advantages.
Ideas for refactoring need to clearly state the advantages and also what it means for the user. If there is a breaking change, there should be an easy way to update code.
Ideas for reducing binary sizes are also welcome, although it's important that the API remains easy to use.
Related issues
There are already a few relevant issues:
49
55
19 - For this one it might make sense to create a generalized temp storage interface that the user can provide, with default implementations from us (in memory by default, on disk as a secondary option, or a hybrid). We recently did a similar thing in UniOffice.
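To illustrate the temp storage idea, here is a minimal sketch with an in-memory default and an on-disk alternative. The `TempStorage` interface, its `Put`/`Get`/`Close` methods and both implementations are assumptions made for this example; they are not taken from unipdf or UniOffice.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// TempStorage is a hypothetical interface that a user could implement.
type TempStorage interface {
	Put(name string, data []byte) error
	Get(name string) ([]byte, error)
	Close() error
}

// memStorage keeps everything in a map (the in-memory default).
type memStorage struct{ items map[string][]byte }

func NewMemStorage() *memStorage {
	return &memStorage{items: map[string][]byte{}}
}

func (m *memStorage) Put(name string, data []byte) error {
	m.items[name] = append([]byte(nil), data...) // store a private copy
	return nil
}

func (m *memStorage) Get(name string) ([]byte, error) {
	d, ok := m.items[name]
	if !ok {
		return nil, fmt.Errorf("temp storage: %q not found", name)
	}
	return d, nil
}

func (m *memStorage) Close() error {
	m.items = nil
	return nil
}

// diskStorage keeps each item as a file in a private temp directory.
type diskStorage struct{ dir string }

func NewDiskStorage() (*diskStorage, error) {
	dir, err := os.MkdirTemp("", "tempstore-")
	if err != nil {
		return nil, err
	}
	return &diskStorage{dir: dir}, nil
}

func (d *diskStorage) Put(name string, data []byte) error {
	return os.WriteFile(filepath.Join(d.dir, name), data, 0o600)
}

func (d *diskStorage) Get(name string) ([]byte, error) {
	return os.ReadFile(filepath.Join(d.dir, name))
}

// Close removes the temp directory and everything in it.
func (d *diskStorage) Close() error { return os.RemoveAll(d.dir) }

func main() {
	var store TempStorage = NewMemStorage()
	defer store.Close()

	if err := store.Put("page1", []byte("image data")); err != nil {
		fmt.Println("put failed:", err)
		return
	}
	data, err := store.Get("page1")
	fmt.Println(string(data), err)
}
```

A hybrid implementation could wrap both and spill to disk once the in-memory size passes a threshold.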