sambitdash / PDFIO.jl

PDF Reader Library for Native Julia.

Writing/modifying pdfs #56

Open kskyten opened 5 years ago

kskyten commented 5 years ago

Is it possible to modify the parsed pdf and write it to a file? Specifically I'm interested in the ideas from here: open-source-ideas/open-source-ideas#46. Julia has excellent support for neural networks, so it would be interesting to experiment with something like this.

sambitdash commented 5 years ago

Both are definitely possible. The first can fit within the purview of PDFIO, while the second can be developed as a separate project that uses the capabilities of PDFIO. PDFIO is a low-level PDF reading API (it can be extended for manipulation).

There are no plans to move it into the realm of machine learning, NLP, or document structure understanding.

kskyten commented 5 years ago

I agree. What should be done to support writing pdf files? Is that a large undertaking?

sambitdash commented 5 years ago

For the list given, 3-6 man-months, depending on how well you understand the PDF specification. Many of the items require document understanding and can be excluded from the list. More than development, good PDF parsers have to be tested with a variety of file types. That can be overwhelming.

kskyten commented 5 years ago

Unfortunately, I'm not very familiar with the PDF spec. What is the bare minimum that needs to be implemented just to write pdfs?

sambitdash commented 5 years ago

@kskyten Unfortunately, without understanding the PDF specification it will be hard to write a writer, particularly when you are looking at modifying page content. Moreover, writers require compression encoders, which are not integrated into PDFIO; only decoders are currently integrated.

Personally, a writer is not very high on my priorities. While I can guide as the maintainer and owner of the library, I cannot commit to any implementation work myself.
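
To illustrate the encoder side mentioned above, here is a minimal sketch in Python (not Julia, and not part of PDFIO) of serializing a stream object with the `/FlateDecode` filter. The helper names `make_flate_stream` and `read_flate_stream` are hypothetical, the object is assumed to have generation 0, and the reader exists only to check the round trip; it is not a general parser.

```python
import zlib


def make_flate_stream(obj_num: int, raw: bytes) -> bytes:
    """Serialize `raw` as an indirect stream object compressed with Flate.

    Hypothetical helper: assumes generation 0 and no extra dictionary keys.
    """
    compressed = zlib.compress(raw)
    # /Length must be the byte count of the *encoded* data, not the raw data.
    header = b"%d 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % (
        obj_num,
        len(compressed),
    )
    return header + compressed + b"\nendstream\nendobj\n"


def read_flate_stream(obj: bytes) -> bytes:
    """Inverse of make_flate_stream, for round-trip checking only."""
    start = obj.index(b"stream\n") + len(b"stream\n")
    end = obj.rindex(b"\nendstream")
    return zlib.decompress(obj[start:end])


content = b"BT /F1 12 Tf (Hello) Tj ET"
obj = make_flate_stream(5, content)
print(read_flate_stream(obj) == content)
```

The point of the sketch is only that every rewritten stream needs its data re-encoded and its `/Length` entry updated to match.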

kskyten commented 5 years ago

I was hoping I would just be able to copy the unmodified streams over and modify the lengths and references to make it work. I don't think I need a full-blown writer as I only need to modify a specific subset of streams, but I might be wrong.
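
A rough sketch of that idea, written in Python rather than Julia and independent of PDFIO: object bodies are copied byte for byte, and only the stream lengths and the cross-reference offsets are recomputed. The names `stream_body` and `write_pdf` are hypothetical, and the sketch assumes generation 0 everywhere, contiguous object numbers starting at 1, and a classic (uncompressed) xref table.

```python
from typing import Dict


def stream_body(data: bytes) -> bytes:
    """Body of an unfiltered stream object with a matching /Length entry."""
    return b"<< /Length %d >>\nstream\n" % len(data) + data + b"\nendstream"


def write_pdf(objects: Dict[int, bytes], root: int) -> bytes:
    """Assemble a file from serialized object bodies and rebuild the xref table.

    `objects` maps object number -> the bytes between "N 0 obj" and "endobj".
    Assumes object numbers 1..n are all present and every generation is 0.
    """
    out = bytearray(b"%PDF-1.4\n")
    offsets = {}
    for num in sorted(objects):
        offsets[num] = len(out)  # byte offset where "N 0 obj" begins
        out += b"%d 0 obj\n" % num + objects[num] + b"\nendobj\n"

    xref_pos = len(out)
    size = max(objects) + 1
    out += b"xref\n0 %d\n" % size
    out += b"0000000000 65535 f \n"  # entry for object 0, always free
    for num in range(1, size):
        out += b"%010d 00000 n \n" % offsets[num]  # each entry is 20 bytes
    out += b"trailer\n<< /Size %d /Root %d 0 R >>\n" % (size, root)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


objs = {
    1: b"<< /Type /Catalog /Pages 2 0 R >>",
    2: b"<< /Type /Pages /Kids [] /Count 0 >>",
    3: stream_body(b"hello"),  # stands in for a modified stream
}
pdf = write_pdf(objs, root=1)
```

Unmodified objects would be passed through exactly as read from the source file; whether real-world files (object streams, incremental updates, encryption) allow this shortcut is a separate question that the sketch does not address.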