pdf-association / pdf-issues

Industry-based resolutions for issues and errata reported against any PDF-related specification
https://pdf-issues.pdfa.org/

Real-time updated content in PDF files #186

Closed ghost closed 2 years ago

ghost commented 2 years ago

Describe the bug

  1. Real-time updated content in PDF files. PDF files currently do not make it possible for the information they contain to be updated. I believe PDF is a great file format, but it would be better if a PDF file could carry up-to-date information.
  2. This could be done if the PDF file called a networking module of the operating system itself to execute HTTP calls and thus have its content updated.

Provide a recommendation for correction.

Additional context

  1. I wish I had a way to reuse PDF files; PDF files usually consume a lot of disk space.
  2. This can be unsafe: it could become an entry point for malware.
  3. Imagine that you need to generate an updated PDF file every time there is some trivial change. Why can't we just update the file itself, instead of generating a separate file for every trivial change?

References

  1. https://www.computerworld.com/article/2475900/the-adobe-reader-consumes-mass-quantities-of-hard-drive-space.html
  2. https://www.bouldercounty.org/environment/recycle/reduce-reuse-recycle/
  3. https://help.liferay.com/hc/en-us/articles/360044412992-How-to-resolve-Document-Library-PDF-previews-taking-up-more-disk-space-than-the-original-files
  4. https://support.microsoft.com/en-us/windows/tips-to-free-up-drive-space-on-your-pc-4d97fc4a-0175-8d49-ac2f-bcf27de46d34
  5. https://www.recordnations.com/articles/file-format-space/
  6. http://www.customlivingsolutions.com/digitized-files-how-much-hard-drive-space-do-they-take-up
  7. https://www.onelegal.com/blog/5-ways-to-reduce-the-size-of-a-pdf/
  8. https://blog.flipsnack.com/5-ways-to-reduce-the-size-of-a-pdf-file/

MatthiasValvekens commented 2 years ago

First off, I'm afraid this isn't an actionable erratum, i.e. there's no "broken" feature in the standard to fix. So this probably isn't the best place to discuss.

Putting on my iText hat: this is something we've thought about in the past, and there are various challenging aspects to this problem. I won't go into detail right here, but suffice to say that this is a minefield that needs to be navigated very carefully. :)

Also, there might be a relevant talk on that topic at the upcoming PDF Days in September... ;).

lrosenthol commented 2 years ago

As @MatthiasValvekens said, and I will put in a different way, this is not a bug but an actual feature of PDF.

PDF files represent (to the world at large) a "document" - which makes sense since that is what the "D" stands for ;). Users do not expect documents to change when they are viewed - either live while open or between opens. Not only that, but most don't expect them to go anywhere near a network - hence why many viewers put up warning dialogs when a PDF tries to "phone home".

This is, of course, quite different from what users expect on the web - where content is entirely ephemeral, everything is based online, and communication is core.

ghost commented 2 years ago

Hi all! @lrosenthol @MatthiasValvekens How are you? So... thank you for the feedback! I hope to clarify some things...

  1. "Users do not expect documents to change when they are viewed - either live while open or between opens." - My goal is not end users but to help companies "to better manage their PDF files".
  2. "So this probably isn't the best place to discuss." - Sorry for the inconvenience. =(
  3. "Putting on my iText hat: this is something we've thought about in the past, and there are various challenging aspects to this problem. I won't go into detail right here, but suffice to say that this is a minefield that needs to be navigated very carefully. :)" - I would like to know more, but I understand you may not be able to give me details. Anyway, thanks for the feedback.

ghost commented 2 years ago

Hi all! I hope to clarify some things here too...

1. Initial considerations and/or general notes

  1. There are several companies that automate PDF files.
  2. There is currently no standardization on how to automate PDF files, so each company creates its own standard for doing so. That is good on the one hand and bad on the other.

2.1 That's good in that sense

  1. The company owns the technology and not another company
  2. The company builds it around its own technology and staff, which makes it easier to manage.

2.2 It's bad in that sense:

  1. Creating your own standard to automate PDF files is a lot of work and, over time, it becomes obsolete, either because the technology no longer has a customer or a viable product, or because the company no longer wants to finance the project.
  2. Technology that one company creates to standardize the automation of PDF files is not suitable for another company, because it is done in a different way that the other company may not want to adopt.
  3. And as I said before, technology that a single company creates to standardize file automation becomes obsolete over time once it no longer has a customer or a viable product.

2.3 Possible solution

  1. If there were a community repository where people create standards to automate PDF files, different companies could use the same code base. This would make it easier, for example, for them to create their own software following a more correct structure in terms of software development.
  2. What exists in the world today is undoubtedly RESTful architecture, which is a client-server architecture. It is quite easy to use REST calls to create or automate PDF files.
  3. My proposal here would be to have an open RESTful API to create and manage PDF files, which each company could use in whatever way suits it best.

4. Final considerations

  1. Updating the contents of a PDF file can be done through RESTful calls, which is a relatively safe standard.
  2. It's just an idea, a concept. I would like to know what you all think of it.
  3. PDF is a very common file format for businesses. It would be interesting to have a RESTful API to improve the way PDF is used in file automation, such as for invoices.
  4. I'm not promoting or wanting to promote any product, company, person, technology, service, or business. My purpose in referencing the links is for bibliography only.
  5. In my view, the best way to make this possible would be to generate a second file automatically whenever a PDF document is generated: a JSON file that contains the document information.

5. Sample

["paragraph",
 {"style":"bold",
  "size":10,
  "family":"helvetica",
  "color":[0, 255, 221]},
 "Lorem ipsum dolor sit amet, consectetur adipiscing elit."]

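As a rough illustration of how a consumer might read such a JSON element, here is a minimal Python sketch. It assumes each element is a three-item array of kind, style attributes, and text, as in the sample above. The `Element` class, the `parse_element` function, and the validation rules are hypothetical and are not part of any PDF standard or existing API.

```python
import json
from dataclasses import dataclass

# The sample element as JSON text: [kind, style attributes, text].
SAMPLE = """
["paragraph",
 {"style": "bold", "size": 10, "family": "helvetica", "color": [0, 255, 221]},
 "Lorem ipsum dolor sit amet, consectetur adipiscing elit."]
"""


@dataclass
class Element:
    """Hypothetical in-memory form of one element from the JSON file."""
    kind: str
    style: str
    size: int
    family: str
    color: tuple  # (red, green, blue), each assumed to be 0..255
    text: str


def parse_element(raw: str) -> Element:
    """Parse one [kind, attributes, text] triple and check basic constraints."""
    kind, attrs, text = json.loads(raw)
    color = tuple(attrs["color"])
    if len(color) != 3 or not all(isinstance(c, int) and 0 <= c <= 255 for c in color):
        raise ValueError(f"color must be three integers in 0..255, got {color!r}")
    if attrs["size"] <= 0:
        raise ValueError("size must be positive")
    return Element(kind, attrs["style"], attrs["size"], attrs["family"], color, text)


element = parse_element(SAMPLE)
print(element.kind, element.family, element.size, element.color)
```

Validating on read, rather than trusting the file, is one possible design choice here: a JSON file that travels separately from the PDF can drift out of sync with it or be edited by hand.
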
6. Initial problems

  1. I don't want to use third-party libraries
  2. I would like to use a PDF Association community RESTful standard

7. Concept

image

lrosenthol commented 2 years ago

@raphael-louis-andress So you aren't talking about the content of a PDF being updated after it is distributed, but instead how one could manage an internal repository of documents. That would seem to be the province of Document Management Systems (as their name implies). But a DMS doesn't do document creation (in most cases, though there are exceptions in specialized fields) - such creation (or updating) must be done outside using tools such as the ones that you suggest.

And PDF creation is going to be based on the type of content that is being authored, with very different content layout and formatting needs. Consider that the content authoring functionality of a word processing program (ala MSWord or Google Docs) is completely different from that of a spreadsheet (ala MSExcel or Google Sheets) and even more so from specialized authoring tools such as AutoCAD or MuseScore. So trying to define such a system that would be "generic" seems impractical.

If you are looking for REST APIs for PDF creation, whether from markup languages such as JSON, XML or HTML or from existing authoring formats, you will find a number of them offered by members of the PDF Association.

ghost commented 2 years ago

@lrosenthol Hi! Thank you for feedback. I appreciate your comment, you were quick to respond. So... that's why I'm happy, you're amazing... It's very kind of you to read this entire message I wrote, so thank you again.

1. Context

  1. I'm looking for ways to make this feature possible, so I thought of a RESTful API for this. However, your feedback showed me that I was wrong.
  2. Recently I was reading about how PDF works, including how its file format is structured. References here: https://resources.infosecinstitute.com/topic/pdf-file-format-basic-structure/ and here: https://en.wikipedia.org/wiki/PDF_Association

2. Questions

  1. Is this the main organization of the PDF file format?
  2. Are there other organizations of the PDF file format?
  3. Can you help me to clarify this question too?
  4. @MatthiasValvekens Hi! when you say: "Putting on my iText hat:" ... do you mean: https://itextpdf.com/en?

2.1 Context of Doubt

Curiosity. I would like to know more about the PDF format.

lrosenthol commented 2 years ago

PDF is standardized as ISO 32000. Part 1 is for PDF 1.7 and Part 2 is for PDF 2.0. Those documents are the official documentation on the PDF file format.
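
For readers who want to see which part applies to a given file, the following is a minimal Python sketch that reads the version from the `%PDF-x.y` header at the start of a file and maps it to the parts named above. The function names are illustrative assumptions, and the sketch only inspects the header; a `Version` entry in the document catalog can specify a later version than the header does, which this sketch does not check.

```python
import re

# Header pattern: "%PDF-" followed by major.minor, for example "%PDF-1.7".
HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def header_version(data: bytes):
    """Return (major, minor) from the PDF header, or None if no header is found.

    Searching the first 1024 bytes is a leniency assumed here for files
    with leading junk; the header is normally the first line of the file.
    """
    match = HEADER.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def iso_32000_part(version):
    """Map a header version to the ISO 32000 part mentioned in the comment above."""
    if version == (1, 7):
        return "ISO 32000-1"
    if version == (2, 0):
        return "ISO 32000-2"
    return None  # other versions are outside this simple mapping


# Illustrative input: the first bytes of a PDF 2.0 file.
print(iso_32000_part(header_version(b"%PDF-2.0\n%\xe2\xe3\xcf\xd3\n")))
```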

ghost commented 2 years ago

@lrosenthol Thank you for the feedback again ;D Please close this issue. Everything has been resolved - you clarified all my doubts. So... really, this feature doesn't make any sense.

MatthiasValvekens commented 2 years ago

Regarding this:

@MatthiasValvekens Hi! when you say: "Putting on my iText hat:" ... do you mean: https://itextpdf.com/en?

I work for iText. What I meant is that within the iText research team we've played around with the idea of "self-updating" PDF. I wanted to stress that that's completely orthogonal to anything in the PDF standard, since this issue tracker is really about resolving issues with the current spec. I should've made that distinction more clear.

Anyway, since you seem to be satisfied with the answers you got, I'll go ahead and close this, as you requested. :)