michaelrsweet / pdfio

PDFio is a simple C library for reading and writing PDF files.
https://www.msweet.org/pdfio
Apache License 2.0

Getting Image Objects from page/Dictionary #77

Closed uddhavphatak closed 1 month ago

uddhavphatak commented 1 month ago

I cannot find any API to get image objects out of a page object. It would be a good addition to be able to get all of the image objects from a page.

michaelrsweet commented 1 month ago

So this is already possible:

pdfio_file_t *pdf = pdfioFileOpen...;
pdfio_obj_t *page = pdfioFileGetPage(pdf, 42);
pdfio_dict_t *resources = pdfioDictGetDict(pdfioObjGetDict(page), "Resources");
pdfio_dict_t *images = pdfioDictGetDict(resources, "XObject");

pdfioDictIterateKeys(images, (pdfio_dict_cb_t)my_xobject_cb, pdf);

...

bool
my_xobject_cb(pdfio_dict_t *dict, const char *key, pdfio_file_t *pdf)
{
  pdfio_obj_t *image = pdfioDictGetObj(dict, key);

  ... verify the object has subtype "Image", then do something with the image object ...

  return (true);
}

I'm not sure it makes sense to develop a whole API just for this, since there are a lot of resources a page might use, not just images.
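
The verification step left open in the callback above could be sketched as a small helper. This is only an illustration, not part of PDFio: `is_image_xobject` is a hypothetical name, and it assumes the two strings come from `pdfioObjGetType` and `pdfioObjGetSubtype`, either of which may be NULL when the key is absent. In ISO 32000 an image XObject is identified by its Subtype ("Image"), while the Type key ("XObject") is optional.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper (not a PDFio function): returns true when the
// Type/Subtype pair describes an image XObject.  The arguments are assumed
// to be the strings returned by pdfioObjGetType() and pdfioObjGetSubtype();
// either may be NULL.
bool
is_image_xobject(const char *type, const char *subtype)
{
  // Subtype is required for XObjects and must be "Image" for images
  if (subtype == NULL || strcmp(subtype, "Image") != 0)
    return (false);

  // Type is optional for XObjects, but when present it must be "XObject"
  if (type != NULL && strcmp(type, "XObject") != 0)
    return (false);

  return (true);
}
```

Inside `my_xobject_cb`, the check would then read `if (is_image_xobject(pdfioObjGetType(image), pdfioObjGetSubtype(image)))`, so form XObjects and other entries in the XObject dictionary are skipped.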

michaelrsweet commented 1 month ago

Follow-up: looking at the documentation, I mention that pdfioFileGetPage returns a page object, but I don't show an example of getting stuff from the page dictionary. Let's use this issue to track some documentation improvements in the manual, specifically to access the Resources dictionary, XxxBox rectangles, and Rotate angle. Maybe also provide a full list of resources from ISO 32000?
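
To illustrate how an XxxBox rectangle and the Rotate angle combine, here is a minimal, library-independent sketch. The names `page_box_t`, `normalize_rotate`, and `displayed_size` are hypothetical and not part of PDFio; the struct is a stand-in for whatever rectangle type the box was read into. It assumes the box and angle have already been read from the page dictionary (for example with `pdfioDictGetRect` and `pdfioDictGetNumber`), and relies on the ISO 32000 rule that Rotate is a multiple of 90 degrees.

```c
#include <stdbool.h>

// Hypothetical stand-in for a page box such as MediaBox or CropBox:
// lower-left corner (x1, y1) and upper-right corner (x2, y2) in points.
typedef struct
{
  double x1, y1, x2, y2;
} page_box_t;

// Map a Rotate value to 0, 90, 180, or 270.
// Returns -1 when the value is not a multiple of 90.
int
normalize_rotate(int rotate)
{
  if (rotate % 90 != 0)
    return (-1);

  return (((rotate % 360) + 360) % 360);
}

// Compute the width and height of the page as displayed, swapping the
// two dimensions for quarter-turn rotations.  Returns false when the
// rotation is invalid.
bool
displayed_size(const page_box_t *box, int rotate, double *width, double *height)
{
  int    angle = normalize_rotate(rotate);
  double w     = box->x2 - box->x1;
  double h     = box->y2 - box->y1;

  if (angle < 0)
    return (false);

  // Tolerate boxes whose corners are given in the opposite order
  if (w < 0.0)
    w = -w;
  if (h < 0.0)
    h = -h;

  if (angle == 90 || angle == 270)
  {
    *width  = h;
    *height = w;
  }
  else
  {
    *width  = w;
    *height = h;
  }

  return (true);
}
```

For a US Letter MediaBox of `[0 0 612 792]` with a Rotate value of 90, this reports a displayed size of 792 by 612 points.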

michaelrsweet commented 1 month ago

Also, PR #63 is tracking the addition of dictionary accessors (to get the number of pairs and the key at a given index) which would allow the example code to be self-contained rather than using a callback function.

uddhavphatak commented 1 month ago

Since we are updating the documentation in this issue, I have another suggestion. For the function "pdfioObjGetType", could we document the different values it might return? I had to read the code for this function to understand what it returns. An example showing exactly what the function returns when it is called would also help.

michaelrsweet commented 1 month ago

I've pushed changes that you can look at:

[master 74dfefd] Update documentation (Issue #77)

In the main intro documentation I added printing the CropBox and MediaBox values, with a list of common page metadata you can access (and with which functions). And in the reference sections for pdfioObjGetType and pdfioObjGetSubtype I added a list of common types and subtypes, respectively. Let me know what you think...

uddhavphatak commented 1 month ago

Hello Michael, I am extremely sorry for the late reply; it is festival season here. I just read the documentation changes above, and they explain the page metadata in a PDF perfectly.

This change gave me an idea: could we write another piece of documentation explaining the PDF file structure as a whole? I volunteer to write it under your guidance. It would help beginners like me find information about PDF documents more easily.

michaelrsweet commented 1 month ago

Adding a section on the basics of PDF file structure would be a great addition to the documentation. Feel free to start a pull request with your changes and additions to the "doc/pdfio.md" file and we can collaborate from there.