pmaupin / pdfrw

pdfrw is a pure Python library that reads and writes PDFs

Get Layered PDF using pdfrw #244

Open rafayaar opened 3 months ago

rafayaar commented 3 months ago

I have a few concerns:

1- When reading a PDF with PdfReader, we get a reader object which, when printed, shows quite detailed metadata. I can't see the text content in that metadata.

2- '/Parent': {...} shows up in the metadata, and I don't understand why the Ellipsis operator is used here.

3- I am trying to get the layers of a PDF, so that I can get the background and the text content along with font family, weight, dimensions, etc., as well as images, graphics and everything else. Is there any way I can do that?

sl2c commented 2 months ago

I have a few concerns:

1- When reading a PDF with PdfReader, we get a reader object which, when printed, shows quite detailed metadata. I can't see the text content in that metadata.

If by "text content" you mean the text you see when a PDF page is rendered on screen, then that text, like everything else that gets rendered, is stored in the streams attached to the PDF's dictionaries. Printing an instance of PdfReader shows only the dictionary headers, because pdfrw deals almost exclusively with those headers: it has only rudimentary support for stream decompression and no support for stream parsing. For all of that I can recommend pdfrwx, which I am actively developing at the moment.
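
To make the point about streams concrete, here is a minimal sketch that uses only the standard library. It assumes a content stream stored with the /FlateDecode filter (plain zlib compression); the sample operators and the `decode_flate` helper are made up for illustration and are not part of pdfrw.

```python
import zlib


def decode_flate(stream_bytes: bytes) -> bytes:
    """Decompress a stream body that was stored with /FlateDecode (zlib)."""
    return zlib.decompress(stream_bytes)


# Hand-made content stream: select font /F1 at 12 pt, move to (72, 720),
# then show a text string. Real streams are usually much longer.
content = b"BT /F1 12 Tf 72 720 Td (Hello layers) Tj ET"

# Roughly what a PDF file would hold as the stream body.
stored = zlib.compress(content)

# After decompression the drawing operators, including the text, are visible.
decoded = decode_flate(stored)
print(decoded.decode("latin-1"))
```

With pdfrw itself, the raw stream of the first page is typically reachable as `PdfReader(path).pages[0].Contents.stream`, provided /Contents is a single stream rather than an array of streams.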

2- '/Parent': {...} shows up in the metadata, and I don't understand why the Ellipsis operator is used here.

It's a good idea to recursively print only the objects that are "below" the object being printed; otherwise you will soon run into infinite recursion, since a page's /Parent points back up to the node that lists that page among its children. Besides this, the __repr__() function also does not explicitly recurse into some branches that are too large to print, so as not to clutter the output.
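
The cycle can be reproduced with plain Python dictionaries. Python's built-in dict repr prints {...} when it meets a dictionary it is already in the middle of printing, so the {...} is a placeholder and has nothing to do with the Ellipsis object. Whether pdfrw's output comes from this built-in guard or from its own printing logic is not something this sketch settles; the dictionaries below are simplified stand-ins, not pdfrw objects.

```python
# Plain dicts standing in for a page and its parent page-tree node.
page = {"/Type": "/Page"}
pages = {"/Type": "/Pages", "/Kids": [page]}

# The back-reference that closes the cycle: page -> pages -> page -> ...
page["/Parent"] = pages

# Printing stops at the dictionary that is already being printed.
print(pages)
# {'/Type': '/Pages', '/Kids': [{'/Type': '/Page', '/Parent': {...}}]}
```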

3- I am trying to get the layers of a PDF, so that I can get the background and the text content along with font family, weight, dimensions, etc., as well as images, graphics and everything else. Is there any way I can do that?

This functionality is not available as a ready-made function in pdfrw. However, it takes only a couple of hundred lines of code to do what you want, including teaching pdfrw to parse streams. For a reference implementation, please see pdfstreamparser.py, specifically the PdfStream class constructor, which includes exactly the options you are looking for.
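
To give a rough idea of what parsing a stream involves, here is a toy sketch that pulls text runs and the active font out of an already decompressed content stream. It is not the pdfstreamparser.py implementation and not pdfrw API: the `text_runs` function and the sample stream are hypothetical, and the tokenizer deliberately ignores TJ arrays, hex strings, nested parentheses, escape sequences and font encodings, all of which a real parser has to handle.

```python
import re

# Very small tokenizer for a subset of content-stream syntax.
TOKEN = re.compile(rb"""
    \((?:\\.|[^\\()])*\)      # literal string without nested parentheses
  | /[^\s/()<>\[\]{}%]+       # name such as /F1
  | [-+]?\d*\.?\d+            # integer or real number
  | [A-Za-z*]+                # operator such as Tf or Tj
""", re.VERBOSE)


def text_runs(content: bytes):
    """Yield (font_name, font_size, text) for each Tj operator."""
    operands, font, size = [], None, None
    for match in TOKEN.finditer(content):
        tok = match.group()
        # Strings, names and numbers are operands; collect them until
        # the operator that consumes them arrives.
        if tok[:1] in b"(/+-.0123456789":
            operands.append(tok)
            continue
        if tok == b"Tf" and len(operands) >= 2:
            # Operands: font resource name, then size.
            font = operands[-2].decode("latin-1")
            size = float(operands[-1].decode("latin-1"))
        elif tok == b"Tj" and operands:
            # Operand: the string to show, with its parentheses stripped.
            yield font, size, operands[-1][1:-1].decode("latin-1")
        operands = []


sample = b"BT /F1 12 Tf 72 720 Td (Hello) Tj /F2 9 Tf (world) Tj ET"
for run in text_runs(sample):
    print(run)
```

A name like /F1 is only a key into the page's /Resources /Font dictionary, so font family and weight would have to be looked up there in a further step.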