pdf-rs / pdf

Rust library to read, manipulate and write PDF files.
MIT License

How do I extract `PageLabel` from a PDF? #210

Closed: Atreyagaurav closed this issue 9 months ago

Atreyagaurav commented 9 months ago

I see that there is a datatype [PageLabel](https://docs.rs/pdf/latest/pdf/object/struct.PageLabel.html#) in the library, but I can't figure out any way to read it from a PDF. I know the PDF has page labels, because I can see them if I convert the PDF into a text-editor-friendly format and open it. Beamer-created PDFs have them too.

Edit: As a more general question, how do I extract data from a Stream?

The case for PageLabels is something like this:

%% Object stream: object 17, index 15; original object ID: 2148
<<
  /Metadata 1548 0 R
  /Names 16 0 R
  /OpenAction 386 0 R
  /Outlines 1537 0 R
  /PageLabels <<
    /Nums [
      0
      <<
        /P <feff0031>
      >>
      1
      <<
        /P <feff0032>
      >>
      3
      <<
        /P <feff0033>
      >>
      4
      <<
        /P <feff0034>
      >>
      6
      <<
        /P <feff0035>
      >>
      7
      <<
        /P <feff0036>
      >>
      8
      <<
        /P <feff0037>
      >>
      11
      <<
        /P <feff0038>
      >>
      13
      <<
        /P <feff0039>
      >>
      16
      <<
        /P <feff00310030>
      >>
      17
      <<
        /P <feff00310031>
      >>
      18
      <<
        /P <feff00310032>
      >>
      21
      <<
        /P <feff00310033>
      >>
      23
      <<
        /P <feff00310034>
      >>
      24
      <<
        /P <feff00310035>
      >>
      25
      <<
        /P <feff00310036>
      >>
      26
      <<
        /P <feff00310037>
      >>
      27
      <<
        /P <feff00310038>
      >>
      28
      <<
        /P <feff00310039>
      >>
      29
      <<
        /P <feff00320030>
      >>
      30
      <<
        /P <feff00320031>
      >>
      31
      <<
        /P <feff00320032>
      >>
      32
      <<
        /P <feff00320033>
      >>
      33
      <<
        /P <feff00320034>
      >>
      34
      <<
        /P <feff0031>
      >>
      36
      <<
        /P <feff0032>
      >>
      37
      <<
        /P <feff0033>
      >>
      38
      <<
        /P <feff0034>
      >>
      39
      <<
        /P <feff0035>
      >>
      40
      <<
        /P <feff0036>
      >>
      41
      <<
        /P <feff0037>
      >>
      42
      <<
        /P <feff0038>
      >>
    ]
  >>
  /PageMode /UseOutlines
  /Pages 1536 0 R
  /Type /Catalog
>>
endstream
endobj
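
Each `/P` value in the dump is a hex string that begins with `feff`, the UTF-16BE byte-order mark, so `<feff00310030>` is the text "10". A minimal std-only sketch of that decoding follows; the helper name `decode_utf16be_hex` is made up for illustration and is not part of the pdf crate, and it assumes the string carries the BOM as every entry here does.

```rust
/// Decode the body of a PDF hex string such as "feff00310030" into text.
/// Assumes UTF-16BE introduced by the FE FF byte-order mark, as in the dump
/// above; returns None for anything else. (Hypothetical helper, std only.)
fn decode_utf16be_hex(hex: &str) -> Option<String> {
    // Convert every hex digit to its 4-bit value; bail out on a non-hex char.
    let digits: Vec<u8> = hex
        .chars()
        .map(|c| c.to_digit(16).map(|d| d as u8))
        .collect::<Option<_>>()?;
    if digits.len() % 2 != 0 {
        return None;
    }
    // Combine each pair of digits into one byte.
    let bytes: Vec<u8> = digits.chunks(2).map(|p| (p[0] << 4) | p[1]).collect();

    // Require the byte-order mark, then read big-endian 16-bit code units.
    if !bytes.starts_with(&[0xFE, 0xFF]) || bytes.len() % 2 != 0 {
        return None;
    }
    let units: Vec<u16> = bytes[2..]
        .chunks(2)
        .map(|p| u16::from_be_bytes([p[0], p[1]]))
        .collect();
    String::from_utf16(&units).ok()
}

fn main() {
    for hex in ["feff0031", "feff00310030", "feff00320034"] {
        println!("<{hex}> -> {:?}", decode_utf16be_hex(hex));
    }
}
```
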

s3bk commented 9 months ago

Looks like that is the catalog, which File::get_root returns: https://docs.rs/pdf/latest/pdf/file/struct.File.html#method.get_root PageLabels needs to be added there.

Atreyagaurav commented 9 months ago

I see the catalog, but everything there is a Ref. I can get some Stream from the root, but I want to know how to get the information out of it programmatically, because it just says Ref for everything.

s3bk commented 9 months ago

Ref can be dereferenced with the resolver: `file.resolver().get(r)`, where `r` is the Ref.

I added page_labels to the Catalog.

Atreyagaurav commented 9 months ago

> Ref can be dereferenced with the resolver.

Yes, but I get more Refs (or PlainRefs). How do I know what kind of data one holds and how to convert it into usable data? Debug printing just gives the output below. Suppose I want to search for PageLabels manually by looking at object streams: all I get are these. Even if I get inner from there, I have no idea what data type it's supposed to be.

RcRef { inner: PlainRef { id: 5807, gen: 0 }, data: () }

> I added page_labels to the Catalog.

I don't see any commits. Where can I try that?

Some examples would be useful, such as getting custom tags from a PDF.

Also, for now I went with poppler-rs for my program, as it seems to give the page labels, although I had to get them for each page instead of from the document itself.

Atreyagaurav commented 9 months ago

This is the sample code I tried:

use std::path::PathBuf;

use pdf::object::Resolve; // brings `get` on the resolver into scope

fn main() {
    let path = PathBuf::from("/path/to/slides.pdf");
    let file = pdf::file::FileOptions::cached().open(path).unwrap();
    // Dereference the catalog's Metadata Ref and debug-print the result.
    println!(
        "{:?}",
        file.resolver()
            .get(file.get_root().metadata.unwrap())
            .unwrap()
    );
}

s3bk commented 9 months ago

Oops. I didn't check the terminal again after hitting return. If you are working with PlainRefs, you just have to fetch them and see what they actually are. Resolver::resolve, I think, would be the function to call.

s3bk commented 9 months ago

To read the Metadata field, again use the resolver's get, then call data() on the stream you get back.

s3bk commented 9 months ago

Well, the code I added is incorrect.

Atreyagaurav commented 9 months ago

Yeah, I saw it's added, but it doesn't extract the info.

`println!("{:#?}", file.get_root().page_labels);` gives me this:

Some(
    NameTree {
        limits: None,
        node: Intermediate(
            [],
        ),
    },
)

s3bk commented 9 months ago

It is working as of 5c19ff6a7e040a6a83edcaad5f7110d31705fd20. See the end of examples/names.rs for an example.

Atreyagaurav commented 9 months ago

Thank you. It works. It looks like Beamer page numbers are saved as the prefix, so I did something like this:

    if let Some(ref labels) = catalog.page_labels {
        labels.walk(&resolver, &mut |page: i32, label| {
            println!(
                "{page} -> {:?}",
                label.prefix.as_ref().unwrap().to_string_lossy()
            );
        });
    }
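
Each key in the `/Nums` array from the first post is the index of the first page of a range, and its label applies up to the next key. Since the dictionaries there carry only `/P` and no numbering style, the label is just the prefix. A rough std-only sketch of looking up the label for any page index follows; `LabelRange` and `label_for_page` are hypothetical names for illustration, not pdf crate API, and the prefixes are assumed to be already decoded.

```rust
/// One entry of a /PageLabels /Nums array: the index of the first page the
/// label applies to, and the decoded /P prefix. (Hypothetical type.)
struct LabelRange {
    first_page: usize,
    prefix: &'static str,
}

/// Return the prefix that covers `page` (0-based), or None if the page lies
/// before the first range. `ranges` must be sorted by `first_page`.
fn label_for_page(ranges: &[LabelRange], page: usize) -> Option<&'static str> {
    // Count the ranges that start at or before `page`; the last of them
    // is the one that covers it.
    let covering = ranges.partition_point(|r| r.first_page <= page);
    covering.checked_sub(1).map(|i| ranges[i].prefix)
}

fn main() {
    // First few entries of the /Nums array, with /P already decoded.
    let ranges = [
        LabelRange { first_page: 0, prefix: "1" },
        LabelRange { first_page: 1, prefix: "2" },
        LabelRange { first_page: 3, prefix: "3" },
        LabelRange { first_page: 4, prefix: "4" },
        LabelRange { first_page: 6, prefix: "5" },
    ];
    for page in 0..7 {
        println!("page index {page} -> {:?}", label_for_page(&ranges, page));
    }
}
```

With these entries, page indices 1 and 2 both map to the prefix "2", which matches a Beamer slide that spans two PDF pages because of an overlay.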