pdf-rs / pdf

Rust library to read, manipulate and write PDF files.
MIT License

How can I get a page and save to another file? #99

Open zemelLeong opened 3 years ago

zemelLeong commented 3 years ago

Something like this PyPDF2 snippet:

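from PyPDF2 import PdfFileReader, PdfFileWriter

# Copy one page (index 2, i.e. the third page) into a new file.
# The input must stay open until the output has been written.
with open("test.pdf", "rb") as src, open("./splitted.pdf", "wb") as dst:
    pdf_input = PdfFileReader(src)

    pdf_output = PdfFileWriter()
    page = pdf_input.getPage(2)
    pdf_output.addPage(page)

    pdf_output.write(dst)

s3bk commented 3 years ago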

Writing PDFs is still very much under construction. Take a look at https://github.com/pdf-rs/pdf/blob/master/examples/content/src/main.rs and use the page's existing content (page.contents) instead of the content built in that example.

Note that so far no cleanup is done. It just writes another trailer to the existing data.
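
To make the size implication concrete, here is a toy sketch in plain Rust (standard library only). It is not pdf-rs code: `append_update` and its arguments are hypothetical names used only to model a save that keeps the original bytes and appends a new section after them.

```rust
/// Toy model of an append-style save: the original bytes are kept verbatim
/// and the new section (changed objects plus a new trailer) is added at the end.
/// Hypothetical helper for illustration, not part of pdf-rs.
fn append_update(original: &[u8], update: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(original.len() + update.len());
    out.extend_from_slice(original);
    out.extend_from_slice(update);
    out
}
```

Under this model the saved file can never be smaller than the input, even if the new catalog references only a single page.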

zemelLeong commented 3 years ago

I hope a page can be read and written out as a stream, like this:

import asyncio
from PyPDF2 import PdfFileReader, PdfFileWriter

async def sender():
    _, writer = await asyncio.open_connection('127.0.0.1', 8888)

    old_write = writer.write
    writer.length = 0

    def write(data):
        writer.length += len(data)
        old_write(data)

    def tell():
        return writer.length

    writer.tell = tell
    writer.write = write

    pdf_input = PdfFileReader(open("original.pdf", 'rb'))

    pdf_output = PdfFileWriter()
    page = pdf_input.getPage(5)

    pdf_output.addPage(page)

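    pdf_output.write(writer)

    # Flush buffered data and close the connection before the loop exits.
    await writer.drain()
    writer.close()
    await writer.wait_closed()

asyncio.run(sender())

s3bk commented 3 years ago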

No. The entire PDF has to be available. Technically there is an extension that allows processing partial PDFs, but supporting it would require a much more complex architecture.
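
One reason that follows from the PDF format itself: a reader starts at the end of the file, where the `startxref` keyword gives the byte offset of the cross-reference table. The sketch below shows that lookup in plain Rust (standard library only). It is not pdf-rs code, and `find_startxref` is a hypothetical helper name.

```rust
/// Return the offset that follows the last `startxref` keyword, if any.
/// Hypothetical helper for illustration, not part of pdf-rs.
fn find_startxref(data: &[u8]) -> Option<usize> {
    const KEY: &[u8] = b"startxref";
    // Search backwards: the keyword sits near the end of the file.
    let pos = data.windows(KEY.len()).rposition(|w| w == KEY)?;
    let rest = &data[pos + KEY.len()..];
    // Skip whitespace, then collect the decimal digits of the offset.
    let digits: Vec<u8> = rest
        .iter()
        .copied()
        .skip_while(|b| b.is_ascii_whitespace())
        .take_while(|b| b.is_ascii_digit())
        .collect();
    std::str::from_utf8(&digits).ok()?.parse().ok()
}
```

If only the first part of a file has arrived, the keyword is missing and the function returns `None`, so there is nothing to locate the objects with.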

zemelLeong commented 3 years ago

I may not have expressed myself clearly. I want the page that was read to be converted into a byte array so that it can be sent over the network. I found similar code in pdf-rs and PyPDF2.

s3bk commented 3 years ago

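For now you can add a save_to_vec here: https://github.com/pdf-rs/pdf/blob/master/pdf/src/file.rs#L261

    // Like save_to, but returns the bytes instead of writing them to a path.
    // The .to_vec() copy assumes storage.save hands back the saved bytes as a slice.
    pub fn save_to_vec(&mut self) -> Result<Vec<u8>> {
        Ok(self.storage.save(&mut self.trailer)?.to_vec())
    }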

Note that the output still contains all original data, so it will not be smaller.
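
Once the bytes are in a `Vec<u8>`, sending them over a connection is independent of pdf-rs. A minimal sketch in plain Rust (standard library only): the length-prefix framing and the names `send_framed` and `recv_framed` are assumptions made for illustration, not something the library provides.

```rust
use std::io::{self, Read, Write};

/// Write `bytes` preceded by its length as a big-endian u64.
/// Hypothetical helper; the framing format is an assumption.
fn send_framed<W: Write>(mut out: W, bytes: &[u8]) -> io::Result<()> {
    out.write_all(&(bytes.len() as u64).to_be_bytes())?;
    out.write_all(bytes)?;
    out.flush()
}

/// Read one message written by `send_framed`.
/// Note: trusts the length prefix, so only use it with a trusted peer.
fn recv_framed<R: Read>(mut input: R) -> io::Result<Vec<u8>> {
    let mut len = [0u8; 8];
    input.read_exact(&mut len)?;
    let len = u64::from_be_bytes(len) as usize;
    let mut buf = vec![0u8; len];
    input.read_exact(&mut buf)?;
    Ok(buf)
}
```

Both functions accept any `Write` or `Read` value, so a `std::net::TcpStream` connected to something like 127.0.0.1:8888 can be passed directly, and an in-memory `Vec<u8>` works for testing.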

zemelLeong commented 3 years ago

I used this file to test the example. The generated file renders as a blank page, and other input files show the same problem.

#[cfg(test)]
mod pdf_test {
    use pdf::content::{Op, Point};
    use pdf::{build::PageBuilder, content::Content, file::File};
    use pdf::build::CatalogBuilder;

    macro_rules! file_path {
        ( $sub_dir:expr ) => { concat!("./src/test/common/", $sub_dir) }
    }

    macro_rules! run {
        ($e:expr) => (
            match $e {
                Ok(v) => v,
                Err(e) => {
                    e.trace();
                    panic!("{}", e);
                }
            }
        )
    }

    #[test]
    pub fn write_pages() {
        let mut file = run!(File::<Vec<u8>>::open(file_path!("xelatex.pdf")));
        let mut pages = Vec::new();
        for page in file.pages().take(1) {
            let page = page.unwrap();
            if let Some(ref c) = page.contents {
                println!("{:?}", c);
            }

            let content = Content::from_ops(vec![
                Op::MoveTo { p: Point { x: 100., y: 100. } },
                Op::LineTo { p: Point { x: 100., y: 200. } },
                Op::LineTo { p: Point { x: 200., y: 100. } },
                Op::LineTo { p: Point { x: 200., y: 200. } },
                Op::Close,
                Op::Stroke,
            ]);
            pages.push(PageBuilder::from_content(content));
        }
        let catalog = CatalogBuilder::from_pages(pages)
            .build(&mut file).unwrap();

        file.update_catalog(catalog).unwrap();

        file.save_to(file_path!("modify.pdf")).unwrap();
    }
}

zemelLeong commented 3 years ago

Opening modify.pdf produces an error:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Try { file: "pdf\\src\\file.rs", line: 277, column: 23, source: FromPrimitive { typ: "RcRef < Catalog >", field: "root", source: Try { file: "pdf\\src\\file.rs", line: 94, column: 19, source: FromPrimitive { typ: "PagesRc", field: "pages", source: Try { file: "pdf\\src\\object\\types.rs", line: 90, column: 20, source: UnexpectedPrimitive { expected: "Reference", found: "Dictionary" } } } } } }', examples\content\src\main.rs:12:49

s3bk commented 3 years ago

Yea, I ran into the same problem. This should be fixed now. Try running cargo update (or git pull if you have a local repo).

zemelLeong commented 3 years ago

The rewritten content seems to be missing some information.

#[cfg(test)]
mod pdf_test {
    use pdf::content::{Op, Point};
    use pdf::{build::PageBuilder, content::Content, file::File};
    use pdf::build::CatalogBuilder;

    macro_rules! file_path {
        ( $sub_dir:expr ) => { concat!("./src/test/common/", $sub_dir) }
    }

    macro_rules! run {
        ($e:expr) => (
            match $e {
                Ok(v) => v,
                Err(e) => {
                    e.trace();
                    panic!("{}", e);
                }
            }
        )
    }

    #[test]
    pub fn write_pages() {
        let mut file = run!(File::<Vec<u8>>::open(file_path!("xelatex.pdf")));

        let mut pages = Vec::new();
        // for page in file.pages().take(1) {
        //     let page = page.unwrap();
        //     if let Some(ref c) = page.contents {
        //         println!("{:?}", c);
        //     }

        //     let content = Content::from_ops(vec![
        //         Op::MoveTo { p: Point { x: 100., y: 100. } },
        //         Op::LineTo { p: Point { x: 100., y: 200. } },
        //         Op::LineTo { p: Point { x: 200., y: 100. } },
        //         Op::LineTo { p: Point { x: 200., y: 200. } },
        //         Op::Close,
        //         Op::Stroke,
        //     ]);
        //     pages.push(PageBuilder::from_content(content));
        // }

        // for page in file.pages() {
        //     if let Some(ref contents) = page.unwrap().contents {
        //         let content = Content::from_ops(contents.operations.to_vec());
        //         pages.push(PageBuilder::from_content(content));
        //     }
        // }

        for page in file.pages().take(2) {
            let content = page.unwrap().contents.clone().unwrap();
            pages.push(PageBuilder::from_content(content));
        }

        let catalog = CatalogBuilder::from_pages(pages)
            .build(&mut file).unwrap();

        file.update_catalog(catalog).unwrap();

        file.save_to(file_path!("modify.pdf")).unwrap();
    }
}

zemelLeong commented 3 years ago

This approach worked: build each page with PageBuilder::from_page instead of PageBuilder::from_content.

#[cfg(test)]
mod pdf_test {
    // Only the items this test actually uses.
    use pdf::build::{CatalogBuilder, PageBuilder};
    use pdf::file::File;

    macro_rules! file_path {
        ( $sub_dir:expr ) => { concat!("./src/test/common/", $sub_dir) }
    }

    macro_rules! run {
        ($e:expr) => (
            match $e {
                Ok(v) => v,
                Err(e) => {
                    e.trace();
                    panic!("{}", e);
                }
            }
        )
    }

    #[test]
    pub fn write_pages() {
        let mut file = run!(File::<Vec<u8>>::open(file_path!("xelatex.pdf")));

        let mut pages = Vec::new();

        for page in file.pages().take(2) {
            if let Ok(ref page) = page {
                pages.push(PageBuilder::from_page(page).unwrap());
            }
        }

        let catalog = CatalogBuilder::from_pages(pages)
            .build(&mut file).unwrap();

        file.update_catalog(catalog).unwrap();

        file.save_to(file_path!("modify.pdf")).unwrap();
    }
}