jrmuizel / pdf-extract

A Rust library for extracting content from PDFs

Added output_page fn #63

Open JuniFruit opened 12 months ago

JuniFruit commented 12 months ago

Description

Adds a public output_page function, similar to output_doc, that outputs a single specified page instead of the whole document.

Reason

When scanning a document page by page, for example to record which pages contain a given search term, you can use this function to extract the text of one page at a time. Example:

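use lopdf::Document;
use pdf_extract::{output_page, OutputError, PlainTextOutput};

/// Returns the (page number, object id) pair of every page whose text contains `term`.
pub fn find_pages_with_term(doc_file: &Document, term: &str) -> Vec<(u32, (u32, u16))> {
    let mut res = vec![];
    // `get_pages` maps page numbers to page object ids.
    for (p, id) in doc_file.get_pages() {
        println!("Looking for {} in page {}", term, p);
        // A page whose text cannot be extracted is treated as empty.
        let text = page_to_text(doc_file, &p).unwrap_or_default();
        println!("{:?}", text);
        if text.contains(term) {
            res.push((p, id));
        }
    }
    res
}

/// Extracts the plain text of a single page using the new `output_page` function.
pub fn page_to_text(doc_file: &Document, page: &u32) -> Result<String, OutputError> {
    let mut s = String::new();
    {
        // Scope the output so its mutable borrow of `s` ends before `s` is returned.
        let mut output = PlainTextOutput::new(&mut s);
        output_page(doc_file, &mut output, page)?;
    }
    Ok(s)
}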
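
The search step itself does not depend on the PDF parsing, so it can be split out and checked without a document on disk. A minimal sketch of that idea, using only the standard library; pages_containing is a hypothetical helper name that is not part of pdf-extract or this PR, and it assumes the page texts were already collected, for instance with page_to_text above:

```rust
/// Hypothetical helper (not part of pdf-extract): returns the numbers of the
/// pages whose text contains `term`.
///
/// `pages` holds (page number, extracted text) pairs that the caller has
/// already collected. Matching is case-sensitive, and an empty `term`
/// matches every page because every string contains the empty string.
pub fn pages_containing(pages: &[(u32, String)], term: &str) -> Vec<u32> {
    pages
        .iter()
        // Keep only the pages whose text includes the search term.
        .filter(|(_, text)| text.contains(term))
        // Return just the page numbers, in the order they were given.
        .map(|(number, _)| *number)
        .collect()
}
```

Keeping the match separate from the extraction also makes it easy to swap in a different comparison later, such as a case-insensitive one, without touching the PDF code.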

Testing

All tests pass, and there are no breaking changes.