izderadicka / pdfparser

Python binding to libpoppler with focus on text extraction

Parse PDF without it being on the disc #5

Open oplahcinski opened 7 years ago

oplahcinski commented 7 years ago

Hi,

I'm not great at C/C++, but I've been reading the poppler source trying to learn how things work.

Do you know of any way to give the pdf to poppler if I've used something like requests to get it off of a website?

I see they have a curl loader in the source files but the files I need are behind authentication.

Since you are much more experienced with the source, do you think you could point me in the right direction on how something like this could be added?

Currently I do a really ugly named tempfile write and then pass that file location to pdfparser.Document. I'm thinking I could use an internal redirect on nginx to hit my own endpoint, but I'm not sure what the overhead of all that would be compared to the tempfile.
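For readers following along, the tempfile workaround described above could be sketched roughly as follows. This is only an illustration under assumptions: `parse_from_bytes` and `open_document` are hypothetical names, and `open_document` stands in for anything that accepts a file path (such as pdfparser.Document, whose exact import path and argument type are not shown in this thread).

```python
import os
import tempfile


def parse_from_bytes(pdf_bytes, open_document):
    """Write pdf_bytes to a named temp file and pass its path to open_document.

    open_document is a hypothetical stand-in for a path-based parser entry
    point; it is a parameter so this sketch runs without the library.
    """
    # delete=False lets the parser reopen the file by path on every platform.
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        tmp.write(pdf_bytes)
        tmp.close()  # flush to disk and release the handle before parsing
        return open_document(tmp.name)
    finally:
        tmp.close()  # no-op if already closed
        os.unlink(tmp.name)  # remove the temp file once the call returns
```

The bytes could come from any source, for example the `content` of an authenticated `requests` response. The file is removed as soon as `open_document` returns, so this assumes the parser has either finished reading by then or keeps its own open handle to the file.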

Thanks

izderadicka commented 7 years ago

Sorry, I'm not an expert myself either. I've never looked into the possibility of streaming a PDF into the parser.

izderadicka commented 6 years ago

@aloknayak29 please open a new issue, as this is not related to this thread. Provide the crashing file, the exact versions of libpoppler and Python, and platform details.

HollayHorvath commented 6 years ago

Hello,

poppler::document::load_from_raw_data could be used to implement functionality like this, since it works on char arrays.

The workflow would be something like this:

izderadicka commented 6 years ago

@HollayHorvath - I think this will not work, as load_from_raw_data is part of the public cpp API, but here we work with the internal poppler API, so we need to look for something a bit different:

We are using PDFDoc from internal API, which also has this constructor:

PDFDoc(BaseStream *strA, GooString *ownerPassword = NULL,
     GooString *userPassword = NULL, void *guiDataA = NULL);

and this is how it's used in the referenced function:

document* document::load_from_raw_data(const char *file_data,
                                       int file_data_length,
                                       const std::string &owner_password,
                                       const std::string &user_password)
{
    if (!file_data || file_data_length < 10) {
        return nullptr;
    }

    document_private *doc = new document_private(
                                file_data, file_data_length,
                                owner_password, user_password);
    return document_private::check_document(doc, nullptr);
}

document_private::document_private(const char *file_data, int file_data_length,
                                   const std::string &owner_password,
                                   const std::string &user_password)
    : initer()
    , doc(nullptr)
    , raw_doc_data(file_data)
    , raw_doc_data_length(file_data_length)
    , is_locked(false)
{
    MemStream *memstr = new MemStream(const_cast<char *>(raw_doc_data), 0, raw_doc_data_length, Object(objNull));
    GooString goo_owner_password(owner_password.c_str());
    GooString goo_user_password(user_password.c_str());
    doc = new PDFDoc(memstr, &goo_owner_password, &goo_user_password);
}

So the way to go should be to expose the alternative PDFDoc constructor and MemStream in Cython, then create a MemStream from data loaded into memory in Python and pass it to that constructor.