pvginkel / PdfiumViewer

PDF viewer based on Google's PDFium.
Apache License 2.0

Text Extraction #18

Closed: peter0302 closed this 9 years ago

peter0302 commented 9 years ago

Hey guys - I needed simple text extraction and went ahead and added it in my copy. Very easy to do. The changes are:

NativeMethods.cs, NativeMethods, add:

    /* PNM */
    [DllImport("pdfium.dll")]
    public static extern int FPDFText_CountChars(IntPtr page);

    /* PNM */
    [DllImport("pdfium.dll")]
    public static extern int FPDFText_GetText(IntPtr page, int startIndex, int count, IntPtr result);

PdfFile.cs, PdfFile, add:

    /*PNM*/
    public int GetCharacterCount(int pageNumber)
    {
        using (var pageData = new PageData(_document, _form, pageNumber))
        {
            return NativeMethods.FPDFText_CountChars(pageData.TextPage);
        }
    }

    /*PNM*/
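    public string GetText(int pageNumber)
    {
        using (var pageData = new PageData(_document, _form, pageNumber))
        {
            int count = NativeMethods.FPDFText_CountChars(pageData.TextPage);
            if (count <= 0)
                return string.Empty; // empty page, or the count could not be read

            // UTF-16: two bytes per character, plus one slot for the terminator.
            IntPtr buffer = Marshal.AllocHGlobal((count + 1) * 2);
            try
            {
                // AllocHGlobal does not zero memory; terminate the last slot up front.
                Marshal.WriteInt16(buffer, count * 2, (short)0);
                NativeMethods.FPDFText_GetText(pageData.TextPage, 0, count, buffer);
                return Marshal.PtrToStringUni(buffer);
            }
            finally
            {
                // Free the unmanaged buffer even if marshaling throws.
                Marshal.FreeHGlobal(buffer);
            }
        }
    }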
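For anyone who wants to try the buffer handling without pdfium.dll, here is a minimal standalone sketch of the same allocate / fill / read / free pattern. `FakeNativeGetText` is a hypothetical stand-in I made up for illustration; it is not part of PDFium and only imitates a native function writing null-terminated UTF-16 into a caller-supplied buffer.

```csharp
using System;
using System.Runtime.InteropServices;

static class BufferSketch
{
    // Hypothetical stand-in for a native call such as FPDFText_GetText:
    // writes up to `count` UTF-16 code units plus a terminator into `result`.
    static int FakeNativeGetText(string source, int count, IntPtr result)
    {
        int n = Math.Min(count, source.Length);
        for (int i = 0; i < n; i++)
            Marshal.WriteInt16(result, i * 2, source[i]);
        Marshal.WriteInt16(result, n * 2, '\0');
        return n + 1;
    }

    public static string ReadText(string source)
    {
        int count = source.Length;
        if (count <= 0)
            return string.Empty;

        // Two bytes per UTF-16 code unit, plus one slot for the terminator.
        IntPtr buffer = Marshal.AllocHGlobal((count + 1) * 2);
        try
        {
            FakeNativeGetText(source, count, buffer);
            // Reads up to the first null terminator.
            return Marshal.PtrToStringUni(buffer);
        }
        finally
        {
            // Runs whether or not the read succeeded.
            Marshal.FreeHGlobal(buffer);
        }
    }

    static void Main()
    {
        Console.WriteLine(ReadText("Hello, PDF"));
    }
}
```

Using `Marshal.PtrToStringUni` instead of `new String((char*)buffer)` means the method no longer needs `unsafe`, and the `try`/`finally` keeps the buffer from leaking if anything throws in between.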

PdfDocument.cs, PdfDocument, add:

    /* PNM */
    public int GetCharacterCount (int page)
    {
        return _file.GetCharacterCount(page);
    }

    /* PNM */
    public string GetText (int page)
    {
        return _file.GetText(page);
    }

Just want to know if you'd prefer I fork this or if you want to incorporate it yourself. If I do fork it I'll probably add some more sophisticated extraction such as obtaining the coordinates of the text.

Also, are we still under LGPL3 or did you switch to Apache? I saw the issue was closed but the license still says LGPL3. Fine either way.

Peter

peter0302 commented 9 years ago

Sorry, my copy was way outdated. I see that this has already been officially added. Cheers!

karanfil commented 7 years ago

    // add to PdfDocument.cs

    // returns text of whole pdf document
    public string GetText()
    {
        string strReturn = "";
        for (int i = 0; i < PageSizes.Count; i++)
        {
            strReturn += GetText(i) + "\r\n";
        }
        return strReturn;
    }
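For documents with many pages, repeated `+=` on a string copies the accumulated text on every iteration. A possible variant using `StringBuilder` is sketched below. To keep it runnable on its own, the per-page lookup is passed in as a delegate: `getPageText` is a hypothetical stand-in for `GetText(int)`, and `JoinPages` is a name I picked for illustration, not something in PdfiumViewer.

```csharp
using System;
using System.Text;

static class WholeDocumentText
{
    // `getPageText` is a hypothetical stand-in for PdfDocument.GetText(int).
    public static string JoinPages(int pageCount, Func<int, string> getPageText)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < pageCount; i++)
        {
            // Same separator as the loop above: CRLF after every page.
            sb.Append(getPageText(i)).Append("\r\n");
        }
        return sb.ToString();
    }

    static void Main()
    {
        string[] pages = { "first page", "second page" };
        Console.Write(JoinPages(pages.Length, i => pages[i]));
    }
}
```

Inside `PdfDocument` the call would presumably look like `JoinPages(PageSizes.Count, GetText)`, assuming the single-page `GetText(int)` from this thread is present.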