sambitdash / PDFIO.jl

PDF Reader Library for Native Julia.

Problem extracting text from a two-column layout #112

Closed manentai closed 8 months ago

manentai commented 8 months ago

I have opened a question here: https://discourse.julialang.org/t/how-to-extract-data-from-pdf-with-two-columns/108008 but maybe this is the right place...

from this slide

using this:

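using PDFIO

# src is the path to the PDF file
doc = pdDocOpen(src)
docinfo = pdDocGetInfo(doc)
npage = pdDocGetPageCount(doc)
io = IOBuffer()
for i = 1:npage
    page = pdDocGetPage(doc, i)
    pdPageExtractText(io, page)
end
pdDocClose(doc)
text = String(take!(io))  # extracted text for all pages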

I get this:

gives a poor performance: ● SAM: The solution ○ EU incubators / accelerators: ~1200 market size ○ EU VC: ~500 ● SOM: ○ IT incubators / ● TAM:○ 1.35M tech startups accelerators: ~250, worldwide ○ IT VCs: ~60 ○ ~7000 incubators / accelerators worldwide, ○ proxies to enter and counting 5000 startups ○ ~2500 VC firms, half of them on growth giano.rocks

This output essentially ignores the two-column layout: text from the left and right columns is interleaved. I am trying to work out from the API whether there is a way to extract all the objects on the page and then extract the text from each of those objects. What am I missing?

sambitdash commented 8 months ago

#17 and #2 are enhancements recorded for similar requirements. Hence, closing this issue as a duplicate.

manentai commented 8 months ago

Thanks for the pointers. Since pdPageEvalContent from #17 does not seem to exist anymore, is pdPageGetContentObjects the right function to get the page structure now? If so, how do I iterate over the PDPageObjectGroup it returns?

sambitdash commented 8 months ago

https://github.com/sambitdash/PDFIO.jl/blob/826552f6460513b15e15b15b2505bb2e400117ed/src/PDPage.jl#L11 is exported and is very much available. However, this method is a callback and depends on the PDF page object type. It is not documented and can only be used by someone who understands PDF specifications well enough to work with the callback.

manentai commented 8 months ago

Sorry, I cannot find it here: https://docs.juliahub.com/PDFIO/cmOJE/0.1.14/, and when I try to use it I get:

UndefVarError: `pdPageEvalContent` not defined
Stacktrace: [1] top-level scope

manentai commented 8 months ago

> https://github.com/sambitdash/PDFIO.jl/blob/826552f6460513b15e15b15b2505bb2e400117ed/src/PDPage.jl#L11 is exported and is very much available. However, this method is a callback and depends on the PDF page object type. It is not documented and can only be used by someone who understands PDF specifications well enough to work with the callback.

ok fair enough I guess...

sambitdash commented 8 months ago

function pdPageExtractText(io::IO, page::PDPage)
    state = pdPageEvalContent(page)
    show_text_layout!(io, state)
    return io
end

pdPageExtractText calls pdPageEvalContent, which populates the content objects in the state object. show_text_layout! then does the actual layout translation. It has no understanding of multi-column text; it simply maps the text from left to right and top to bottom. PDF as a format has no notion of columns. However, you can take the content objects and render the text objects using your own heuristics. That is the suggestion given in #17.
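
To make the "own heuristics" idea concrete, here is a minimal sketch of a two-column reading-order heuristic. It does not use PDFIO at all: the `TextFragment` type, the `read_two_columns` function and the `split_x` parameter are hypothetical names introduced for illustration, and it assumes you have already obtained each text fragment's position from the evaluated page content, which this thread does not show how to do. It also assumes the columns are separated at a known x coordinate (for example, the page midpoint).

```julia
# Hypothetical record for one piece of positioned text; PDFIO does not define this type.
struct TextFragment
    x::Float64    # left edge of the fragment, in PDF user-space units
    y::Float64    # baseline of the fragment; PDF y coordinates grow upward
    text::String
end

# Split fragments into a left and a right column at `split_x`, then read each
# column top to bottom (descending y) and left to right within a line.
function read_two_columns(frags::Vector{TextFragment}, split_x::Real)
    left  = filter(f -> f.x <  split_x, frags)
    right = filter(f -> f.x >= split_x, frags)
    reading_order(col) = sort(col; by = f -> (-f.y, f.x))
    return [f.text for f in vcat(reading_order(left), reading_order(right))]
end

# Made-up fragments in arbitrary order, standing in for real page content.
frags = [
    TextFragment(300.0, 700.0, "right 1"),
    TextFragment(50.0, 700.0, "left 1"),
    TextFragment(50.0, 680.0, "left 2"),
    TextFragment(300.0, 680.0, "right 2"),
]

ordered = read_two_columns(frags, 200.0)
println(join(ordered, "\n"))
```

A fixed split point is the simplest possible choice; a real slide may need the split inferred from gaps in the x positions, or more than two columns.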