support different extraction modes

hhaensel commented 1 year ago

Currently, all white space characters in a textbox are merged into a single space character (' '). This makes it very difficult to extract tabular data.

Here, I propose to introduce an extraction mode parameter that allows the user to choose between three extraction modes.

:spaces (default) all white spaces are handled as a single space character
:tabs non-space white spaces are handled as tab characters
:boxes text between non-space white spaces is split into several textboxes with respective coordinates
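
To make the three modes concrete, here is a minimal sketch. It assumes a textbox can be modelled as strings interleaved with numeric gaps (loosely like a PDF TJ array); the `Item` alias and the `render` function are hypothetical and are not part of PDFIO.

```julia
# Hypothetical model of one textbox: strings interleaved with numeric gaps.
# This is an illustration only, not PDFIO's internal representation.
const Item = Union{String,Float64}

# Render a textbox according to the proposed mode (:spaces, :tabs or :boxes).
function render(items::Vector{Item}, mode::Symbol)
    mode in (:spaces, :tabs, :boxes) || throw(ArgumentError("unknown mode $mode"))
    pieces = String[""]
    for item in items
        if item isa String
            pieces[end] *= item      # text is appended to the current piece
        elseif mode == :spaces
            pieces[end] *= " "       # gap becomes a single space
        elseif mode == :tabs
            pieces[end] *= "\t"      # gap becomes a tab
        else
            push!(pieces, "")        # :boxes, gap starts a new piece
        end
    end
    return pieces
end

row = Item["Name", 12.0, "Qty", 8.0, "Price"]
render(row, :spaces)  # ["Name Qty Price"]
render(row, :tabs)    # ["Name\tQty\tPrice"]
render(row, :boxes)   # ["Name", "Qty", "Price"]
```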

For this purpose, get_TextBox() no longer returns a tuple (text, w, h) but a vector of tuples (text, w, h, offset). During evalContent!(), the vector is iterated to return a TextLayout for each set of box parameters. For the modes :spaces and :tabs, get_TextBox() always returns a single-element vector, whereas in :boxes mode more than one TextLayout might be added to the output.
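
The change in return shape could look roughly like the sketch below. The names `Layout`, `textbox_sketch` and `layouts_sketch`, and the fixed glyph width, are assumptions made for illustration; the real get_TextBox() and evalContent!() in PDFIO have different signatures and compute widths from font metrics.

```julia
# Hypothetical stand-ins; these are not PDFIO's actual types or signatures.
struct Layout
    text::String
    x::Float64   # left edge of the box
    y::Float64
    w::Float64
    h::Float64
end

const CHAR_W = 5.0   # assumed fixed glyph width, for illustration only
const LINE_H = 10.0  # assumed box height

# Return a vector of (text, w, h, offset) tuples. In :boxes mode every gap
# closes the current tuple and starts a new one; in the other modes the gap
# becomes a separator character and a single-element vector is returned.
function textbox_sketch(items::Vector{Union{String,Float64}}, mode::Symbol)
    sep = mode == :tabs ? "\t" : " "
    boxes = Tuple{String,Float64,Float64,Float64}[]
    text, start, cursor = "", 0.0, 0.0
    for item in items
        if item isa String
            text *= item
            cursor += CHAR_W * length(item)
        elseif mode == :boxes
            push!(boxes, (text, cursor - start, LINE_H, start))
            cursor += item
            text, start = "", cursor
        else
            text *= sep
            cursor += item
        end
    end
    push!(boxes, (text, cursor - start, LINE_H, start))
    return boxes
end

# Iterate the vector and emit one Layout per tuple, shifted by its offset.
function layouts_sketch(items, mode::Symbol; x0 = 0.0, y0 = 0.0)
    return [Layout(t, x0 + off, y0, w, h) for (t, w, h, off) in textbox_sketch(items, mode)]
end

cells = Union{String,Float64}["Name", 12.0, "Qty", 8.0, "Price"]
layouts_sketch(cells, :boxes)   # three Layouts with x = 0.0, 32.0, 55.0
layouts_sketch(cells, :spaces)  # one Layout with text "Name Qty Price"
```

Carrying the offset in each tuple is what lets the caller place every split box at its own x coordinate without re-measuring the preceding text.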

The :spaces mode reproduces the current extraction behavior. The :tabs mode is suited for extraction of "well-behaved" tabular data, i.e. tables with no empty cells, or whose empty cells contain at least a space character. The :boxes mode is essential to extract tables that contain empty cells. In that case further textbox treatment is necessary, which I would provide in a separate PR.

@sambitdash Please comment if this sounds like a desired feature to you. If so, we can still discuss whether control via a global variable is the best choice or whether we'd rather implement a keyword arg which is passed through the text extraction function chain.

sambitdash commented 1 year ago

@hhaensel I like the idea, but I do not think it will work for all use cases, although it may work for the files you have.

What is really needed is a proper estimation of space, and expanding the table width to fit various text sizes.

For example, in most cases, the table fonts are smaller than the rest of the text. One option is to decide the layout of the table, which is wider than the rest of the page, and fill in the exact number of spaces as desirable. While some fonts have a space character with a width, most do not. That's why there is this -Tj > 180 heuristic. This whole extraction needs some holistic analysis.
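
As a rough illustration of such a heuristic: in a PDF TJ array, numeric entries are expressed in thousandths of a text-space unit and are subtracted from the current position, so a large negative value opens a visible gap. The sketch below assumes that "-Tj > 180" means "treat an adjustment below -180 as a space"; the function names are hypothetical and this is not PDFIO's actual code.

```julia
# Threshold quoted in the discussion; how PDFIO applies it exactly is assumed.
const SPACE_THRESHOLD = 180

# A TJ adjustment counts as a space when its negation exceeds the threshold.
is_space_gap(tj::Real) = -tj > SPACE_THRESHOLD

# Concatenate the strings of a TJ-like array, inserting a space for large gaps
# and ignoring small (kerning-sized) adjustments.
function tj_to_text(items::Vector{Union{String,Float64}})
    io = IOBuffer()
    for item in items
        if item isa String
            print(io, item)
        elseif is_space_gap(item)
            print(io, ' ')
        end
    end
    return String(take!(io))
end

tj_to_text(Union{String,Float64}["Hel", -20.0, "lo", -250.0, "World"])  # "Hello World"
```

A fixed threshold like this is exactly what breaks down when table fonts are smaller than body text, which is why a layout-aware estimate of space width is suggested above.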

hhaensel commented 1 year ago

I agree completely when we consider text extraction.

The purpose of my PR is not primarily to optimise text extraction (yet); it is rather meant to optimise the underlying vector of TextLayout elements. In a second step, though, I am quite sure we could also optimise the text extraction.

Let me be a bit more precise:

For my table parsing algorithm, I decided not to rely on the full text extraction but rather on the processing of TextLayouts. Therefore I've written pdExtractLayouts(), which returns a Vector{TextLayout}. The table parsing algorithm is quite robust as long as I have both the text and its coordinates at hand for each table element. But many PDF generators place several table cells in a single textbox and place white space of a certain length in between. Hence I don't have the coordinates of the following cells at hand. Therefore, in :boxes mode I split a single textbox with white spaces into several TextLayouts with their own coordinates, which can then be processed by my table algorithm.
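
To show why per-cell coordinates matter, here is a minimal sketch of grouping boxes into table rows. It is not the table algorithm mentioned above, which is not shown in this thread; `Cell` and `rows_sketch` are hypothetical names, and `Cell` stands in for a TextLayout reduced to the fields used here.

```julia
# Hypothetical record with only the fields this sketch needs.
struct Cell
    text::String
    x::Float64
    y::Float64
end

# Group cells into rows by y (within a tolerance), top row first, then order
# each row left to right by x. PDF y coordinates grow upward.
function rows_sketch(cells::Vector{Cell}; tol = 2.0)
    rows = Vector{Vector{Cell}}()
    for c in sort(cells; by = k -> -k.y)
        if !isempty(rows) && abs(rows[end][1].y - c.y) <= tol
            push!(rows[end], c)
        else
            push!(rows, [c])
        end
    end
    return [[k.text for k in sort(r; by = k -> k.x)] for r in rows]
end

table = [Cell("2", 50.0, 90.0), Cell("A", 0.0, 100.0),
         Cell("B", 50.0, 100.5), Cell("1", 0.0, 90.0)]
rows_sketch(table)  # [["A", "B"], ["1", "2"]]
```

If "A" and "B" arrived as one textbox with a gap between them, only the x coordinate of "A" would be known, and the column of "B" could not be recovered this way.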

I think that's not something very special to my documents. Moreover, it should even make text extraction easier, as there is no need to handle white spaces any longer.

I admit that the algorithm gets into trouble when I process PDF documents that have been produced by OCR software, as such software cannot easily distinguish between spaces and white spaces.

sambitdash / PDFIO.jl

support different extraction modes #106