sambitdash / PDFIO.jl

PDF Reader Library for Native Julia.

Text extraction ignores different kinds of white spaces #107

Open hhaensel opened 1 year ago

hhaensel commented 1 year ago

Currently, all white space characters in a textbox are merged into a single space character (' '). This makes it very difficult to extract tabular data.

In #106, I propose to introduce an extraction mode parameter that allows the user to choose between three extraction modes.

For this purpose, get_TextBox() no longer returns a tuple (text, w, h) but a vector of tuples (text, w, h, offset). During evalContent!(), the vector is iterated to return a TextLayout for each set of box parameters. In the :spaces and :tabs modes, get_TextBox() always returns a single-element vector, whereas in :boxes mode more than one TextLayout might be added to the output.

The :spaces mode reproduces the current extraction behavior. The :tabs mode is suited for extraction of "well-behaved" tabular data, i.e. tables with no empty cells, or whose empty cells contain at least a space character. The :boxes mode is essential to extract tables that contain empty cells. In that case further textbox treatment is necessary, which I would provide in a separate PR.
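To illustrate how the three modes differ, here is a minimal toy sketch. It is not the code from #106 and does not use PDFIO's internals: the name layout_segments, the Segment type, the tab_gap threshold, and the simplified offset calculation (which ignores text widths) are all assumptions made for illustration only.

```julia
# Hypothetical toy model, NOT PDFIO's actual get_TextBox().
# A text run is a list of (text, gap) segments, where `gap` is the
# horizontal distance to the previous segment in arbitrary units.
const Segment = Tuple{String,Float64}

# Returns a vector of (text, offset) tuples, mirroring the idea of
# returning a vector instead of a single tuple.
function layout_segments(segments::Vector{Segment}, mode::Symbol;
                         tab_gap::Float64 = 5.0)
    if mode === :boxes
        # One output element per segment; the offset is the cumulative
        # gap (a simplification that ignores the width of the text).
        out = Tuple{String,Float64}[]
        offset = 0.0
        for (text, gap) in segments
            offset += gap
            push!(out, (text, offset))
        end
        return out
    elseif mode === :spaces || mode === :tabs
        io = IOBuffer()
        for (i, (text, gap)) in enumerate(segments)
            if i > 1
                # :spaces always emits ' '; :tabs emits '\t' for gaps at
                # or above the assumed threshold `tab_gap`.
                sep = (mode === :tabs && gap >= tab_gap) ? '\t' : ' '
                print(io, sep)
            end
            print(io, text)
        end
        # Single-element vector, as described for :spaces and :tabs.
        return [(String(take!(io)), 0.0)]
    else
        throw(ArgumentError("unknown extraction mode: $mode"))
    end
end

segments = Segment[("Name", 0.0), ("Qty", 12.0), ("Price", 9.0)]
spaces = layout_segments(segments, :spaces)  # one element, ' ' separators
tabs   = layout_segments(segments, :tabs)    # one element, '\t' separators
boxes  = layout_segments(segments, :boxes)   # one element per segment
```

In this toy model an empty table cell would simply be a wider gap between two segments, which is why only the :boxes output, with its per-segment offsets, keeps enough information to place each value in the right column.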

@sambitdash Please comment if this sounds like a desired feature to you. If so, we can still discuss whether control via a global variable is the best choice or whether we'd rather implement a keyword argument that is passed through the text extraction function chain.
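For reference, the two control options could look roughly like this. This is only a sketch to frame the discussion: EXTRACTION_MODE, normalize_ws, extract_global, and extract_kw are hypothetical names, and the whitespace handling is a stand-in rather than PDFIO's extraction logic.

```julia
# Hypothetical stand-in for the extraction step (not PDFIO code).
function normalize_ws(text::AbstractString, mode::Symbol)
    # :spaces merges every run of white space into one ' '.
    mode === :spaces && return replace(text, r"\s+" => " ")
    # :tabs merges runs of plain spaces only, so tabs survive.
    mode === :tabs && return replace(text, r" +" => " ")
    throw(ArgumentError("unknown extraction mode: $mode"))
end

# Option A: a global setting (assumed name). No signatures change,
# but every caller in the process is affected by the same value.
const EXTRACTION_MODE = Ref(:spaces)
extract_global(text) = normalize_ws(text, EXTRACTION_MODE[])

# Option B: a keyword argument. Each call picks its own mode, but the
# keyword has to be threaded through every function in the chain.
extract_kw(text; mode::Symbol = :spaces) = normalize_ws(text, mode)

sample = "a  \t  b"
kw_default = extract_kw(sample)               # default :spaces mode
kw_tabs    = extract_kw(sample; mode = :tabs)
EXTRACTION_MODE[] = :tabs                     # flips the mode for all callers
global_tabs = extract_global(sample)
```

The global variant is less invasive to implement, while the keyword variant keeps calls independent of each other, which matters if two documents are extracted with different modes in the same session.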