tabulapdf / tabula-java

Extract tables from PDF files
MIT License

Is there any way to extract only particular columns by column name, without specifying the area? #229

Open rakshitcgupta opened 6 years ago

criztovyl commented 6 years ago

What's your use-case?

rakshitcgupta commented 6 years ago

I want only debit and credit columns from the bank statements

criztovyl commented 6 years ago

Tabula is for extraction only, so I think this is not possible at the moment.

But it should be easy to post-process the Tabula result so you get only the columns you want.
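
A minimal sketch of that kind of post-processing, assuming the extracted table is already available as rows of strings with the header as the first row. The class `ColumnFilter` and its methods are hypothetical names for illustration and are not part of tabula-java. With tabula-java the rows could presumably be built from `Table.getRows()` and each cell's `getText()`, but that mapping is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of tabula-java.
final class ColumnFilter {

    // Returns the index of each wanted name in the header row, in the order
    // requested. Matching ignores case and surrounding whitespace; names that
    // are not found are skipped.
    static List<Integer> findColumns(List<String> header, List<String> wanted) {
        List<Integer> indices = new ArrayList<>();
        for (String name : wanted) {
            for (int i = 0; i < header.size(); i++) {
                if (header.get(i).trim().equalsIgnoreCase(name)) {
                    indices.add(i);
                    break;
                }
            }
        }
        return indices;
    }

    // Keeps only the wanted columns. Assumes rows.get(0) is the header row,
    // since the extracted table has no separate notion of a header.
    static List<List<String>> select(List<List<String>> rows, List<String> wanted) {
        List<List<String>> out = new ArrayList<>();
        if (rows.isEmpty()) {
            return out;
        }
        List<Integer> indices = findColumns(rows.get(0), wanted);
        for (List<String> row : rows) {
            List<String> kept = new ArrayList<>();
            for (int i : indices) {
                // Short rows are padded with empty cells.
                kept.add(i < row.size() ? row.get(i) : "");
            }
            out.add(kept);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for an extracted bank statement.
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("Date", "Description", "Debit", "Credit", "Balance"),
            Arrays.asList("01/02", "ATM withdrawal", "50.00", "", "950.00"),
            Arrays.asList("03/02", "Salary", "", "2000.00", "2950.00"));
        for (List<String> row : select(rows, Arrays.asList("Debit", "Credit"))) {
            System.out.println(String.join(",", row));
        }
    }
}
```

This only works when the first extracted row really is the header. If the statement has a title line or a multi-line header above the table, the header row would have to be located first.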

rakshitcgupta commented 6 years ago

I did the column extraction as a post-processing step in the CSVWriter class, but that only works once Tabula has extracted the whole table. I actually want to specify the columns in the pre-processing phase, which could reduce the complexity. So I wanted to know where exactly the header extraction/detection is done in the code.

criztovyl commented 6 years ago

I'm not sure where that code is atm, let's see if I can find it.

criztovyl commented 6 years ago

A quick look didn't bring up what you want. If I understand correctly, Tabula does not even have a concept of headings - they're just the first line of the table.

On a side note, why do you think that specifying the column names you want beforehand reduces complexity? I would say it even increases complexity, if my statement about Tabula having no concept of headers is correct.

Or are you scared about memory consumption? Here I would say that reading a CSV can be done line-by-line and most CSV parsers most likely won't consume that much memory.
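
To illustrate the memory point, here is a rough sketch of a line-by-line filter that holds only one CSV line in memory at a time. `StreamingCsvFilter` is a hypothetical name, and the naive `split(",")` is an assumption that no field contains quoted commas or embedded newlines. Real bank statement output would probably need a proper CSV parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of tabula-java.
final class StreamingCsvFilter {

    // Copies only the wanted columns from in to out, one line at a time.
    // Assumes the first line is the header and that fields contain no
    // quoted commas (naive split, not a full CSV parser).
    static void filter(Reader in, Writer out, List<String> wanted) {
        try {
            BufferedReader reader = new BufferedReader(in);
            String line = reader.readLine();
            if (line == null) {
                return; // empty input
            }
            // Work out which column indices to keep from the header line.
            String[] header = line.split(",", -1);
            List<Integer> keep = new ArrayList<>();
            for (int i = 0; i < header.length; i++) {
                if (wanted.contains(header[i].trim())) {
                    keep.add(i);
                }
            }
            // Write the header and every following line, filtered.
            while (line != null) {
                String[] cells = line.split(",", -1);
                StringBuilder sb = new StringBuilder();
                for (int k = 0; k < keep.size(); k++) {
                    if (k > 0) {
                        sb.append(',');
                    }
                    int i = keep.get(k);
                    if (i < cells.length) {
                        sb.append(cells[i]);
                    }
                }
                out.write(sb.toString());
                out.write('\n');
                line = reader.readLine();
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample data; in practice the Reader would wrap a file.
        String csv = "Date,Debit,Credit\n01/02,50.00,\n03/02,,2000.00\n";
        StringWriter result = new StringWriter();
        filter(new StringReader(csv), result, Arrays.asList("Debit", "Credit"));
        System.out.print(result);
    }
}
```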

Sorry for the discussion; I'm just trying to help with what I know. :)

rakshitcgupta commented 6 years ago

First of all, I have gone through the Nurminen algorithm for table detection, which is used in the code: https://dspace.cc.tut.fi/dpub/bitstream/handle/123456789/21520/Nurminen.pdf?sequence=3 The thesis mentions header detection, but no header detection algorithm is implemented in the code.

Yeah I understand that this repository does not even have a concept of headings - they're just the first line of the table. I wanted help with finding the first line of the table.

criztovyl commented 6 years ago

Okay, then I can't help you, sorry.

But I saw #230 earlier, so I thought this issue here was about getting the data from a PDF and the other one about header detection in Tabula.

rakshitcgupta commented 6 years ago

Yeah, I just want help with finding where the first line of the table is handled in the code, so that I can modify the code to extract particular columns at that point.

thubamamba commented 1 year ago

Did you ever get around to this @rakshitcgupta?

rakshitcgupta commented 12 months ago

> Did you ever get around to this @rakshitcgupta?

I don't even remember. It's been 5 years. 😞