Open rakshitcgupta opened 6 years ago
I want only debit and credit columns from the bank statements
Tabula only handles extraction, so I think this is not possible at the moment.
But it should be easy to post-process the Tabula result so that you get only the columns you want.
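A minimal sketch of that kind of post-processing, assuming Tabula's output has been saved as CSV and that the first row holds the column names. The function name `keep_columns`, the column names, and the sample data are made up for illustration; they are not part of Tabula.

```python
import csv
import io


def keep_columns(lines, wanted):
    """Yield rows that contain only the `wanted` columns, in the order given.

    `lines` is any iterable of CSV text lines (an open file works too).
    Assumption: the first row is the header row.
    """
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        return  # empty input, nothing to yield
    names = [h.strip().lower() for h in header]
    # Raises ValueError if a wanted column is not present in the header.
    indices = [names.index(w.lower()) for w in wanted]
    yield list(wanted)
    for row in reader:
        # Pad short rows with empty strings instead of failing.
        yield [row[i] if i < len(row) else "" for i in indices]


# Hypothetical bank-statement data standing in for Tabula's CSV output.
sample = io.StringIO(
    "Date,Description,Debit,Credit,Balance\n"
    "2019-01-02,ATM withdrawal,100.00,,900.00\n"
    "2019-01-05,Salary,,2500.00,3400.00\n"
)
filtered = list(keep_columns(sample, ["Debit", "Credit"]))
```

Because the function is a generator, rows are handled one at a time, so the whole table never has to sit in memory at once.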
I did the column extraction as a post-processing step in the CSVWriter class, but that only runs after Tabula has extracted the whole table. I actually want to specify the columns in the pre-processing phase, which could reduce the complexity. So I wanted to know where exactly the header extraction/detection is done in the code.
I'm not sure where that code is atm, let's see if I can find it.
A quick look didn't bring up what you want. If I understand correctly, Tabula does not even have a concept of headings - they're just the first line of the table.
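If headers really are just ordinary rows, one way to locate them after extraction is to scan for the first row that contains all the column names you care about. This is only a sketch under that assumption; `find_header_row` and the sample rows are hypothetical and do not come from the Tabula code base.

```python
def find_header_row(rows, wanted):
    """Return the index of the first row containing every name in `wanted`.

    The comparison ignores case and surrounding whitespace.
    Returns None if no row matches.
    """
    wanted_set = {w.strip().lower() for w in wanted}
    for index, row in enumerate(rows):
        cells = {cell.strip().lower() for cell in row}
        if wanted_set <= cells:  # all wanted names appear in this row
            return index
    return None


# Made-up extracted rows: a title line followed by the real header.
rows = [
    ["Account statement", "", "", ""],
    ["Date", "Description", "Debit", "Credit"],
    ["2019-01-02", "ATM withdrawal", "100.00", ""],
]
header_index = find_header_row(rows, ["Debit", "Credit"])
```

This helps when a statement has title or address lines above the actual table, so the header is not literally row zero.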
On a side note, why do you think that specifying the column names you want beforehand reduces complexity? I would say it even increases complexity, if my statement about Tabula having no concept of headers is correct.
Or are you worried about memory consumption? In that case, I would say that a CSV can be read line by line, and most CSV parsers are unlikely to consume much memory.
Sorry for the discussion; I'm just trying to help with what I know. :)
First of all, I have gone through the Nurminen algorithm for table detection, which is used in the code: https://dspace.cc.tut.fi/dpub/bitstream/handle/123456789/21520/Nurminen.pdf?sequence=3 The thesis mentions header detection, but no header detection algorithm is implemented in the code.
Yeah I understand that this repository does not even have a concept of headings - they're just the first line of the table. I wanted help with finding the first line of the table.
Okay, then I can't help you, sorry.
But I saw #230 earlier, so I thought this issue here was about getting the data from a PDF and the other one about header detection in Tabula.
Yeah, I just want help with finding the first line of the table in the code, so that I can modify the code to extract particular columns at that point.
Did you ever get around to this @rakshitcgupta?
I don't even remember. It's been 5 years. 😞
What's your use-case?