Open manuelsongokuh opened 8 years ago
See a similar example of adding lines:
Hi!
This would certainly be a useful feature. As far as I can see, however, tesseract (the OCR engine used by gImageReader) does not have table extraction support built in. Perhaps you could do some research on whether this is really the case, and if so, whether some algorithms are already available to do the job?
OK, but I don't think any algorithm is needed, because the table lines (horizontal and vertical) simply split the OCR scan area into groups.
Example:
- 3 vertical lines and 3 horizontal lines = 9 cells, i.e. 9 groups.
- The 9 groups get ID numbers (A1, A2, A3, B1, B2, B3, C1, C2, C3).
- OCR scans A1, finishes, and saves the text: XXX>>
- OCR scans A2, finishes, and saves the text: XXX>>YYY>>
- OCR scans A3, finishes, and saves the text: XXX>>YYY>>ZZZ
- OCR scans B1, finishes, and saves the text: XXX>>YYY>>ZZZ##newline##XXX>>
- OCR scans B2, finishes, and saves the text: XXX>>YYY>>ZZZ##newline##XXX>>YYY>>
- OCR scans B3, finishes, and saves the text: XXX>>YYY>>ZZZ##newline##XXX>>YYY>>ZZZ
- OCR scans C1, finishes, and saves the text: XXX>>YYY>>ZZZ##newline##XXX>>YYY>>ZZZ##newline##XXX>>
- OCR scans C2, finishes, and saves the text: XXX>>YYY>>ZZZ##newline##XXX>>YYY>>ZZZ##newline##XXX>>YYY>>
- OCR scans C3, finishes, and saves the text: XXX>>YYY>>ZZZ##newline##XXX>>YYY>>ZZZ##newline##XXX>>YYY>>ZZZ

When OCR has finished all cells, the text result is:
XXX>>YYY>>ZZZ
XXX>>YYY>>ZZZ
XXX>>YYY>>ZZZ
- Save as CSV.
- Open LibreOffice Calc.
- LibreOffice opens the CSV file.
- The text import (CSV) dialog appears; enable the separator option "Tab" and click OK.
- Result: a perfect table in cells.

Can this be done?
Note: ">>" = tab character
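The cell-by-cell scan described above can be sketched roughly as follows. This is only an illustration, not gImageReader code: `split_into_cells`, `cell_id`, and `ocr_table` are hypothetical names, and `ocr_cell` stands in for whatever function actually runs OCR on one rectangle. The sketch also assumes the divider lines are the interior ones, so two dividers per direction give a 3x3 grid.

```python
from typing import Callable, List, Tuple

# (left, top, right, bottom) in pixels; a hypothetical convention for this sketch.
Box = Tuple[int, int, int, int]


def split_into_cells(area: Box, v_lines: List[int], h_lines: List[int]) -> List[List[Box]]:
    """Split the selection rectangle into a grid of cell boxes, row by row."""
    left, top, right, bottom = area
    xs = [left] + sorted(v_lines) + [right]   # column boundaries
    ys = [top] + sorted(h_lines) + [bottom]   # row boundaries
    return [
        [(xs[c], ys[r], xs[c + 1], ys[r + 1]) for c in range(len(xs) - 1)]
        for r in range(len(ys) - 1)
    ]


def cell_id(row: int, col: int) -> str:
    """Name cells as in the example: A1, A2, A3, B1, ... (at most 26 rows)."""
    return f"{chr(ord('A') + row)}{col + 1}"


def ocr_table(area: Box, v_lines: List[int], h_lines: List[int],
              ocr_cell: Callable[[Box], str]) -> str:
    """OCR each cell in order; join cells with tabs and rows with newlines."""
    rows = split_into_cells(area, v_lines, h_lines)
    return "\n".join(
        "\t".join(ocr_cell(box).strip() for box in row) for row in rows
    )
```

Saving the returned string to a file and importing it into LibreOffice Calc with "Tab" as the separator should then give one table cell per scanned cell.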
Could an awk or sed command do this text processing in the background..?
I don't think algorithms are needed.. that's just my thinking..
Well the problem is how to detect the table cell areas from the image, if I understand correctly what you are saying.
Ah, maybe my English is bad; let me try again. I don't think automatic (robot) detection is needed. I know that would take a long time to code, maybe impossible! But there is an easier way: gImageReader already has a rectangle selection (the area for the OCR scan), which is fine. When there is a table on the page, you (or any user) would manually create the rectangle selection, then add horizontal and vertical lines inside it. That way the table is defined by hand, and then the OCR scan starts. That is the goal..
But not automatic detection, OK? That takes a long time, so avoid it.
Adding lines to the area instead should be possible in a short time.. OK?
Did you try Okular from KDE? It has a tool named "select table" where you add the lines manually (horizontal, vertical).
I did my personal finance table this way: area 1 for column 1, area 2 for column 2, area 3 for column 3. I started the scan, saved the result as TXT, opened LibreOffice Calc, and moved the text from underneath into new columns, because gImageReader's OCR output in the TXT looks like this:
XXX XXX XXX
YYY YYY YYY
ZZZ ZZZ ZZZ
So I move YYY and ZZZ to column 2 and column 3 in LibreOffice Calc..
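Until a table tool exists, the manual move in LibreOffice could in principle be scripted. The following is a minimal sketch, assuming each column area was OCRed separately and its text has one value per line; `columns_to_tsv` is a hypothetical helper, not part of gImageReader.

```python
from itertools import zip_longest
from typing import List


def columns_to_tsv(column_texts: List[str]) -> str:
    """Turn per-column OCR text (one value per line) into tab-separated rows."""
    columns = [
        [line.strip() for line in text.splitlines() if line.strip()]
        for text in column_texts
    ]
    # Pair up the n-th value of every column; pad short columns with "".
    rows = zip_longest(*columns, fillvalue="")
    return "\n".join("\t".join(row) for row in rows)
```

One caveat: blank lines are dropped, so an empty cell in the middle of a column would shift the values below it up by one row.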
So I think gImageReader could offer this manual table area and save a lot of time..
Okular in KDE4 works well; you can try it and you will understand what I mean. It's easy for me to use the Okular table area, but it has no OCR.. smile.. so gImageReader could do it :+1: :+1: :+1: :+1: :+1:
Aha ok I see what you mean, yes if the user defines the table geometry manually then it is definitely easier.
Note: I did my 6 pages of tables with gImageReader and the result is perfect! But moving the texts around in LibreOffice takes a long time..
I hope gImageReader will help me with my 200 pages of personal finances..
Ah perfect! I will wait for the table area geometry. GO GO GO GIMAGEREADER!
Sorry, I'm dying to know: when will you add the small feature for lines in an area ("manual table geometry") like in Okular..? And what is the feature called? (You can change my issue title to the correct title for the feature, OK?)
I'd like to know when it will be released, or in which milestone?
I love using gImageReader.. but I will wait for my 200 pages.. thank you
Given that gImageReader is purely a spare time project, it really depends on how I'm doing spare time-wise. I'm currently (finally) finishing up an initial implementation for an hOCR editor with PDF generation support, then I'll look at this. Of course, if you have some knowledge of coding, I always welcome contributions.
OK. Sorry, I'm not a programmer.. sorry... if I were a programmer I would help you.. sorry..
Never too late to learn ;) Anyways, perhaps I'll manage by the end of the month. I'll tell you when there is something to test, before I release a new version.
I found this phrase: "One interesting tool is the Table selection, which allows you to select a rectangular area, and then divide it into rows and columns. Text selected this way will be available for pasting with rows delimited by newlines and columns delimited by two tab characters." This is from Okular.
I can help you find information for coding something similar, OK?
I'm familiar with how okular works, so it is pretty much a matter of just coding the implementation.
I don't know if this is helpful for you? http://tex.stackexchange.com/questions/279846/split-the-selection-area-of-two-columns-tabular-or-minipage-or-whatever-works
https://github.com/KDE/okular/search?utf8=%E2%9C%93&q=table
http://stackoverflow.com/questions/488089/extracting-tables-from-pdf-files-programmatically
I've also used tabula in some cases. There is an effort to combine tabula with tesseract to do exactly this.
I'll be following these two repositories fairly closely from here on out!
This feature sounds fun; I'd like to implement it. But tables come in many different forms, so how do we detect the column and row lines of different tables?
Okular has a table tool which can serve as inspiration (i.e. it requires the user to mark the row and column boundaries). It should also be possible to autodetect them with a smart algorithm. But the main problem is what to do with the result IMO. It doesn't fit into the workflow currently.
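As a rough idea of what a very simple autodetection could look like for tables with drawn ruling lines, here is a projection-profile sketch. It is not what Okular, Leptonica, or gImageReader do. It assumes the image is already binarized (1 = dark pixel) and deskewed, and `find_ruling_lines`, `merge_runs`, and the `min_fill` threshold are hypothetical choices for illustration.

```python
from typing import List, Tuple


def merge_runs(indices: List[int]) -> List[int]:
    """Collapse runs of consecutive indices (a thick line) into their centre."""
    merged: List[int] = []
    run: List[int] = []
    for i in indices:
        if run and i != run[-1] + 1:
            merged.append((run[0] + run[-1]) // 2)
            run = []
        run.append(i)
    if run:
        merged.append((run[0] + run[-1]) // 2)
    return merged


def find_ruling_lines(binary: List[List[int]],
                      min_fill: float = 0.9) -> Tuple[List[int], List[int]]:
    """Return (row positions, column positions) of near-solid dark lines.

    A pixel row or column counts as a ruling line when at least `min_fill`
    of its pixels are dark.
    """
    height = len(binary)
    width = len(binary[0]) if height else 0
    if not width:
        return [], []
    rows = [y for y in range(height) if sum(binary[y]) >= min_fill * width]
    cols = [
        x for x in range(width)
        if sum(binary[y][x] for y in range(height)) >= min_fill * height
    ]
    return merge_runs(rows), merge_runs(cols)
```

This only covers fully ruled tables; borderless tables, merged cells, or skewed scans would need something smarter or the manual approach.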
Leptonica has some new table detection features - please see https://github.com/DanBloomberg/leptonica/search?q=table&type=Commits&utf8=%E2%9C%93
> Currently working on hOCR editor with pdf export... [4 years ago]

Is that project finished?
I wonder if the new hOCR features in Tesseract 5 can help in creating tables?
BTW if we can save the OCRed file as HTML or RTF, we can open it in LibreOffice and convert it to a table. If the original table has merged cells or nested tables, we can create that in LibreOffice.
> I wonder if the new hOCR features in Tesseract 5 can help in creating tables?

Do you have any reference you can point to?
> BTW if we can save the OCRed file as HTML or RTF, we can open it in LibreOffice and convert it to a table.

The hOCR format is actually HTML, which you can save as such. You can also export to ODT.
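Because hOCR stores a bounding box for every word (in the `title` attribute of `ocrx_word` spans, e.g. `bbox 36 92 96 116`), a saved hOCR file could be turned into a table outside gImageReader once the row and column boundaries are known. A minimal sketch using only the Python standard library; `HocrWords`, `hocr_to_tsv`, and the user-supplied `col_edges` / `row_edges` are hypothetical names for illustration:

```python
import re
from bisect import bisect_right
from html.parser import HTMLParser
from typing import List

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (x_centre, y_centre, text) for every ocrx_word span."""

    def __init__(self) -> None:
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            self._bbox = tuple(map(int, match.groups())) if match else None

    def handle_data(self, data):
        if self._bbox and data.strip():
            x0, y0, x1, y1 = self._bbox
            self.words.append(((x0 + x1) / 2, (y0 + y1) / 2, data.strip()))

    def handle_endtag(self, tag):
        if tag == "span":
            self._bbox = None


def hocr_to_tsv(hocr: str, col_edges: List[int], row_edges: List[int]) -> str:
    """Assign each word to a cell by its centre point; output tab-separated rows.

    col_edges / row_edges are the sorted x / y positions of the interior
    table lines, as the user would draw them.
    """
    parser = HocrWords()
    parser.feed(hocr)
    grid = [[[] for _ in range(len(col_edges) + 1)]
            for _ in range(len(row_edges) + 1)]
    for x, y, text in parser.words:  # hOCR order is kept within each cell
        grid[bisect_right(row_edges, y)][bisect_right(col_edges, x)].append(text)
    return "\n".join("\t".join(" ".join(cell) for cell in row) for row in grid)
```

Using the word centre rather than the full box means a word that slightly overlaps a line still lands in a single cell.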
gImageReader is indeed amazing, but this issue with tables is essential for most non-trivial real-world OCR use cases.
Tesseract can't and won't be able to handle anything but very basic tables anytime soon, as even the best table algorithms can only do basic tables. Frankly, I don't even see an AI being able to handle all types of tables without user interaction (not for many years, still).
Therefore, the GUI solution is essential.
I used ABBYY FineReader for many years and it did have both a table-recognition algorithm (good for about 75% of tables), and the ability to define a table selection box, add horizontal and vertical lines to it, merge and split cells, and generate the whole table in a Word format for output. It was very good.
That is a very advanced solution, and more of a wish list for gImageReader at this point.
But, Manisandro, here is how I would approach this in gImageReader:
Phase I
Phase II
Phase III
Phase IV
Keep up the awesome work, and thank you!!!
hello this GIMAGEREADER IS AMAZING!!! i need feature:
because I want to copy the table to the clipboard, then open LibreOffice Calc and paste it, OR output it to CSV..