ocropus-archive / DUP-ocropy

Python-based tools for document analysis and OCR
Apache License 2.0

Swapped text blocks position #113

Open PedroBarcha opened 8 years ago

PedroBarcha commented 8 years ago

I've noticed that OCRopus frequently swaps the positions of recognized text blocks. In the following image, for instance, "LIBRARY" and "May 27, 1993" were recognized correctly, but their position in the output is wrong: (attached image: 8501_001)

OCRopus output: UNIVERSITY OF NEVADA, LAS VEGAS 45O5 MARYLAND PARKWAY LAS VEGAS, NEVADA 89154-7001 (702) 7393286 YUCHENG LIU 5613 S EASTERN AVE LAS VEGAS NV 89119 DEAR LIU: LiBRARY May 27, 1993 We are pleased to inform you that you have been selected for a carrel assignment for the fall semester. Please come to the Circulation Desk to pick up your key and get your room assignment at the beginning of the fall semester. If you are already currently assigned to a carrel and have a key, you may be able to stay where you are. If I do need to move you for any reason, I will let you know. For instance, if your carrel happens to be one of the ones scheduled to have the locks removed, I will have to change your carrel assignment. Sincerely, --) . ----'e - AC Sheila Beard Circulation Section

PedroBarcha commented 8 years ago

Why does this happen?

Another example ("September 20, 1993" is in the wrong position): (attached image: 8520_001)

OCRopus output: MY. Kevin Grover Programmer University of Nevada Las Vegas i505 South Maryland Parkway Las Vegas, NV 89154 Dear Mr. Grover: w 0 b September 20, 1993 Do you want to hear some EgS9 news? '-, ., : sD - Twenty- three months ago we scheduled a very special person, Dr. Denis Waitley, on our calendar. If you know of him, you will probably want to skip the rest of this letter and simply call for your tickets to the November 16 University of Nevada, Las Vegas ''Lessons in Leadership'' program here in Las Vegas. If you don't yet know of Denis Waitley, look at what others have said: ''Denis Waitley has always been one step ahead of all of us. . his new program, he leapfrogs even further. . all of us. . . With . . Denis is a mentor for . . This is special stuff1''--Pat Riley, Head Coach of the New York Knicks ''Denis Waitley' s teaching transforms employees into entrepreneurs, coaches and managers into leaders, and individuals into champions. His newest program wins the gold medal as his best ever.''- -Harvey Mackay, author of How to Swim with the Sharks ''Denis Waitley is not only a top quality speaker, he is a top quality person. His message is both practical and truly inspirational.''- - Stephen R. Covey, author of The 7 Habits of Hi hl Effective Peo le and Princi le-Centered Leadershi Dr. Denis Waitley is a man with credentials that simply do not stop. A graduate of the U .S. Naval Academy, former Navy pilot, and holder of a Ph.D. in human behavior, he is best known as the author and narrator of ''The Psychology of Winning,'' the all-time best selling audio cassette album on personal and professional development. Waitley has studied and counseled organizations and individuals in every walk of life, from ''Fortune 500'' companies, NASA's astronauts, returning POWs and foreign hostages, to Super Bowl and Olympic athle tes. 
''Breakthrough research,'' ''piercing truth,'' ''in-depth insights,'' ''marvelous word pictures,'' ''cutting edge relevancy,'' ''real life examples and humor,'' ''a common sense approach to uncommon success''... these are some of the phrases that describe why Denis Waitley- -as heard on his ''The Psychology of Winning'' audiotapes- -is known as the man with ''the most listened- to voice in the world'' (outside of entertainment and media broadcasts .) You will see what we're Offce of the Dean Etended Education 4505 Maryland Parkway v Box 451039 Las Vegas, Nevada 89154-1019

zuphilip commented 8 years ago

I don't know why exactly the reading order is incorrect in your examples. The rules for the reading order can be found here https://github.com/tmbdev/ocropy/blob/master/ocrolib/psegutils.py#L122 . I will try to look closer into this...

What you can do is generate hOCR output with ocropus-hocr and then reorder the lines by the second coordinate of their bbox (the top y value).
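A rough sketch of that workaround in Python, assuming the hOCR file contains flat line elements of the form `<span class='ocr_line' title='bbox x0 y0 x1 y1'>text</span>` with no nested spans. The function name and the sample coordinates are made up for illustration; check them against the actual ocropus-hocr output before relying on this.

```python
import re
from html import unescape

# Matches any <span ...>...</span>; attributes and body are captured separately.
SPAN_RE = re.compile(r"<span\b([^>]*)>(.*?)</span>", re.DOTALL)
# hOCR bounding boxes are written as "bbox x0 y0 x1 y1" inside the title attribute.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def lines_top_to_bottom(hocr):
    """Return the text of all ocr_line spans, sorted by top edge (y0), then left edge (x0)."""
    lines = []
    for attrs, body in SPAN_RE.findall(hocr):
        if "ocr_line" not in attrs:
            continue
        match = BBOX_RE.search(attrs)
        if match is None:
            continue
        x0, y0, x1, y1 = map(int, match.groups())
        # Drop any leftover markup and decode entities such as &amp;.
        text = unescape(re.sub(r"<[^>]+>", "", body)).strip()
        lines.append((y0, x0, text))
    lines.sort()
    return [text for _, _, text in lines]


# Hypothetical coordinates: the salutation is listed first but sits lower on the page.
sample = (
    "<div class='ocr_page'>"
    "<span class='ocr_line' title='bbox 100 700 400 740'>DEAR LIU:</span>"
    "<span class='ocr_line' title='bbox 900 300 1200 340'>LIBRARY</span>"
    "<span class='ocr_line' title='bbox 900 350 1250 390'>May 27, 1993</span>"
    "</div>"
)
print("\n".join(lines_top_to_bottom(sample)))
```

A plain top-to-bottom sort like this only makes sense for single-column pages; on a real multi-column page it would interleave the columns.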

tmbdev commented 8 years ago

This probably happens due to the logic for multi-column text recognition: the page segmenter interprets your document as a multi-column document and then rearranges the text accordingly. Addressing that issue fully automatically is tricky because it depends on document genre (book vs. letter) and high-level semantic knowledge that isn't available to the page segmenter.

Since you know that these are single column documents, you can use the "--maxcolseps=0 --maxseps=0" arguments to ocropus-gpageseg to avoid the multicolumn interpretation.
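If the segmenter is driven from a script, the suggested flags could be passed along these lines. This is only a sketch: the helper name and the image path are hypothetical, and only the two flags come from the comment above.

```python
import subprocess


def gpageseg_single_column_cmd(images):
    """Build an ocropus-gpageseg command line that disables column separators."""
    return ["ocropus-gpageseg", "--maxcolseps=0", "--maxseps=0", *images]


# Hypothetical input path; replace with your own binarized page images.
cmd = gpageseg_single_column_cmd(["book/0001.bin.png"])
print(" ".join(cmd))

# To actually run it (requires ocropy to be installed and on PATH):
# subprocess.run(cmd, check=True)
```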

zuphilip commented 8 years ago

Doesn't the column recognition only influence whether a larger line is split into several smaller lines? Once all lines are detected, the reading order of the lines is calculated independently of any previously detected columns or arguments. (BTW, I think I have an improvement to the reading-order algorithm, which I would like to share and discuss as soon as it is finished.)

tmbdev commented 8 years ago

The reading order depends on the geometric relations between lines. If you split lines into multiple lines, the reading order can change because the algorithm preferentially goes down within a column first.

On Wed, Oct 12, 2016, 12:56 Philipp Zumstein notifications@github.com wrote:

Isn't splitting into lines only influencing whether to split a larger line into several smaller lines? As soon as all lines are detected then the reading order of the lines is calculated https://github.com/tmbdev/ocropy/blob/master/ocropus-gpageseg#L389 independent of any previously detected columns or arguments. (BTW I think I have an improvement to the reading-order-algorithm, which I would like to share and discuss, as soon as it is finished.)


zuphilip commented 8 years ago

Okay, I see. However, in this example the step that calculates the reading order partially fails, e.g.:

unbenannt 1

When comparing the two lines "LIBRARY" and "DEAR LIU:", the current algorithm determines that "DEAR LIU:" comes before "LIBRARY".
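To make that comparison concrete, here is a simplified Python sketch of pairwise rules of this kind: lines that overlap horizontally are ordered top to bottom, and otherwise the left line goes first unless some other line spans the gap between the two. It is an illustration with made-up boxes and names, not a copy of the code in psegutils.py, so the linked source remains the reference for the real behaviour.

```python
from typing import NamedTuple, Sequence


class Box(NamedTuple):
    """Axis-aligned line box; y grows downward, as in image coordinates."""
    x0: int
    y0: int
    x1: int
    y1: int


def x_overlaps(u: Box, v: Box) -> bool:
    return u.x0 < v.x1 and u.x1 > v.x0


def above(u: Box, v: Box) -> bool:
    return u.y0 < v.y0


def left_of(u: Box, v: Box) -> bool:
    return u.x1 < v.x0


def separates(w: Box, u: Box, v: Box) -> bool:
    """True if w lies in the vertical band covered by u and v and spans the gap between them."""
    if w.y1 < min(u.y0, v.y0):
        return False
    if w.y0 > max(u.y1, v.y1):
        return False
    return w.x0 < u.x1 and w.x1 > v.x0


def comes_before(u: Box, v: Box, lines: Sequence[Box]) -> bool:
    """Pairwise rule: does u precede v in reading order?"""
    if x_overlaps(u, v):
        return above(u, v)
    if any(separates(w, u, v) for w in lines):
        return False
    return left_of(u, v)


# Hypothetical boxes: the salutation is low on the left, the letterhead word high on the right.
dear_liu = Box(100, 700, 400, 740)
library = Box(900, 300, 1200, 340)
lines = [dear_liu, library]

print(comes_before(dear_liu, library, lines))  # True: the left line wins although it is lower
print(comes_before(library, dear_liu, lines))  # False
```

Under these assumptions, the salutation is placed first simply because it is to the left and nothing full-width lies between the two lines vertically. A full-width line between them, for example `Box(100, 500, 1200, 540)`, would act as a separator and restore the top-to-bottom order.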