wanglongqi / pdf2djvu

Automatically exported from code.google.com/p/pdf2djvu

add option for joining parts of hyphenated words #84

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Some texts contain hyphenated words (part of a word continues on the next line).
Please add removal of soft hyphens, similar to djvuocr beta rc2.4.
This is useful for full-text search: the hyphens prevent such split words from 
being found.

Original issue reported on code.google.com by mivan...@gmail.com on 27 Apr 2013 at 11:36

GoogleCodeExporter commented 9 years ago
From <http://www.djvu-soft.narod.ru/soft/djvuocr_en.htm>:

   The idea […] is to avoid the problem when a hyphenated word is split
   into two parts, and cannot be found when performing search in DJVU
   files. For example:

   "this function is int-"
   "egrable on an interval..."

   The word "integrable" cannot be found by searching, only the pieces
   of it, "int" and "egrable". The new method is to repeat the entire
   word in the OCR text, […]:

   "this function is "
   "integrable on an interval..."

I suppose we could do that, although we would have to lie about the coordinates 
of the hyphenated word.
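
A minimal sketch of the text transformation described in the quote, for illustration only. This is not code from pdf2djvu or djvuocr; the function name `join_hyphenated` and the list-of-lines input are assumptions. It moves a fragment that ends a line with a hyphen onto the following line, so the word appears whole in the text layer.

```python
def join_hyphenated(lines):
    """Move a word split by a trailing hyphen wholly onto the next line.

    `lines` is a list of text lines (hypothetical input format).
    """
    result = list(lines)
    for i in range(len(result) - 1):
        # Split off the last space-separated token of the current line.
        head, sep, fragment = result[i].rpartition(" ")
        next_line = result[i + 1]
        if len(fragment) > 1 and fragment.endswith("-") and next_line[:1].isalpha():
            # Drop the hyphen and prepend the fragment to the next line.
            result[i] = head + sep
            result[i + 1] = fragment[:-1] + next_line
    return result


print(join_hyphenated(["this function is int-", "egrable on an interval..."]))
```

Note that this naive rule cannot tell a line-break hyphen from a genuine one: a compound such as "well-known" broken after "well-" would also be joined, into "wellknown".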

Original comment by jwilk@jwilk.net on 21 Nov 2013 at 9:48

GoogleCodeExporter commented 9 years ago
Set the coordinates to the start of the hyphenated word; that would be correct.
The end may be omitted.
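
One possible reading of this suggestion, sketched under assumptions: the joined word keeps the bounding box of its first fragment, and the box of the second fragment is dropped. The `Word` type, its `bbox` layout, and `merge_split_word` are hypothetical names for illustration, not pdf2djvu data structures.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    bbox: tuple  # (x0, y0, x1, y1); hypothetical coordinate layout


def merge_split_word(first, second):
    """Join two fragments, keeping only the first fragment's bounding box."""
    # Remove the line-break hyphen from the first fragment, if present.
    stem = first.text[:-1] if first.text.endswith("-") else first.text
    return Word(stem + second.text, first.bbox)


merged = merge_split_word(
    Word("int-", (10, 20, 30, 40)),
    Word("egrable", (0, 50, 40, 70)),
)
print(merged.text, merged.bbox)
```

With this choice a search hit for the whole word would highlight only the part at the end of the first line.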

Original comment by mivan...@gmail.com on 4 Jul 2014 at 3:27