ocropus-archive / DUP-ocropy

Python-based tools for document analysis and OCR
Apache License 2.0

ocropus-gtedit changes certain punctuation and diacritic characters #341

Closed by helkejaa 3 years ago

helkejaa commented 3 years ago

When generating files with ocropus-gtedit html and ocropus-gtedit extract, I see that quotation marks (‘ ’ ” “ „) are changed to apostrophes and commas (' ' '' '' ,,). In addition, some combining diacritic sequences are recomposed. For example, ṣ̌ (š + ̣ = U+0161 U+0323) is changed to ṣ̌ (ṣ + ̌ = U+1E63 U+030C).
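The diacritic change described above matches what Unicode NFC normalization does to that sequence. The following is a minimal Python 3 sketch, not taken from ocropy, that reproduces the code point change with the standard library alone. Whether ocropus-gtedit actually applies NFC is an assumption here; the helper name `codepoints` is made up for illustration.

```python
import unicodedata

# Sequence from the report: š (U+0161) followed by combining dot below (U+0323).
original = "\u0161\u0323"

# NFC decomposes the text, reorders combining marks by combining class,
# then recomposes whatever precomposed characters exist.
normalized = unicodedata.normalize("NFC", original)


def codepoints(text):
    """Return the code points of text as 'U+XXXX' strings (illustrative helper)."""
    return ["U+%04X" % ord(ch) for ch in text]


print(codepoints(original))    # ['U+0161', 'U+0323']
print(codepoints(normalized))  # ['U+1E63', 'U+030C']
```

The two strings render the same but differ at the code point level, so a byte-for-byte comparison of ground truth files will report them as different.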

Expected Behavior

Ideally no characters should be modified.

Current Behavior

Quotation marks are replaced with ASCII apostrophes and commas, and some combining diacritic sequences are recomposed.

Possible Solution

Steps to Reproduce (for bugs)

  1. Create a directory test containing a text file 1.txt with the text ‘ ’ ” “ „ ṣ̌ and an empty image 1.png.
  2. ocropus-gtedit html test/1.png -o test.html
  3. firefox test.html: the text has turned into ' ' '' '' ,, ṣ̌
  4. Correct the text back to ‘ ’ ” “ „ ṣ̌ and save the file to the relevant location.
  5. ocropus-gtedit extract -O test.html
  6. On inspecting the file, the text is once again ' ' '' '' ,, ṣ̌

Your Environment

helkejaa commented 3 years ago

As I suspected, this behaviour seems to be intentional. I got the result I needed by editing https://github.com/ocropus/ocropy/blob/fe78a044691d06f769dafa5876ad552caba36c95/ocrolib/common.py#L46-L58
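The comment does not show the edit itself. As a rough Python 3 sketch of one way such a change could look, assuming the goal is to keep whitespace cleanup while leaving quotation marks and combining sequences alone, a replacement normalizer might do no more than this. The function name `normalize_whitespace_only` is hypothetical and is not part of ocrolib.

```python
import re


def normalize_whitespace_only(text):
    """Collapse whitespace runs and trim both ends.

    Hypothetical helper: it applies no Unicode normalization and no
    punctuation replacement, so every non-whitespace character passes
    through unchanged.
    """
    text = re.sub(r"\s+", " ", text)
    return text.strip()


# The test string from the report, with stray surrounding whitespace added.
sample = "  \u2018 \u2019 \u201d \u201c \u201e \u0161\u0323 \n"
cleaned = normalize_whitespace_only(sample)

# The typographic quotes and the U+0161 U+0323 sequence are preserved.
print([hex(ord(ch)) for ch in cleaned])
```

Skipping normalization this way means that visually identical strings can differ at the code point level, which may matter if the ground truth is later compared against recognizer output.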