When generating and extracting files with ocropus-gtedit html and ocropus-gtedit extract, I observe that typographic quotation marks (‘ ’ ” “ „) are replaced with ASCII apostrophes and commas (' ' '' '' ,,). In addition, some combining-diacritic sequences are rewritten as a different code point sequence. For example, ṣ̌ (š + ̣ = U+0161 U+0323) is changed to ṣ̌ (ṣ + ̌ = U+1E63 U+030C).
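The diacritic change matches what Unicode NFC normalization produces. Whether ocropus-gtedit actually applies NFC is an assumption here and has not been checked against the ocropy source. The sketch below uses only the standard `unicodedata` module; the helper name `codepoints` is made up for illustration. NFC does not alter curly quotation marks, so the quote replacement would have to come from a separate step.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import unicodedata

original = "\u0161\u0323"  # s with caron + combining dot below (input text)
observed = "\u1e63\u030c"  # s with dot below + combining caron (text after round trip)

# Assumption: the tool normalizes text to NFC somewhere along the way.
normalized = unicodedata.normalize("NFC", original)


def codepoints(text):
    """Hypothetical helper: render a string as space-separated U+XXXX values."""
    return " ".join("U+%04X" % ord(ch) for ch in text)


print(codepoints(original))    # U+0161 U+0323
print(codepoints(normalized))  # U+1E63 U+030C
print(normalized == observed)  # True
```

If this is the cause, the two sequences are canonically equivalent and should render identically, but they still differ byte for byte, which matters for ground truth that is expected to round-trip unchanged.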
Expected Behavior
Ideally no characters should be modified.
Current Behavior
Characters are modified.
Possible Solution
Steps to Reproduce (for bugs)
Create a directory test containing a text file 1.txt with the text ‘ ’ ” “ „ ṣ̌ and an empty image 1.png.
ocropus-gtedit html test/1.png -o test.html
firefox test.html
The text has turned into ' ' '' '' ,, ṣ̌
Correct the text back to ‘ ’ ” “ „ ṣ̌ and save the file to the relevant location.
ocropus-gtedit extract -O test.html
Upon inspecting the extracted file, the text is once again ' ' '' '' ,, ṣ̌
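Because the two forms of the diacritic look identical on screen, comparing code points is more reliable than eyeballing the files. The following is a minimal sketch for doing that. The helper names `read_text`, `codepoints`, and `report` are hypothetical, and the two sample strings are simply the text from the steps above, not the contents of real files.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import io


def read_text(path):
    """Read a UTF-8 text file and drop trailing newlines."""
    with io.open(path, encoding="utf-8") as handle:
        return handle.read().rstrip("\n")


def codepoints(text):
    """Return the code points of a string as U+XXXX labels."""
    return ["U+%04X" % ord(ch) for ch in text]


def report(expected, actual):
    """Return True if both strings match; otherwise print both code point lists."""
    if expected == actual:
        return True
    print("expected:", " ".join(codepoints(expected)))
    print("actual:  ", " ".join(codepoints(actual)))
    return False


# Sample strings taken from the steps above (input text vs. text after the round trip).
expected = "\u2018 \u2019 \u201d \u201c \u201e \u0161\u0323"
actual = "' ' '' '' ,, \u1e63\u030c"
print(report(expected, actual))
```

To check real files, pass the results of `read_text("test/1.txt")` and `read_text` on the extracted file to `report`.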
Your Environment
Python version: Python 2.7.18
Git revision of ocropy: fe78a044691d06f769dafa5876ad552caba36c95
Operating System and version: Linux 5.14.2-arch1-2