koreader / koreader

An ebook reader application supporting PDF, DjVu, EPUB, FB2 and many more formats, running on Cervantes, Kindle, Kobo, PocketBook and Android devices
http://koreader.rocks/
GNU Affero General Public License v3.0

Special characters are shown mistakenly in some .txt files #7000

Open fxsmthn opened 3 years ago

fxsmthn commented 3 years ago

Issue

Greetings, I noticed that some files in .txt format do not show special characters such as äöå correctly. Instead, those letters are replaced with what appear to be Cyrillic letters, as seen in this screenshot:

annaliisa-doesntwork

Steps to reproduce

The file can be downloaded here (it is public domain, so no copyright infringement worries): http://www.lonnrot.net/kirjat/0062.zip

The special characters are shown correctly by another open source reader for Android, Librera PRO and it is also shown correctly if opened in a browser (I tested using F-droid fennec) or the built-in Android HTML viewer or using Pocketbook's default reader (although Pocketbook's default reader has some other issues with the book, namely, HUGE font for some reason, but it does show the äö-characters correctly at least).

You can copy paste the entire book into Google Docs and export as a txt-file again, then the letters will show correctly in KOReader (exporting as .epub also works), but I'd like to know what causes the issue and if it is perhaps just some setting that I have set incorrectly in KOReader. You can also use EbookConverter app from Play store.

Here is the same book after exporting it through Docs:

annaliisa-works

Another type of .txt file that shares the issue is an article exported as "Generic / text only" using the built-in Windows feature, as described here: https://dfarq.homeip.net/how-to-add-a-generic-printer-in-windows-10

Here is an example article with special characters: https://perustuslakiblogi.wordpress.com/2020/12/13/lauri-koskentausta-lyhyt-katsaus-suomen-perustuslain-valvontajarjestelman-historiaan-miten-nykymalliin-on-tultu/

And here is how it shows in KOReader after being "printed":

plb-doesntwork1

And here is the converted text:

plb-works

Here too, the extra step through Docs, the EbookConverter app etc. produces text that displays correctly.

Apologies if this has been mentioned already (I tried to check both open and closed issues and didn't see this specific issue mentioned yet).

I suppose, since there is a workaround, this is not a huge priority, but it would be nice if it is an easy fix for the next release! :-)

crash.log (if applicable)

I'm not sure if crashlog is applicable, since it doesn't crash, it just shows those characters as incorrect, but if crashlog is needed, I'll try to provide one.

NiLuJe commented 3 years ago

It's not in UTF-8, it's in Latin1 (AKA ISO-8859-1).

I'm not sure there's anything in place to detect that in TXT, because there's no metadata, so the safe & sane default in 2020 (and, like, in the last 20 years if we're being honest ;p) is to assume UTF-8 when in doubt ;).

poire-z commented 3 years ago

Just tested your 0062.txt, and I got the same result you did. That's the thing with "txt" files: they can't tell which encoding the text is in.

So, your 0062.txt is actually in one of the "latin" encodings (1 byte per character), and we can't really know which one of them (https://en.wikipedia.org/wiki/ISO/IEC_8859#The_parts_of_ISO/IEC_8859) it is. It's possible that crengine (our EPUB/HTML/TXT renderer), being of Russian origin, picks the latin/cyrillic one (as it has to select how to interpret a byte), while your text may be in the latin/nordic one. We don't have a way to select one from the UI.
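To make the ambiguity concrete, here is a minimal Python sketch (not KOReader or crengine code) that decodes the same three bytes under two single-byte encodings. The helper name `decode_as` is made up for illustration, and cp1251 is used only because it is the codepage that shows up in the detection logs later in this thread.

```python
# The three letters from the report, stored one byte each as in a Latin-1 file.
raw = "äöå".encode("latin-1")


def decode_as(data: bytes, encoding: str) -> str:
    """Decode the same bytes under a caller-chosen single-byte encoding."""
    return data.decode(encoding)


print(raw.hex(" "))               # e4 f6 e5
print(decode_as(raw, "latin-1"))  # äöå
print(decode_as(raw, "cp1251"))   # дце
```

The bytes never change; only the table used to interpret them does, which is why the screenshot shows Cyrillic letters in place of äöå.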

If you convert it to UTF-8 (1 or more bytes per character, but with no ambiguity), it will render correctly in KOReader. You can do this conversion in any good text editor, or on Linux: $ iconv -f latin1 -t utf8 0062.txt > 0062utf8.txt
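For systems without iconv, the same conversion can be sketched in a few lines of Python. This assumes you already know the source encoding (Latin-1 here); the function name `convert_to_utf8` and the file names are only illustrative.

```python
def convert_to_utf8(src: str, dst: str, source_encoding: str = "latin-1") -> None:
    """Re-encode a text file to UTF-8, roughly like `iconv -f latin1 -t utf8`."""
    # newline="" disables newline translation so line endings are preserved.
    with open(src, "r", encoding=source_encoding, newline="") as fin:
        text = fin.read()
    with open(dst, "w", encoding="utf-8", newline="") as fout:
        fout.write(text)
```

Calling `convert_to_utf8("0062.txt", "0062utf8.txt")` would mirror the iconv command above. Note that decoding as Latin-1 never fails, so a wrong guess about the source encoding yields wrong characters rather than an error.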

Some tools may be able to auto-select an encoding from the UI language or other language detection heuristics, and maybe crengine has some way to do that and fails... But nobody really cares enough about txt files to investigate that. I'd say we're actually pretty lucky that it works fine with UTF-8 :) which is the encoding most content made today should be in.

NiLuJe commented 3 years ago

Yeah, disambiguating character sets from thin air is not for the faint of heart, c.f., https://github.com/chardet/chardet for a venerable Python solution ;).

fxsmthn commented 3 years ago

Alright, thanks for the thorough and informative replies, I now learned a lot more about encoding. 👍

I found out it is very trivial to convert the encoding used by the text file with Notepad++ on the Windows side (it shows the original book as just "ANSI" and converting to UTF-8 or UTF-8-BOM from the encoding settings menu creates a file that properly shows special characters in KOReader) and I'll keep that Linux tip in mind also. In fact, it is so easy, I'm kind of disappointed I didn't figure this out myself. A lot faster than running all these ANSI text files through Google Docs, that is for certain. :-D

On the Android side, I was able to find an app called "Text Encoding Converter" from the Play store that is fast and works like a charm. I have a feeling the Linux trick mentioned by poire-z might be possible on Android using Termux or some sufficiently feature complete text editor, but I wasn't too lucky with those that I tried.

I suppose this issue can be closed now and perhaps tagged as "wontfix" or "cantfix"? Since, as you guys have explained, it would be quite difficult to support these latin1/ANSI-encoded text files with crengine and not worth the time considering how rare they appear to be compared to UTF-8 and how trivial it is to convert said files to UTF-8.

poire-z commented 3 years ago

Thank you for your (good) understanding :) Let's keep it open - low priority, but maybe we should at least try to have it use latin1 instead of the latin-cyrillic it seems to use (as it seems it manages to detect it's not UTF-8).
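The idea of "UTF-8 first, latin1 when that fails" can be sketched in Python as below. This is an illustration of the policy being discussed, not what crengine does; `decode_text` and its `fallback` parameter are hypothetical names.

```python
def decode_text(data: bytes, fallback: str = "latin-1") -> str:
    """Decode as strict UTF-8; if the bytes are not valid UTF-8, use the fallback."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this branch cannot fail.
        return data.decode(fallback)


print(decode_text("äöå".encode("utf-8")))    # äöå
print(decode_text("äöå".encode("latin-1")))  # äöå
```

Strict UTF-8 validation rejects most legacy single-byte text, so the check makes a reasonable first gate, but the fallback is still a guess: a cp1251 file would come out as accented Latin letters instead.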

Frenzie commented 3 years ago

You can do this conversion in any good text editor, or on Linux: $ iconv -f latin1 -t utf8 0062.txt > 0062utf8.txt

Fwiw, we already build libiconv.

poire-z commented 3 years ago

It turns out crengine really does try to detect the encoding of text files when they are not 7-bit ASCII and not valid UTF-8: https://github.com/koreader/crengine/blob/8254cef9d99238f92ebe0a534f41a66d2e631b7c/crengine/src/crtxtenc.cpp#L2013-L2047

With one of my French latin1 texts:

codepage try 0     cp1251   ru : 0.007583 0.001301 0.001212 - 0.006848 0.000102 0.000102  :  0.210203
codepage try 1     cp1251   ru : 0.007230 0.002519 0.001055 - 0.006588 0.000082 0.000082  :  0.188432
codepage try 2     cp1251   ru : 0.006519 0.011559 0.001015 - 0.006879 0.000549 0.000094  :  0.193519
codepage try 3      cp866   ru : 0.007665 0.000533 0.000443 - 0.007051 0.000023 0.000023  :  0.069538
codepage try 4      koi8r   ru : 0.007767 0.000111 0.000021 - 0.007089 0.000000 0.000000  :  0.002819
codepage try 5      koi8r   ru : 0.007776 0.000037 0.000034 - 0.007152 0.000004 0.000004  :  0.006282
codepage try 6       utf8   ru : 0.007794 0.000045 0.000005 - 0.007224 0.000004 0.000000  :  0.000624
codepage try 7     cp1251   ru : 0.007582 0.001234 0.001207 - 0.006846 0.000101 0.000101  :  0.209224
codepage try 8      cp866   ru : 0.007669 0.000474 0.000447 - 0.007069 0.000018 0.000018  :  0.068089
codepage try 9      koi8r   ru : 0.007771 0.000056 0.000029 - 0.006998 0.000001 0.000001  :  0.004515
codepage try 10     cp1251   bg : 0.007616 0.001501 0.001500 - 0.006978 0.000231 0.000231  :  0.300361
codepage try 11      cp866   bg : 0.007635 0.000564 0.000564 - 0.007122 0.000023 0.000023  :  0.085920
codepage try 12      koi8r   bg : 0.007737 0.000059 0.000058 - 0.007161 0.000001 0.000001  :  0.008250
codepage try 13     cp1250   cs : 0.002567 0.048800 0.000240 - 0.004521 0.004375 0.000016  :  0.081272
codepage try 14     cp1250   pl : 0.003036 0.044535 0.000030 - 0.004825 0.003264 0.000000  :  0.007741
codepage try 15     cp1250   pl : 0.002934 0.046146 0.000028 - 0.004661 0.003567 0.000000  :  0.007333
codepage try 16     cp1252   fr : 0.000504 0.066915 0.000356 - 0.001143 0.010463 0.000062  :  0.657102
codepage try 17     cp1252   fr : 0.000492 0.066632 0.000377 - 0.001216 0.010334 0.000075  :  0.705874
codepage try 18      cp850   fr : 0.000707 0.066559 0.000000 - 0.001239 0.010401 0.000000  :  0.000000
codepage try 19      cp850   fr : 0.000492 0.066632 0.000377 - 0.001216 0.010334 0.000075  :  0.705874
codepage try 20     cp1252   de : 0.001921 0.064524 0.000000 - 0.003881 0.006752 0.000000  :  0.000118
codepage try 21     cp1252   de : 0.001892 0.063735 0.000000 - 0.003721 0.006268 0.000000  :  0.000054
codepage try 22      cp850   de : 0.001923 0.064524 0.000000 - 0.003825 0.006751 0.000000  :  0.000000
codepage try 23      cp850   de : 0.001893 0.063749 0.000000 - 0.003660 0.006269 0.000000  :  0.000000
codepage try 24     cp1252   es : 0.001602 0.062368 0.000082 - 0.002985 0.007630 0.000006  :  0.043039
codepage try 25     cp1252   es : 0.001773 0.060909 0.000072 - 0.003083 0.007255 0.000003  :  0.032900
codepage try 26      cp850   es : 0.001633 0.062291 0.000000 - 0.002916 0.007624 0.000000  :  0.000000
codepage try 27      cp850   es : 0.001801 0.060844 0.000000 - 0.002982 0.007255 0.000000  :  0.000000
codepage try 28       utf8   ee : 0.002231 0.057928 0.000001 - 0.004363 0.005203 0.000000  :  0.000201
codepage try 29      cp775   ee : 0.001751 0.059954 0.000000 - 0.003660 0.005699 0.000000  :  0.000012
codepage try 30     cp1253   gr : 0.007158 0.005451 0.001618 - 0.006655 0.000441 0.000065  :  0.262342
codepage try 31      cp737   gr : 0.007302 0.004040 0.000194 - 0.006818 0.000378 0.000011  :  0.032069
codepage try 32     cp1257  lit : 0.002352 0.051058 0.000043 - 0.004078 0.004740 0.000000  :  0.013241
codepage try 33     cp1257  lat : 0.002449 0.052704 0.000052 - 0.004119 0.005096 0.000018  :  0.032533
codepage try 34     cp1250   sr : 0.002337 0.055806 0.000064 - 0.004064 0.005205 0.000003  :  0.022398
codepage try 35     cp1252   it : 0.001576 0.062724 0.000055 - 0.002924 0.007239 0.000012  :  0.040327
codepage try 36     cp1254   tr : 0.002801 0.048634 0.000003 - 0.004249 0.003962 0.000000  :  0.000930
codepage try 37        gbk   zh : 0.007401 0.001518 0.000155 - 0.005341 0.000002 0.000002  :  0.025208
codepage try 38        gbk   zh : 0.007189 0.002829 0.000231 - 0.005188 0.000093 0.000001  :  0.037938
codepage try 39  shift_jis   ja : 0.006790 0.005554 0.000381 - 0.005963 0.000225 0.000000  :  0.059685
codepage try 40  shift_jis   ja : 0.006378 0.009013 0.000343 - 0.005927 0.000814 0.000000  :  0.055746
codepage try 41      eucjp   ja : 0.007252 0.003281 0.000228 - 0.006042 0.000234 0.000000  :  0.034374
codepage try 42      eucjp   ja : 0.006704 0.007714 0.000200 - 0.005986 0.000811 0.000000  :  0.031451
codepage try 43       big5   ja : 0.006258 0.009917 0.000132 - 0.005062 0.000672 0.000000  :  0.023398
codepage try 44       big5   zh : 0.006968 0.004042 0.000077 - 0.005535 0.000000 0.000000  :  0.012238
codepage try 45     euc_kr   ko : 0.007662 0.000250 0.000250 - 0.006558 0.000005 0.000005  :  0.037062
codepage try 46     cp1252   de : 0.001845 0.064983 0.000000 - 0.003702 0.006801 0.000000  :  0.000030
codepage try 47       utf8   de : 0.001896 0.063546 0.000000 - 0.003663 0.006672 0.000000  :  0.000030
codepage try 48 iso8859-16   ro : 0.001684 0.057592 0.000008 - 0.003500 0.006448 0.000000  :  0.003207
codepage try 49     cp1250   ro : 0.001738 0.056600 0.000008 - 0.003547 0.006294 0.000000  :  0.003098
codepage try 50       utf8   ro : 0.001993 0.052606 0.000003 - 0.003663 0.005891 0.000000  :  0.001136
codepage try 51     cp1250   cz : 0.002625 0.046620 0.000247 - 0.004174 0.004019 0.000019  :  0.089789
codepage try 52     cp1250   hu : 0.002481 0.051299 0.000606 - 0.003838 0.004453 0.000049  :  0.238483
codepage try 53     cp1250   hu : 0.002166 0.052004 0.000477 - 0.003561 0.004860 0.000033  :  0.201680
Detected codepage:cp1252 lang:fr index:17

With some German text (converted from utf8 to latin1) :/

codepage try 0     cp1251   ru : 0.007743 0.000199 0.000102 - 0.007199 0.000000 0.000000  :  0.013710
codepage try 1     cp1251   ru : 0.007270 0.001630 0.000085 - 0.006918 0.000000 0.000000  :  0.011977
codepage try 2     cp1251   ru : 0.006893 0.009003 0.000086 - 0.007247 0.000202 0.000000  :  0.012167
codepage try 3      cp866   ru : 0.007789 0.000098 0.000001 - 0.007254 0.000000 0.000000  :  0.000085
codepage try 4      koi8r   ru : 0.007777 0.000103 0.000006 - 0.007254 0.000000 0.000000  :  0.000853
codepage try 5      koi8r   ru : 0.007797 0.000004 0.000002 - 0.007333 0.000000 0.000000  :  0.000225
codepage try 6       utf8   ru : 0.007801 0.000045 0.000000 - 0.007462 0.000002 0.000000  :  0.000000
codepage try 7     cp1251   ru : 0.007749 0.000130 0.000102 - 0.007200 0.000000 0.000000  :  0.013656
codepage try 8      cp866   ru : 0.007795 0.000029 0.000001 - 0.007265 0.000000 0.000000  :  0.000171
codepage try 9      koi8r   ru : 0.007788 0.000032 0.000004 - 0.007262 0.000000 0.000000  :  0.000538
codepage try 10     cp1251   bg : 0.007776 0.000039 0.000039 - 0.007299 0.000000 0.000000  :  0.005180
codepage try 11      cp866   bg : 0.007799 0.000002 0.000002 - 0.007331 0.000000 0.000000  :  0.000266
codepage try 12      koi8r   bg : 0.007799 0.000003 0.000003 - 0.007331 0.000000 0.000000  :  0.000332
codepage try 13     cp1250   cs : 0.003077 0.048785 0.000000 - 0.005240 0.003473 0.000000  :  0.000000
codepage try 14     cp1250   pl : 0.003085 0.045893 0.000000 - 0.005160 0.002956 0.000000  :  0.000000
codepage try 15     cp1250   pl : 0.003029 0.047761 0.000000 - 0.005060 0.003125 0.000000  :  0.000000
codepage try 16     cp1252   fr : 0.002329 0.067357 0.000000 - 0.004403 0.006762 0.000000  :  0.000000
codepage try 17     cp1252   fr : 0.002293 0.067026 0.000000 - 0.004328 0.006853 0.000000  :  0.000000
codepage try 18      cp850   fr : 0.002329 0.067357 0.000000 - 0.004392 0.006762 0.000000  :  0.000000
codepage try 19      cp850   fr : 0.002293 0.067026 0.000000 - 0.004328 0.006853 0.000000  :  0.000000
codepage try 20     cp1252   de : 0.001092 0.075341 0.000036 - 0.002595 0.010601 0.000004  :  0.025788
codepage try 21     cp1252   de : 0.001032 0.074184 0.000034 - 0.002420 0.010088 0.000003  :  0.024430
codepage try 22      cp850   de : 0.001144 0.075305 0.000000 - 0.002600 0.010597 0.000000  :  0.000000
codepage try 23      cp850   de : 0.001083 0.074168 0.000000 - 0.002419 0.010089 0.000000  :  0.000000
codepage try 24     cp1252   es : 0.002561 0.061894 0.000000 - 0.004568 0.005941 0.000000  :  0.000033
codepage try 25     cp1252   es : 0.002687 0.060089 0.000000 - 0.004601 0.005795 0.000000  :  0.000032
codepage try 26      cp850   es : 0.002561 0.061901 0.000000 - 0.004569 0.005940 0.000000  :  0.000000
codepage try 27      cp850   es : 0.002687 0.060094 0.000000 - 0.004506 0.005796 0.000000  :  0.000000
codepage try 28       utf8   ee : 0.003117 0.056159 0.000000 - 0.005091 0.003884 0.000000  :  0.000000
codepage try 29      cp775   ee : 0.002540 0.059604 0.000007 - 0.004630 0.004544 0.000000  :  0.001841
codepage try 30     cp1253   gr : 0.007330 0.004040 0.000092 - 0.006919 0.000274 0.000000  :  0.012964
codepage try 31      cp737   gr : 0.007380 0.003960 0.000000 - 0.007008 0.000274 0.000000  :  0.000000
codepage try 32     cp1257  lit : 0.003054 0.047792 0.000000 - 0.005144 0.003537 0.000000  :  0.000000
codepage try 33     cp1257  lat : 0.003414 0.047560 0.000000 - 0.005416 0.003154 0.000000  :  0.000000
codepage try 34     cp1250   sr : 0.002675 0.055005 0.000000 - 0.005114 0.003727 0.000000  :  0.000000
codepage try 35     cp1252   it : 0.002418 0.062492 0.000000 - 0.004625 0.005663 0.000000  :  0.000000
codepage try 36     cp1254   tr : 0.002616 0.053040 0.000111 - 0.004494 0.005320 0.000004  :  0.034468
codepage try 37        gbk   zh : 0.007449 0.001324 0.000023 - 0.005531 0.000000 0.000000  :  0.003586
codepage try 38        gbk   zh : 0.007277 0.002649 0.000018 - 0.005369 0.000052 0.000000  :  0.002895
codepage try 39  shift_jis   ja : 0.006939 0.004915 0.000009 - 0.006147 0.000200 0.000000  :  0.001338
codepage try 40  shift_jis   ja : 0.006498 0.008778 0.000008 - 0.006119 0.000736 0.000000  :  0.001259
codepage try 41      eucjp   ja : 0.007371 0.003166 0.000019 - 0.006220 0.000231 0.000000  :  0.002828
codepage try 42      eucjp   ja : 0.006804 0.007761 0.000011 - 0.006173 0.000730 0.000000  :  0.001663
codepage try 43       big5   ja : 0.006171 0.010296 0.000015 - 0.005230 0.000600 0.000000  :  0.002717
codepage try 44       big5   zh : 0.006817 0.003816 0.000011 - 0.005715 0.000000 0.000000  :  0.001761
codepage try 45     euc_kr   ko : 0.007756 0.000044 0.000043 - 0.006755 0.000000 0.000000  :  0.005975
codepage try 46     cp1252   de : 0.001043 0.075470 0.000035 - 0.002467 0.010261 0.000003  :  0.024360
codepage try 47       utf8   de : 0.001178 0.073765 0.000000 - 0.002544 0.010064 0.000000  :  0.000000
codepage try 48 iso8859-16   ro : 0.002327 0.058578 0.000000 - 0.004434 0.005517 0.000000  :  0.000000
codepage try 49     cp1250   ro : 0.002375 0.057568 0.000000 - 0.004450 0.005393 0.000000  :  0.000000
codepage try 50       utf8   ro : 0.002577 0.053511 0.000000 - 0.004591 0.005049 0.000000  :  0.000000
codepage try 51     cp1250   cz : 0.003098 0.045919 0.000000 - 0.004869 0.003255 0.000000  :  0.000000
codepage try 52     cp1250   hu : 0.002978 0.053120 0.000048 - 0.004432 0.004559 0.000008  :  0.019075
codepage try 53     cp1250   hu : 0.002506 0.053803 0.000032 - 0.004153 0.004763 0.000004  :  0.013512
Detected codepage:cp1254 lang:tr index:36

With some English text containing a single accented word, système:

codepage try 0     cp1251   ru : 0.007794 0.000075 0.000000 - 0.004180 0.000000 0.000000  :  0.000000
codepage try 1     cp1251   ru : 0.007246 0.001570 0.000000 - 0.004150 0.000000 0.000000  :  0.000000
codepage try 2     cp1251   ru : 0.006632 0.009085 0.000000 - 0.004715 0.000410 0.000000  :  0.000000
codepage try 3      cp866   ru : 0.007793 0.000075 0.000000 - 0.004162 0.000000 0.000000  :  0.000000
codepage try 4      koi8r   ru : 0.007793 0.000075 0.000000 - 0.004162 0.000000 0.000000  :  0.000000
codepage try 5      koi8r   ru : 0.007802 0.000003 0.000000 - 0.004182 0.000000 0.000000  :  0.000000
codepage try 6       utf8   ru : 0.007800 0.000032 0.000000 - 0.003773 0.000002 0.000000  :  0.000000
codepage try 7     cp1251   ru : 0.007800 0.000024 0.000000 - 0.004160 0.000000 0.000000  :  0.000000
codepage try 8      cp866   ru : 0.007800 0.000024 0.000000 - 0.004146 0.000000 0.000000  :  0.000000
codepage try 9      koi8r   ru : 0.007800 0.000024 0.000000 - 0.004149 0.000000 0.000000  :  0.000000
codepage try 10     cp1251   bg : 0.007804 0.000000 0.000000 - 0.004225 0.000000 0.000000  :  0.000000
codepage try 11      cp866   bg : 0.007804 0.000000 0.000000 - 0.004228 0.000000 0.000000  :  0.000000
codepage try 12      koi8r   bg : 0.007804 0.000000 0.000000 - 0.004231 0.000000 0.000000  :  0.000000
codepage try 13     cp1250   cs : 0.002194 0.046076 0.000000 - 0.003874 0.003557 0.000000  :  0.000000
codepage try 14     cp1250   pl : 0.002295 0.044548 0.000000 - 0.003812 0.002989 0.000000  :  0.000000
codepage try 15     cp1250   pl : 0.002176 0.045821 0.000000 - 0.003721 0.003177 0.000000  :  0.000000
codepage try 16     cp1252   fr : 0.001314 0.058759 0.000000 - 0.002842 0.006764 0.000000  :  0.000000
codepage try 17     cp1252   fr : 0.001289 0.058814 0.000000 - 0.002726 0.006808 0.000000  :  0.000000
codepage try 18      cp850   fr : 0.001314 0.058759 0.000000 - 0.002842 0.006764 0.000000  :  0.000000
codepage try 19      cp850   fr : 0.001289 0.058814 0.000000 - 0.002726 0.006808 0.000000  :  0.000000
codepage try 20     cp1252   de : 0.001754 0.057653 0.000000 - 0.003420 0.005718 0.000000  :  0.000000
codepage try 21     cp1252   de : 0.001633 0.057044 0.000000 - 0.003169 0.005386 0.000000  :  0.000000
codepage try 22      cp850   de : 0.001754 0.057653 0.000000 - 0.003411 0.005719 0.000000  :  0.000000
codepage try 23      cp850   de : 0.001632 0.057056 0.000000 - 0.003174 0.005384 0.000000  :  0.000000
codepage try 24     cp1252   es : 0.001410 0.057898 0.000000 - 0.003017 0.005842 0.000000  :  0.000000
codepage try 25     cp1252   es : 0.001489 0.056810 0.000000 - 0.002972 0.005636 0.000000  :  0.000000
codepage try 26      cp850   es : 0.001410 0.057902 0.000000 - 0.003020 0.005841 0.000000  :  0.000000
codepage try 27      cp850   es : 0.001489 0.056818 0.000000 - 0.002972 0.005637 0.000000  :  0.000000
codepage try 28       utf8   ee : 0.002159 0.053679 0.000000 - 0.003767 0.004415 0.000000  :  0.000000
codepage try 29      cp775   ee : 0.001605 0.055406 0.000000 - 0.003276 0.004800 0.000000  :  0.000000
codepage try 30     cp1253   gr : 0.007371 0.003378 0.000000 - 0.004029 0.000272 0.000000  :  0.000000
codepage try 31      cp737   gr : 0.007368 0.003390 0.000000 - 0.004027 0.000270 0.000000  :  0.000000
codepage try 32     cp1257  lit : 0.002418 0.047710 0.000000 - 0.003797 0.003632 0.000000  :  0.000000
codepage try 33     cp1257  lat : 0.002435 0.050796 0.000000 - 0.003835 0.004138 0.000000  :  0.000000
codepage try 34     cp1250   sr : 0.001935 0.053325 0.000000 - 0.003718 0.004193 0.000000  :  0.000000
codepage try 35     cp1252   it : 0.001198 0.060010 0.000000 - 0.002844 0.006042 0.000000  :  0.000000
codepage try 36     cp1254   tr : 0.002501 0.044686 0.000000 - 0.003829 0.003548 0.000000  :  0.000000
codepage try 37        gbk   zh : 0.007420 0.001461 0.000000 - 0.003623 0.000000 0.000000  :  0.000000
codepage try 38        gbk   zh : 0.007247 0.002702 0.000000 - 0.003590 0.000082 0.000000  :  0.000000
codepage try 39  shift_jis   ja : 0.006882 0.005086 0.000000 - 0.003738 0.000238 0.000000  :  0.000000
codepage try 40  shift_jis   ja : 0.006437 0.008548 0.000000 - 0.003443 0.000887 0.000000  :  0.000000
codepage try 41      eucjp   ja : 0.007390 0.002962 0.000000 - 0.003526 0.000242 0.000000  :  0.000000
codepage try 42      eucjp   ja : 0.006780 0.007335 0.000000 - 0.003267 0.000884 0.000000  :  0.000000
codepage try 43       big5   ja : 0.006132 0.009419 0.000000 - 0.003749 0.000715 0.000000  :  0.000000
codepage try 44       big5   zh : 0.006839 0.004032 0.000000 - 0.004109 0.000001 0.000000  :  0.000000
codepage try 45     euc_kr   ko : 0.007801 0.000000 0.000000 - 0.004151 0.000000 0.000000  :  0.000000
codepage try 46     cp1252   de : 0.001620 0.058085 0.000000 - 0.003134 0.005787 0.000000  :  0.000000
codepage try 47       utf8   de : 0.001677 0.056800 0.000000 - 0.003135 0.005675 0.000000  :  0.000000
codepage try 48 iso8859-16   ro : 0.001262 0.054260 0.000000 - 0.003049 0.005328 0.000000  :  0.000000
codepage try 49     cp1250   ro : 0.001344 0.053324 0.000000 - 0.003065 0.005181 0.000000  :  0.000000
codepage try 50       utf8   ro : 0.001698 0.049567 0.000000 - 0.002993 0.004847 0.000000  :  0.000000
codepage try 51     cp1250   cz : 0.002115 0.044861 0.000000 - 0.003513 0.003487 0.000000  :  0.000000
codepage try 52     cp1250   hu : 0.002147 0.048202 0.000000 - 0.003319 0.003890 0.000000  :  0.000000
codepage try 53     cp1250   hu : 0.001564 0.049597 0.000000 - 0.002651 0.004676 0.000000  :  0.000000
Detected codepage:cp1251 lang:ru index:0

With the 0062.txt from this issue:

codepage try 0     cp1251   ru : 0.007533 0.002000 0.001937 - 0.006846 0.000036 0.000036  :  0.284523
codepage try 1     cp1251   ru : 0.007110 0.003175 0.001655 - 0.006592 0.000043 0.000043  :  0.260387
codepage try 2     cp1251   ru : 0.006620 0.010313 0.001537 - 0.006921 0.000274 0.000025  :  0.237944
codepage try 3      cp866   ru : 0.007789 0.000114 0.000051 - 0.007081 0.000000 0.000000  :  0.006803
codepage try 4      koi8r   ru : 0.007770 0.000221 0.000158 - 0.007081 0.000000 0.000000  :  0.021252
codepage try 5      koi8r   ru : 0.007791 0.000068 0.000064 - 0.007161 0.000000 0.000000  :  0.008524
codepage try 6       utf8   ru : 0.007801 0.000027 0.000000 - 0.007080 0.000001 0.000000  :  0.000000
codepage try 7     cp1251   ru : 0.007547 0.001883 0.001864 - 0.006844 0.000023 0.000023  :  0.268619
codepage try 8      cp866   ru : 0.007789 0.000121 0.000101 - 0.007092 0.000000 0.000000  :  0.013595
codepage try 9      koi8r   ru : 0.007778 0.000161 0.000141 - 0.006880 0.000000 0.000000  :  0.019198
codepage try 10     cp1251   bg : 0.007502 0.002280 0.002280 - 0.006956 0.000059 0.000059  :  0.339644
codepage try 11      cp866   bg : 0.007786 0.000159 0.000159 - 0.007159 0.000000 0.000000  :  0.021229
codepage try 12      koi8r   bg : 0.007774 0.000227 0.000227 - 0.007159 0.000000 0.000000  :  0.030393
codepage try 13     cp1250   cs : 0.002825 0.043195 0.000000 - 0.004732 0.003553 0.000000  :  0.000000
codepage try 14     cp1250   pl : 0.003327 0.040429 0.000000 - 0.004887 0.002688 0.000000  :  0.000000
codepage try 15     cp1250   pl : 0.003265 0.041570 0.000000 - 0.004695 0.002888 0.000000  :  0.000000
codepage try 16     cp1252   fr : 0.002789 0.051152 0.000000 - 0.004339 0.004542 0.000000  :  0.000000
codepage try 17     cp1252   fr : 0.002735 0.051449 0.000000 - 0.004365 0.004504 0.000000  :  0.000000
codepage try 18      cp850   fr : 0.002789 0.051152 0.000000 - 0.004124 0.004542 0.000000  :  0.000000
codepage try 19      cp850   fr : 0.002735 0.051449 0.000000 - 0.004365 0.004504 0.000000  :  0.000000
codepage try 20     cp1252   de : 0.002951 0.049871 0.000305 - 0.004459 0.004860 0.000008  :  0.088634
codepage try 21     cp1252   de : 0.002854 0.049772 0.000326 - 0.004309 0.004705 0.000007  :  0.096790
codepage try 22      cp850   de : 0.003009 0.049566 0.000000 - 0.004287 0.004853 0.000000  :  0.000000
codepage try 23      cp850   de : 0.002911 0.049456 0.000000 - 0.004120 0.004697 0.000000  :  0.000000
codepage try 24     cp1252   es : 0.002969 0.049779 0.000000 - 0.004543 0.004876 0.000000  :  0.000000
codepage try 25     cp1252   es : 0.003104 0.047908 0.000000 - 0.004687 0.004444 0.000000  :  0.000000
codepage try 26      cp850   es : 0.002969 0.049782 0.000000 - 0.004331 0.004876 0.000000  :  0.000000
codepage try 27      cp850   es : 0.003103 0.047915 0.000000 - 0.004578 0.004445 0.000000  :  0.000000
codepage try 28       utf8   ee : 0.002260 0.052438 0.000000 - 0.003465 0.004848 0.000000  :  0.000000
codepage try 29      cp775   ee : 0.001981 0.054261 0.000501 - 0.003508 0.004527 0.000014  :  0.198422
codepage try 30     cp1253   gr : 0.007271 0.003444 0.000908 - 0.006642 0.000131 0.000000  :  0.130477
codepage try 31      cp737   gr : 0.007406 0.002544 0.000000 - 0.006854 0.000125 0.000000  :  0.000000
codepage try 32     cp1257  lit : 0.002558 0.049179 0.000000 - 0.004491 0.003439 0.000000  :  0.000000
codepage try 33     cp1257  lat : 0.002437 0.052761 0.000000 - 0.004227 0.003742 0.000000  :  0.000000
codepage try 34     cp1250   sr : 0.002723 0.050294 0.000000 - 0.004362 0.004352 0.000000  :  0.000000
codepage try 35     cp1252   it : 0.002783 0.052334 0.000000 - 0.004402 0.005143 0.000000  :  0.000000
codepage try 36     cp1254   tr : 0.002730 0.044586 0.000040 - 0.004195 0.003995 0.000000  :  0.011567
codepage try 37        gbk   zh : 0.007451 0.001347 0.000253 - 0.005340 0.000000 0.000000  :  0.039594
codepage try 38        gbk   zh : 0.007306 0.002423 0.000336 - 0.005184 0.000027 0.000000  :  0.053756
codepage try 39  shift_jis   ja : 0.006897 0.004654 0.000087 - 0.005963 0.000118 0.000000  :  0.013609
codepage try 40  shift_jis   ja : 0.006545 0.007485 0.000104 - 0.006039 0.000437 0.000000  :  0.016473
codepage try 41      eucjp   ja : 0.007359 0.002937 0.000447 - 0.006023 0.000125 0.000000  :  0.066870
codepage try 42      eucjp   ja : 0.006861 0.006427 0.000271 - 0.006089 0.000432 0.000000  :  0.041809
codepage try 43       big5   ja : 0.006093 0.009377 0.000466 - 0.005148 0.000352 0.000000  :  0.082888
codepage try 44       big5   zh : 0.006696 0.004541 0.000086 - 0.005521 0.000007 0.000000  :  0.014153
codepage try 45     euc_kr   ko : 0.007756 0.000149 0.000149 - 0.006520 0.000005 0.000005  :  0.023024
codepage try 46     cp1252   de : 0.002873 0.050610 0.000369 - 0.004353 0.004796 0.000007  :  0.108167
codepage try 47       utf8   de : 0.003011 0.049131 0.000000 - 0.004195 0.004699 0.000000  :  0.000000
codepage try 48 iso8859-16   ro : 0.002613 0.048637 0.000000 - 0.004675 0.004003 0.000000  :  0.000000
codepage try 49     cp1250   ro : 0.002666 0.047795 0.000000 - 0.004695 0.003881 0.000000  :  0.000000
codepage try 50       utf8   ro : 0.002906 0.044458 0.000027 - 0.004593 0.003627 0.000000  :  0.007299
codepage try 51     cp1250   cz : 0.002980 0.040700 0.000000 - 0.004182 0.003364 0.000000  :  0.000000
codepage try 52     cp1250   hu : 0.002687 0.044525 0.000041 - 0.003931 0.004014 0.000003  :  0.015245
codepage try 53     cp1250   hu : 0.002416 0.045296 0.000030 - 0.003836 0.003947 0.000002  :  0.011228
Detected codepage:cp1251 lang:bg index:10

I don't want to understand the algorithm used :) I thought about just reordering the stuff so that Russian is no longer first, but that wouldn't help with the 0062.txt file.
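Judging only from the four logs above (not from reading crtxtenc.cpp), the "Detected" line appears to match the row with the highest value in the final column, with ties going to the lowest index. The Python sketch below parses a few rows copied from the 0062.txt log and applies that assumed rule; the regex and the name `best_candidate` are illustrative.

```python
import re

# Matches rows such as:
# "codepage try 10     cp1251   bg : 0.007502 ... :  0.339644"
LINE = re.compile(r"codepage try\s+(\d+)\s+(\S+)\s+(\S+)\s*:.*:\s*([0-9.]+)\s*$")


def best_candidate(log: str):
    """Return (index, codepage, lang, score) for the highest final-column score."""
    rows = []
    for line in log.splitlines():
        m = LINE.search(line)
        if m:
            rows.append((int(m.group(1)), m.group(2), m.group(3), float(m.group(4))))
    # max() returns the first maximal row, so ties go to the lowest index.
    return max(rows, key=lambda r: r[3]) if rows else None


sample = """\
codepage try 0     cp1251   ru : 0.007533 0.002000 0.001937 - 0.006846 0.000036 0.000036  :  0.284523
codepage try 10     cp1251   bg : 0.007502 0.002280 0.002280 - 0.006956 0.000059 0.000059  :  0.339644
codepage try 29      cp775   ee : 0.001981 0.054261 0.000501 - 0.003508 0.004527 0.000014  :  0.198422
codepage try 46     cp1252   de : 0.002873 0.050610 0.000369 - 0.004353 0.004796 0.000007  :  0.108167
"""
print(best_candidate(sample))  # (10, 'cp1251', 'bg', 0.339644)
```

Under that reading, the English sample falls back to index 0 simply because every score is zero, and reordering the table would only change which candidate wins such ties. It would not change the 0062.txt result, where cp1251/bg has the highest score outright.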

hius07 commented 3 years ago

Another non-UTF-8 txt file (it is UTF-16 LE) - crengine shows just a white screen. A text editor shows it fine.

001.txt

Frenzie commented 3 years ago

My editor seems to think it's UTF-8 (or defaults to it) and displays it fine afaict? Unless it doesn't say тиритротот.
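One way to settle which encoding a file like 001.txt really uses is to look for a byte order mark at the start. The sketch below uses the BOM constants from Python's standard `codecs` module; the function name `sniff_bom` is made up, and a file without a BOM simply yields `None`, which says nothing either way.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM begins with
# the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return an encoding name if `data` starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

Reading the first four bytes of the file in binary mode and passing them to `sniff_bom` is enough for this check.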

hius07 commented 3 years ago

тиритротот

Exactly. Why does KOReader show a white screen?

poire-z commented 3 years ago

Probably because it is too small. Add a few more lines and it shows correctly.

hius07 commented 3 years ago

Probably because it is too small.

I've just made a new txt file in Text editor with only торо in it, cre shows okay.

poire-z commented 3 years ago

https://github.com/koreader/crengine/blob/bc2b31e6b37fdb7eca78b6b96dde181ad0ca6f72/crengine/src/lvdocview.cpp#L4770-L4792

Both торо and тиритротот fail the LVTextParser because it requires at least 16 chars. So they go through LVTextRobustParser, added by https://github.com/koreader/crengine/pull/39, which just skips the encoding detection and the 16 chars requirement and assumes utf8. Anyway, both use some code in lvxml.cpp, LVTextFileBase, that seems to work with lines. Why it would fail with one single line of a few chars and not with another single line of a few chars, I dunno :/ I'm not familiar with it, and I don't want to get familiar with it :) It's thousands of lines of code dedicated to parsing text files and their various peculiarities, which I'm not really interested in :/ My investigation ends there :)

hius07 commented 3 years ago

Thanks for that! The example is from the CoolReader 4pda.ru forum, unfortunately, no devs are there but virxkane.

Frenzie commented 3 years ago

Wasn't single line or empty or something a known issue, or was that resolved in the past few years?
