m00natic / vlfi

View Large Files in Emacs

Opening file with wrong encoding. #33

Open cmal opened 7 years ago

cmal commented 7 years ago

Hi, I find that vlf does not detect a file's encoding and open it with the right one the way GNU Emacs' default find-file does.

I can open the same file with the right encoding using find-file.

How can I open the file with the right encoding?

Thanks!

m00natic commented 7 years ago

Can you tell me what encoding find-file reports? After opening the file, this can be checked with:

M-x describe-current-coding-system

Maybe I'll be able to reproduce it on an arbitrary file of my own and attempt some tweaks. Otherwise, it's a known issue that detecting the correct encoding when starting at a random part of a file is imperfect: #16
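
To illustrate why per-batch detection is imperfect, here is a minimal Python sketch. It is not VLF's Emacs Lisp code and not how Emacs detects coding systems; the `guess_encoding` helper and its two-entry candidate list are hypothetical stand-ins for a priority-ordered detector. It only shows that two batches of the same file, judged independently, can yield different answers.

```python
from typing import Optional

# Hypothetical priority list, loosely mirroring "utf-8 first, then gbk".
CANDIDATES = ["utf-8", "gbk"]


def guess_encoding(batch: bytes) -> Optional[str]:
    """Return the first candidate that decodes the batch without errors."""
    for encoding in CANDIDATES:
        try:
            batch.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None


# A GBK file whose first 20 bytes happen to be plain ASCII.
header = "plain ascii header\r\n"
data = (header + "中文内容\r\n").encode("gbk")

first_batch = data[:len(header)]   # ASCII only: valid in both encodings
second_batch = data[len(header):]  # contains GBK multibyte characters

print(guess_encoding(first_batch))
print(guess_encoding(second_batch))
```

The ASCII-only batch satisfies the first candidate, while the batch with Chinese text only satisfies the second, so a viewer that re-detects on every batch may switch encodings within one file.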

cmal commented 7 years ago

Coding system for saving this buffer:
  c -- chinese-gbk-dos (alias: gbk-dos cp936-dos windows-936-dos)

Default coding system (for new files):
  U -- utf-8-unix (alias: mule-utf-8-unix)

Coding system for keyboard input:
  U -- utf-8-unix (alias: mule-utf-8-unix)

Coding system for terminal output:
  U -- utf-8-unix (alias: mule-utf-8-unix)

Coding system for inter-client cut and paste:
  nil

Defaults for subprocess I/O:
  decoding: U -- utf-8-unix (alias: mule-utf-8-unix)

  encoding: U -- utf-8-unix (alias: mule-utf-8-unix)

Priority order for recognizing coding systems when reading files:
  1. utf-8 (alias: mule-utf-8)
  2. chinese-gbk (alias: gbk cp936 windows-936)
  3. iso-2022-cn (alias: chinese-iso-7bit)
  4. chinese-big5 (alias: big5 cn-big5 cp950)
  5. chinese-iso-8bit (alias: cn-gb-2312 euc-china euc-cn cn-gb gb2312)
  6. iso-2022-7bit 
  7. iso-2022-8bit-ss2 
  8. emacs-mule 
  9. raw-text 
  10. iso-2022-jp (alias: junet)
  11. in-is13194-devanagari (alias: devanagari)
  12. utf-8-auto 
  13. utf-8-with-signature 
  14. utf-16 
  15. utf-16be-with-signature (alias: utf-16-be)
  16. utf-16le-with-signature (alias: utf-16-le)
  17. utf-16be 
  18. utf-16le 
  19. japanese-shift-jis (alias: shift_jis sjis)
  20. undecided 

  Other coding systems cannot be distinguished automatically
  from these, and therefore cannot be recognized automatically
  with the present coding system priorities.

Particular coding systems specified for certain file names:

  OPERATION TARGET PATTERN      CODING SYSTEM(s)
  --------- --------------      ----------------
  File I/O      "\\.dz\\'"              (no-conversion . no-conversion)
                "\\.txz\\'"             (no-conversion . no-conversion)
                "\\.xz\\'"              (no-conversion . no-conversion)
                "\\.lzma\\'"            (no-conversion . no-conversion)
                "\\.lz\\'"              (no-conversion . no-conversion)
                "\\.g?z\\'"             (no-conversion . no-conversion)
                "\\.\\(?:tgz\\|svgz\\|sifz\\)\\'"
                                        (no-conversion . no-conversion)
                "\\.tbz2?\\'"           (no-conversion . no-conversion)
                "\\.bz2\\'"             (no-conversion . no-conversion)
                "\\.Z\\'"               (no-conversion . no-conversion)
                "\\.elc\\'"             utf-8-emacs
                "\\.el\\'"              prefer-utf-8
                "\\.utf\\(-8\\)?\\'"    utf-8
                "\\.xml\\'"             xml-find-file-coding-system
                "\\(\\`\\|/\\)loaddefs.el\\'"
                                        (raw-text . raw-text-unix)
                "\\.tar\\'"             (no-conversion . no-conversion)
                "\\.po[tx]?\\'\\|\\.po\\."
                                        po-find-file-coding-system
                "\\.\\(tex\\|ltx\\|dtx\\|drv\\)\\'"
                                        latexenc-find-file-coding-system
                ""                      (undecided)
  Process I/O   nothing specified
  Network I/O   nothing specified

cmal commented 7 years ago

I found that vlf correctly opens a smaller file that I cut from the beginning of the large file, even though the large file itself cannot be opened correctly.

m00natic commented 7 years ago

Thank you for the details! It seems in line with what I once observed with utf-16. The case back then was that there were some magic header bytes at the beginning of the file which specified the encoding. Inserting an arbitrary batch from anywhere other than the beginning doesn't get this information, so the insert function is unable to detect the proper encoding.

Probably in such cases VLF has to keep track of the initially observed encoding and use it when auto-detection fails on other batches. I'll look deeper, probably this weekend, and hopefully come up with a solution this time. Keep your file around just in case ;-)
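
The idea of remembering the initially observed encoding can be sketched as follows. This is a Python illustration under stated assumptions, not the fix that was pushed to VLF (which is Emacs Lisp): `detect_from_bom` and `BatchDecoder` are hypothetical names, detection is reduced to checking for a byte order mark, and UTF-32 marks are ignored for brevity.

```python
import codecs
from typing import Optional, Tuple

# Magic header bytes (byte order marks) and the encodings they indicate.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_from_bom(batch: bytes) -> Tuple[Optional[str], int]:
    """Return (encoding, mark length) if the batch starts with a known mark."""
    for bom, name in BOMS:
        if batch.startswith(bom):
            return name, len(bom)
    return None, 0


class BatchDecoder:
    """Hypothetical helper: reuse the first batch's encoding as a fallback."""

    def __init__(self) -> None:
        self.initial_encoding: Optional[str] = None

    def decode(self, batch: bytes, is_first: bool = False) -> str:
        detected, skip = detect_from_bom(batch)
        if is_first and detected is not None:
            self.initial_encoding = detected
        # Fall back to the remembered encoding, then to an assumed default.
        encoding = detected or self.initial_encoding or "utf-8"
        return batch[skip:].decode(encoding, errors="replace")


# A UTF-16 (little-endian) file: 2-byte mark, then 2 bytes per character.
data = codecs.BOM_UTF16_LE + "first batch, second batch".encode("utf-16-le")
split = 2 + 2 * len("first batch, ")

decoder = BatchDecoder()
print(repr(decoder.decode(data[:split], is_first=True)))
print(repr(decoder.decode(data[split:])))
```

A decoder that has never seen the first batch has nothing to fall back on and decodes the second batch with the assumed default, which is the failure described above.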

cmal commented 7 years ago

Thank you for your work. I recall that one of the chapters of the Emacs or Elisp manual describes the magic header bytes of files in other encodings.

m00natic commented 7 years ago

I've just pushed something that fixes the issue with utf-16 (at least). Hopefully it will work in this case too.

cmal commented 7 years ago

Sorry for reopening. I had opened the wrong file. The file mentioned above still cannot be opened correctly.

cmal commented 7 years ago

The file is at http://vdisk.weibo.com/s/utbH7Zm3Y8yvm , if you can access it and want to use it for testing.

To download it, please click on the image on that page, and then click on the image in the popup window.

Note that the page should not be opened on a mobile device. You can check the URL after opening it; it should not have changed to http://vdisk.weibo.com/wap/s/utbH7Zm3Y8yvm .

If you cannot access the file and want it for testing, please @ me and I will upload it to Dropbox and send it to you.

Thanks a lot!

m00natic commented 7 years ago

Got the file, thanks!

So the battle continues. I'll investigate in the coming days.