sylikc / pyexiftool

PyExifTool (active PyPI project) - A Python library to communicate with an instance of Phil Harvey's ExifTool command-line application. Runs one process with special -stay_open flag, and pipes data to/from. Much more efficient than running a subprocess for each command!

Decoding error when try to open Korean language file #88

Open llitkr opened 7 months ago

llitkr commented 7 months ago

Your library has been working really well for me, but I've noticed one odd thing. If the name of a file contains certain Korean characters, I get a decoding error and the file's metadata cannot be read. If I rename the file to something else, the metadata reads normally.

import os
import sys
from pathlib import Path

from exiftool import ExifToolHelper, exceptions


def getFilesFromDirectory(directory):       # return every file directly inside `directory`
    fileList = []       # start with an empty list
    myName = Path(sys.argv[0]).name     # name of this script, so it can be skipped
    for filename in directory.iterdir():    # walk every entry in the directory
        if filename.is_file() and filename.name != myName:      # keep regular files only, not folders
            fileList.append(filename)   # append to the end of fileList
    return fileList     # hand the list back to the caller

mainDirectory = "./2023-02-05"

files = getFilesFromDirectory(Path(mainDirectory))     # collect every file in the test folder "./2023-02-05"
fileList = files[:]
e = ExifToolHelper()    # start ExifToolHelper as e
for i in range(len(files)):     # repeat once per entry in files
    f = files[i]        # f is the i-th path
    files[i] = {}
    files[i]['dirName'] = os.path.dirname(os.path.abspath(f))
    files[i]['filename'] = os.path.splitext(os.path.basename(f))[0]
    files[i]['fullName'] = os.path.basename(f)
    files[i]['fullPath'] = f
    files[i]['ext'] = os.path.splitext(f)[1][1:]
    try:
        m = e.get_metadata(f)       # read the EXIF metadata of f into m  <-- the call that fails
        if "ExifTool:Warning" in m[0]:
            if m[0]["ExifTool:Warning"] == "End of processing at large atom (LargeFileSupport not enabled)":
                m = e.get_metadata(f, params=['-api', 'largefilesupport=1'])
    except exceptions.ExifToolExecuteError as err:  # ExifTool itself failed (.modd files are fine, but .moff files fail here)
        print(f"File: {f}")      # print the path of f
        print(f"Could not read EXIF data from this file: error ({err})")
        continue
    except Exception as err:    # anything else, including the decoding error reported here
        print(f"File: {f}")      # print the path of f
        print(f"e.get_metadata failed. Could not read EXIF data from this file: error ({err})")
        continue

This code works well for the most part, but the call m = e.get_metadata(f) inside the try block often raises an error. It happens when the filename contains certain Korean characters; one filename that reliably triggers it is "ㅍ휸ㅇ류.JPG".

The error message is:

'cp949' codec can't decode byte 0xb7 in position 37: illegal multibyte sequence

I looked up this error, and the usual advice is to specify UTF-8 as the encoding when opening the file in Python. However, Python opens the file fine and displays its other properties correctly. Is there an option in ExifTool (or pyexiftool) to control the encoding?
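The message above is produced by Python's own decoder, not by opening the file. The sketch below reproduces the same class of error in plain Python, assuming (the thread does not show the raw bytes) that the text coming back from ExifTool is in a different encoding than the one Python uses to decode it. `decode_report` is a hypothetical helper written for this illustration, not part of pyexiftool.

```python
def decode_report(raw: bytes, encoding: str) -> str:
    """Decode raw bytes; on failure, describe the error instead of raising."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as err:
        return f"{encoding} failed at byte {err.start}: {err.reason}"


name = "ㅍ휸ㅇ류.JPG"
utf8_bytes = name.encode("utf-8")    # what a UTF-8 producer would emit
cp949_bytes = name.encode("cp949")   # what a cp949 producer would emit

# Matching encoder and decoder: the name survives the round trip.
print("utf-8 round trip ok:", decode_report(utf8_bytes, "utf-8") == name)
print("cp949 round trip ok:", decode_report(cp949_bytes, "cp949") == name)

# Mismatched pair: UTF-8 bytes handed to the cp949 decoder.
print("utf-8 bytes read as cp949:", decode_report(utf8_bytes, "cp949"))

# errors="replace" avoids the exception but does not recover the name.
lenient = utf8_bytes.decode("cp949", errors="replace")
print("lenient decode equals original:", lenient == name)
```

A lenient error handler only hides the crash. The name is recovered only when the decoder matches whatever encoding the producer actually used.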

sylikc commented 5 months ago

This might be related to the fs encoding stuff which I had a lot of trouble with early on. I'm going to try to replicate this issue when I have time, and try to create a test case. See if I can find code to work around this.

sylikc commented 5 months ago

Ok, I can replicate this issue... question, why are you using pathlib and os.path at the same time?!

after renaming a file to ㅍ휸ㅇ류.JPG , I can replicate the crash... I would probably try asking on the exiftool.org forums to see if there's any specific encoding problems.

I tried utf-8, euc_kr, and cp949, and none of them worked.

from exiftool import ExifToolHelper
from pathlib import Path
import logging
import sys

logging.basicConfig(level=logging.DEBUG)

def getFilesFromDirectory(p):       # return every file directly inside p, skipping this script
    return [x for x in p.iterdir() if x.is_file() and x.name != Path(sys.argv[0]).name]

mainDirectory = "./"

files = getFilesFromDirectory(Path(mainDirectory))     # collect every file in the current directory

e = ExifToolHelper(encoding='euc_kr', logger=logging.getLogger(__name__))    # start ExifToolHelper as e with an explicit encoding
print(e.encoding)

for f in files:     # repeat once per file
    m = e.get_metadata(f)       # read the EXIF metadata of f into m

It's a bug, but it may not be in pyexiftool... I'm not sure which Korean encoding should be used for that.

llitkr commented 5 months ago

Hello, and thank you for the reply.

Honestly, I don't know why I used both pathlib and os.path.

Python is not my specialty. I got help from ChatGPT while building this side project, and it wrote this code when I asked how to read the list of files in a folder.

If there is a better way, I would appreciate it if you could let me know.
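On the pathlib question: the same fields can be collected with pathlib alone. The sketch below is one possible arrangement, not the maintainer's recommendation. The helper names `list_files` and `describe` are hypothetical, and the dictionary keys mirror the ones in the original script.

```python
from pathlib import Path


def list_files(directory: Path, skip_name: str = "") -> list:
    """Return the regular files directly inside `directory`, sorted by name."""
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.name != skip_name
    )


def describe(path: Path) -> dict:
    """Collect the same fields as the original script, using pathlib only."""
    return {
        "dirName": str(path.absolute().parent),  # os.path.dirname(os.path.abspath(f))
        "filename": path.stem,                   # name without the last extension
        "fullName": path.name,                   # os.path.basename(f)
        "fullPath": path,
        "ext": path.suffix[1:],                  # extension without the leading dot
    }


info = describe(Path("2023-02-05") / "sample.JPG")
print(info["filename"], info["fullName"], info["ext"])
```

`describe` never touches the file system, so it also works for paths that do not exist yet. Only `list_files` needs a real directory.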

Anyway, this bug is very strange. The error always occurs when the file name is "ㅍ휸ㅇ류.JPG", but not when it is "ㅍ휸류ㅇ류.JPG" (just one letter added). It occurs again with "ㅍ휸휸휸ㅇ류.JPG". The letter "휸" certainly seems to be involved, but its presence alone does not guarantee a failure: whether the error appears depends on the letters around it.

This problem is really interesting. If I had been deeply involved in the development of this program, I would have reproduced the problem step by step through debugging and looked for the cause.
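One way to start the step-by-step investigation described above is to compare the bytes of the names that fail with the one that works. The sketch below only prints the byte sequences under UTF-8 and cp949. It does not explain the bug, and which encoding ExifTool actually emits on the reporter's machine is not established in this thread. `byte_dump` is a hypothetical helper.

```python
# File names from the report, with the outcome the reporter observed.
names = [
    ("ㅍ휸ㅇ류.JPG", "fails"),
    ("ㅍ휸휸휸ㅇ류.JPG", "fails"),
    ("ㅍ휸류ㅇ류.JPG", "works"),
]


def byte_dump(name: str, encoding: str) -> str:
    """Return the encoded bytes of `name` as space-separated hex pairs."""
    return name.encode(encoding).hex(" ")


for index, (name, outcome) in enumerate(names, start=1):
    print(f"name {index} ({outcome} per the report)")
    for encoding in ("utf-8", "cp949"):
        print(f"  {encoding:>6}: {byte_dump(name, encoding)}")
```

Lining up the two failing rows against the working one shows which byte offsets differ, which can then be compared with the position reported in the error message.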

sylikc commented 5 months ago

This is more of an encoding bug. I don't really know which code page or encoding is being used.

Does it work on the command line? Perhaps try ExifTool's forums.
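To check the command-line behaviour from Python without pyexiftool in the way, the sketch below runs ExifTool directly and keeps its output as raw bytes, so nothing is decoded until the bytes have been inspected. It assumes an `exiftool` executable is on the PATH. The `-charset filename=utf8` option is ExifTool's documented way to declare the encoding of file names given on the command line; whether it resolves this particular issue is untested here. `build_command` and `run_raw` are hypothetical helpers.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def build_command(path: Path, exe: str = "exiftool") -> List[str]:
    """Build an ExifTool call that declares file names as UTF-8 and prints JSON."""
    return [exe, "-charset", "filename=utf8", "-j", "-G", str(path)]


def run_raw(path: Path) -> Optional[bytes]:
    """Run ExifTool and return stdout as undecoded bytes, or None if it is not installed."""
    exe = shutil.which("exiftool")
    if exe is None:
        return None
    result = subprocess.run(build_command(path, exe), capture_output=True, check=False)
    return result.stdout


raw = run_raw(Path("ㅍ휸ㅇ류.JPG"))
if raw is None:
    print("exiftool not found on PATH")
else:
    # Show the first bytes as hex so the encoding of the output can be inspected.
    print(raw[:80].hex(" "))
```

If the raw bytes turn out to be valid UTF-8, the same two arguments could be tried from pyexiftool through the `params=` argument that the original script already uses for `largefilesupport`, together with `encoding='utf-8'` on the helper.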