wagnerrp / pytmdb3

Python interface to TheMovieDB.org v3 API
BSD 3-Clause "New" or "Revised" License

Request params encoded using system encoding #39

Open gazpachoking opened 10 years ago

gazpachoking commented 10 years ago

Perhaps I'm misunderstanding how this is supposed to work, but it looks like all request parameters are encoded using the system locale encoding (https://github.com/wagnerrp/pytmdb3/blob/master/tmdb3/request.py#L70). This causes problems when the system locale cannot encode all the characters in the parameters. Also, I have no idea how tmdb is expected to know which encoding was used for the parameters; I suspect the library should be using a constant encoding defined by the tmdb api. Portion of a relevant traceback:

  File "/usr/local/lib/python2.7/dist-packages/flexget/plugins/api_tmdb.py", line 293, in lookup
    result = _first_result(tmdb3.tmdb_api.searchMovie(title.lower(), adult=True, year=year))
  File "/usr/local/lib/python2.7/dist-packages/tmdb3/tmdb_api.py", line 128, in searchMovie
    return MovieSearchResult(Request('search/movie', **kwargs), locale=locale)
  File "/usr/local/lib/python2.7/dist-packages/tmdb3/request.py", line 71, in __init__
    kwargs[k] = locale.encode(v)
  File "/usr/local/lib/python2.7/dist-packages/tmdb3/locales.py", line 110, in encode
    return dat.encode(self.encoding)
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-13: ordinal not in range(256)

Downstream ticket: http://flexget.com/ticket/2392
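
The failure in the traceback can be reproduced without tmdb3 or a network connection. The following is a minimal sketch in Python 3 (the thread itself uses Python 2); `try_encode` is a hypothetical helper written for this illustration, not part of the library. It shows that a locale codec such as latin-1 cannot represent "П", while utf-8 can:

```python
# Standalone illustration; it does not import tmdb3.
title = "Generation \u041f"  # U+041F is the Cyrillic capital letter "П"


def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


# latin-1 has no code point for "П", so encoding fails, as in the traceback.
latin1_result = try_encode(title, "latin-1")

# utf-8 can represent every Unicode code point, so encoding succeeds.
utf8_result = try_encode(title, "utf-8")
```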

gazpachoking commented 10 years ago

Did a bit of testing, looks like tmdb is expecting utf-8 encoding. Did a bit of a hack to get things working again:

# Before. Broken
>>> tmdb3.tmdb_api.searchMovie(u'Generation П')[0]
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "build\bdist.win32\egg\tmdb3\tmdb_api.py", line 128, in searchMovie
    return MovieSearchResult(Request('search/movie', **kwargs), locale=locale)
  File "build\bdist.win32\egg\tmdb3\request.py", line 70, in __init__
    kwargs[k] = locale.encode(v)
  File "build\bdist.win32\egg\tmdb3\locales.py", line 110, in encode
    return dat.encode(self.encoding)
  File "C:\Users\chase.sterling\PycharmProjects\Flexget\lib\encodings\cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u041f' in position 11: character maps to <undefined>

# Hack to fix encoding
>>> tmdb3.locales.set_locale("en", "us", True)
>>> tmdb3.locales.syslocale.encoding = 'utf-8'

# After. Working.
>>> tmdb3.tmdb_api.searchMovie(u'Generation П')[0]
<Movie 'Generation P' (2011)>

wagnerrp commented 10 years ago

If the user is going to be accessing unicode content, such as movies with the character "П" in the title, the library expects that the user has configured their system to handle unicode content. Specifically, that means configuring a UTF-8 language setting (LANG) in their environment.

# unconfigured default
> locale
LANG=
LC_CTYPE="C"
LC_COLLATE="C"
LC_TIME="C"
LC_NUMERIC="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
# Bourne users
> export LANG="en_US.UTF-8"
# C-shell users
> setenv LANG en_US.UTF-8
# confirmation
> locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=

The tmdb3 library will then pull that encoding from the environment using the locale library.

> projects/pytmdb3/scripts/pytmdb3.py
PyTMDB3 Interactive Shell. TAB completion available.
>>> import locale
>>> locale.getdefaultlocale()
('en_US', 'UTF-8')
>>> get_locale().encoding
'UTF-8'

gazpachoking commented 10 years ago

The problem is, we can't just pick an arbitrary encoding when sending requests to tmdb. They are expecting utf-8.

gazpachoking commented 10 years ago

The encoding the api expects has nothing to do with the platform we are running on.
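
As a rough sketch of what a platform-independent request could look like, the query string can be built with an explicit encoding rather than the locale's. This uses only the Python 3 standard library; the parameter names `query` and `year` are taken from the searchMovie call in the traceback above, and using them as the literal query keys is an assumption made for illustration:

```python
from urllib.parse import urlencode

# Parameter names are assumed for illustration; only the encoding step matters here.
params = {"query": "Generation \u041f", "year": 2011}

# The encoding is fixed to utf-8 and never read from the system locale,
# so the result is the same on every platform.
query_string = urlencode(params, encoding="utf-8")
```

Here "П" becomes the percent-encoded UTF-8 bytes %D0%9F regardless of LANG or the Windows code page.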

gazpachoking commented 10 years ago

Here is some more evidence that just picking a codec that supports all unicode codepoints still isn't correct. It has to be in the encoding tmdb is expecting in order for it to be able to decode again:


>>> tmdb3.locales.syslocale.encoding = 'utf-8'
>>> tmdb3.tmdb_api.searchMovie(u'Generation П')[0]
<Movie 'Generation P' (2011)>
>>> tmdb3.locales.syslocale.encoding = 'utf-16'
>>> tmdb3.tmdb_api.searchMovie(u'Generation П')[0]
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "C:\Users\chase.sterling\PycharmProjects\Flexget\lib\site-packages\tmdb3\tmdb_api.py", line 128, in searchMovie
    return MovieSearchResult(Request('search/movie', **kwargs), locale=locale)
  File "C:\Users\chase.sterling\PycharmProjects\Flexget\lib\site-packages\tmdb3\tmdb_api.py", line 157, in __init__
    lambda x: Movie(raw=x, locale=locale))
  File "C:\Users\chase.sterling\PycharmProjects\Flexget\lib\site-packages\tmdb3\pager.py", line 106, in __init__
    super(PagedRequest, self).__init__(self._getpage(1), 20)
  File "C:\Users\chase.sterling\PycharmProjects\Flexget\lib\site-packages\tmdb3\pager.py", line 59, in __init__
    self._data = list(iterable)
  File "C:\Users\chase.sterling\PycharmProjects\Flexget\lib\site-packages\tmdb3\pager.py", line 110, in _getpage
    res = req.readJSON()
  File "C:\Users\chase.sterling\PycharmProjects\Flexget\lib\site-packages\tmdb3\cache.py", line 118, in __call__
    data = self.func(*args, **kwargs)
  File "C:\Users\chase.sterling\PycharmProjects\Flexget\lib\site-packages\tmdb3\request.py", line 125, in readJSON
    raise e
TMDBHTTPError: HTTP Error 500: Internal Server Error
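
What TMDb does internally is not visible from this thread, but the mismatch can be modelled locally. In this Python 3 sketch, `server_decode` is a hypothetical stand-in for a server that assumes UTF-8; it is not TMDb's actual code. The same title survives the round trip when sent as utf-8 and fails to decode when sent as utf-16:

```python
from urllib.parse import quote, unquote_to_bytes

title = "Generation \u041f"

# Percent-encode the raw bytes produced by each codec.
as_utf8 = quote(title.encode("utf-8"))
as_utf16 = quote(title.encode("utf-16"))  # begins with a byte order mark


def server_decode(percent_encoded):
    """Hypothetical server step: undo percent-encoding, then decode as UTF-8."""
    return unquote_to_bytes(percent_encoded).decode("utf-8")


round_trip = server_decode(as_utf8)  # recovers the original title

try:
    server_decode(as_utf16)
    utf16_failed = False
except UnicodeDecodeError:
    # The byte order mark bytes 0xFF and 0xFE are never valid in UTF-8.
    utf16_failed = True
```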

wagnerrp commented 10 years ago

The environment does need to be configured for unicode to receive unicode responses from TMDb, due to the behavior of Python 2 itself; however, I'll need to look at this again to figure out how to handle non-bytecode encodings.

gazpachoking commented 10 years ago

This should be entirely independent of the environment. Unicode is unicode no matter what locale a user has set. Tmdb declares what encoding they accept and send for byte strings, and the python library should only expose and accept strings as unicode objects to the user. The only time an error should be raised is if the user tries to query the library with a bytestring (str, python 2) representing non-ascii characters.
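
A minimal sketch of that boundary rule, written for Python 3 where `str` is text and `bytes` plays the role of the Python 2 bytestring. `encode_param` is a hypothetical helper, not a function in tmdb3, and the utf-8 default reflects the testing reported earlier in this thread:

```python
def encode_param(value, wire_encoding="utf-8"):
    """Encode one request parameter for the wire (hypothetical helper)."""
    if isinstance(value, bytes):
        # Bytes are only unambiguous when they are pure ASCII.
        try:
            value = value.decode("ascii")
        except UnicodeDecodeError:
            raise TypeError("pass non-ASCII parameters as text, not bytes") from None
    elif not isinstance(value, str):
        # Numbers and other values are converted to text first.
        value = str(value)
    # The wire encoding is a fixed default, never taken from the locale.
    return value.encode(wire_encoding)
```

With this shape, `encode_param("äüö")` gives the same bytes on every system, and the only error path is a non-ASCII bytestring.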

gregorvolkmann commented 5 years ago

Setting tmdb3.locales.syslocale.encoding = 'utf-8' also fixed "TMDbError Internal error - Something went wrong. Contact TMDb." on tmdb3.MovieSearch('some string with äüö'). Thanks @gazpachoking!