okfn / ckanclient-deprecated

DEPRECATED - please see https://github.com/ckan/ckanapi. [Python client library for CKAN]
http://pypi.python.org/pypi/ckanclient

Problem when handling documents with unicode #32

Open dyanos opened 10 years ago

dyanos commented 10 years ago

In the ckanclient code ('_encode_multipart_formdata' in __init__.py), building the multipart/form-data body fails when the document contains Unicode characters: Python raises an error while implicitly converting the Unicode content to ASCII as the body is assembled.

To handle documents that contain Unicode characters, the multipart/form-data code needs to be modified so that it can process Unicode.
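The direction of the fix can be sketched in a few lines. This is a minimal illustration, not the ckanclient implementation: every piece of the body is encoded to UTF-8 bytes before joining, so no implicit ASCII conversion takes place. The names `to_utf8` and `build_multipart_body` and the sample field are hypothetical.

```python
import uuid

CRLF = b'\r\n'

def to_utf8(value):
    """Return value as UTF-8 bytes; byte strings pass through unchanged."""
    if isinstance(value, bytes):
        return value
    if not isinstance(value, type(u'')):
        value = u'%s' % (value,)  # numbers and other objects become text first
    return value.encode('utf-8')

def build_multipart_body(fields, boundary=None):
    """Hypothetical helper: build a multipart/form-data body from (name, value) pairs."""
    boundary = boundary or uuid.uuid4().hex
    delimiter = b'--' + to_utf8(boundary)
    parts = []
    for name, value in fields:
        parts.append(delimiter)
        parts.append(b'Content-Disposition: form-data; name="' + to_utf8(name) + b'"')
        parts.append(b'')  # blank line between part headers and part content
        parts.append(to_utf8(value))
    parts.append(delimiter + b'--')  # closing delimiter
    parts.append(b'')
    content_type = 'multipart/form-data; boundary=' + boundary
    # Every element is a byte string, so the join never decodes anything.
    return content_type, CRLF.join(parts)

content_type, body = build_multipart_body(
    [('title', u'\ub300\ud55c\ubbfc\uad6d')], boundary='xyz')
```

File parts would follow the same rule: read the file in binary mode and append its bytes untouched.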

ckanclient/__init__.py

…
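import codecs      # the following imports are needed by MultipartFormdataEncoder
import io
import mimetypes
import sys
import uuid
import logging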
logger = logging.getLogger('ckanclient')

PAGE_SIZE = 10

# from [ 
class MultipartFormdataEncoder(object):
    def __init__(self):
        self.boundary = uuid.uuid4().hex
        self.content_type = 'multipart/form-data; boundary={}'.format(self.boundary)

    @classmethod
    def u(cls, s):
        if sys.hexversion < 0x03000000 and isinstance(s, str):
            s = s.decode('utf-8')
        if sys.hexversion >= 0x03000000 and isinstance(s, bytes):
            s = s.decode('utf-8')
        return s

    def iter(self, fields, files):
        """
        fields is a sequence of (name, value) elements for regular form fields.
        files is a sequence of (name, filename, file-type) elements for data to be uploaded as files
        Yield body's chunk as bytes
        """
        encoder = codecs.getencoder('utf-8')
        for (key, value) in fields:
            key = self.u(key)
            yield encoder('--{}\r\n'.format(self.boundary))
            yield encoder(self.u('Content-Disposition: form-data; name="{}"\r\n').format(key))
            yield encoder('\r\n')
            if isinstance(value, int) or isinstance(value, float):
                value = str(value)
            yield encoder(self.u(value))
            yield encoder('\r\n')
        for (key, filename, fd) in files:
            key = self.u(key)
            filename = self.u(filename)
            yield encoder('--{}\r\n'.format(self.boundary))
            yield encoder(self.u('Content-Disposition: form-data; name="{}"; filename="{}"\r\n').format(key, filename))
            yield encoder('Content-Type: {}\r\n'.format(mimetypes.guess_type(filename)[0] or 'application/octet-stream'))
            yield encoder('\r\n')
            with fd:
                buff = fd.read()
                yield (buff, len(buff))
            yield encoder('\r\n')
        yield encoder('--{}--\r\n'.format(self.boundary))

    def encode(self, fields, files):
        body = io.BytesIO()
        for chunk, chunk_len in self.iter(fields, files):
            body.write(chunk)
        return self.content_type, body.getvalue()
# to ]

class CkanApiError(Exception): pass
…
…
    #
    # Private Helpers
    #
    def _post_multipart(self, url, fields, files):
        '''Post fields and files to an http host as multipart/form-data.

        Taken from
        http://code.activestate.com/recipes/146306-http-client-to-post-using-multipartform-data/

        :param fields: a sequence of (name, value) tuples for regular form
            fields
        :param files: a sequence of (name, filename, value) tuples for data to
            be uploaded as files

        :returns: the server's response page

        '''
        content_type, body = MultipartFormdataEncoder().encode(fields, files) # modified this line
        headers = {'Content-Type': content_type}

        # If we got a relative url from the api, we need to build an absolute one
        url = urlparse.urljoin(self.base_location, url)

        # If we are posting to ckan, we need to add ckan auth headers.
        if url.startswith(urlparse.urljoin(self.base_location, '/')):
            headers.update({
                'Authorization': self.api_key,
                'X-CKAN-API-Key': self.api_key,
            })

        request = Request(url, data=body, headers=headers)
        response = urlopen(request)
        return response.getcode(), response.read()
…
rufuspollock commented 10 years ago

@hoedic any thoughts here?

Hoedic commented 10 years ago

I am not sure I understand where the MultipartFormdataEncoder class in the code snippet comes from...

In any case, the code in the _post_multipart function is the old one, which indeed does not support unicode. @dyanos, can you try the code from the latest commit (https://github.com/okfn/ckanclient/commit/2cd7096f1f9b5aa859281c899d8d5eda821762b9)? Hopefully it will work in your case. If not, please post the traceback that you get.

rufuspollock commented 10 years ago

@Hoedic should we be pushing a new release of ckanclient with your fixes in?

dyanos commented 10 years ago

@Hoedic: I took the source code of "MultipartFormdataEncoder" from an answer to a Stack Overflow question (http://stackoverflow.com/questions/1270518/python-standard-library-to-post-multipart-form-data-encoded-data/1270548#1270548).

I also tried the code from the latest commit and got the following error message:

Traceback (most recent call last):
  File "sample.py", line 51, in <module>
    main()
  File "sample.py", line 40, in main
    location, tmp = ckan.upload_file(resource_info['@file'])
  File "C:\Users\Sim\Documents\Projects\regdb\regdb\ckanclient\__init__.py", line 584, in upload_file
    errcode, body = self._post_multipart(auth_dict['action'].encode('ascii'), fields, files)
  File "C:\Users\Sim\Documents\Projects\regdb\regdb\ckanclient\__init__.py", line 490, in _post_multipart
    content_type, body = self._encode_multipart_formdata(fields, files)
  File "C:\Users\Sim\Documents\Projects\regdb\regdb\ckanclient\__init__.py", line 537, in _encode_multipart_formdata
    body = CRLF.join(L)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xec in position 1285: ordinal not in range(128)
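The byte 0xec in this message is consistent with UTF-8 encoded Korean text: it is the lead byte of a three-byte sequence. On Python 2, joining a list that mixes unicode objects and byte strings makes the interpreter decode the byte strings with the ASCII codec, which fails on such a byte. The snippet below is a small reproduction under that assumption; the sample string is illustrative and the decode is made explicit so that it behaves the same on Python 3.

```python
text = u'\uc804\uccb4'       # illustrative Korean sample string
raw = text.encode('utf-8')   # b'\xec\xa0\x84\xec\xb2\xb4'

first_byte = raw[:1]         # lead byte of the first three-byte sequence

try:
    # Python 2 performs this decode implicitly when a byte string is
    # combined with a unicode object, for example inside CRLF.join(L).
    raw.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(repr(first_byte))
print(error)
```

The position reported in the traceback (1285) is simply the offset of the first non-ASCII byte in whichever element was being decoded.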
Hoedic commented 10 years ago

@dyanos: did you pull the latest master branch? Since the last pull request (15 days ago), the _encode_multipart_formdata function has been removed and the _post_multipart function uses pycurl to build the message: https://github.com/okfn/ckanclient/blob/master/ckanclient/__init__.py#L479

@rgrp: Before pushing a new version, it would be great to have a few more people test this code. On top of that, I hope to integrate and test the python-requests lib around the end of the year.

dyanos commented 10 years ago

@Hoedic: I'm sorry, I was using the wrong branch's code. I fixed that and retried, but I got the following error message:

Traceback (most recent call last):
  File "sample.py", line 51, in <module>
    main()
  File "sample.py", line 40, in main
    location, tmp = ckan.upload_file(resource_info['@file'])
  File "/home/dyanos/rdf/ckanclient/__init__.py", line 570, in upload_file
    errcode, body = self._post_multipart(auth_dict['action'].encode('ascii'), fields, files)
  File "/home/dyanos/rdf/ckanclient/__init__.py", line 502, in _post_multipart
    c.setopt(c.URL, url)
TypeError: invalid arguments to setopt

I understand that this error occurs because the 'url' variable holds a unicode string. So I converted it to a non-unicode string (using str() for testing) and got the following message:

Traceback (most recent call last):
  File "sample.py", line 51, in <module>
    main()
  File "sample.py", line 40, in main
    location, tmp = ckan.upload_file(resource_info['@file'])
  File "/home/dyanos/rdf/ckanclient/__init__.py", line 570, in upload_file
    errcode, body = self._post_multipart(auth_dict['action'].encode('ascii'), fields, files)
  File "/home/dyanos/rdf/ckanclient/__init__.py", line 508, in _post_multipart
    'Accept-Encoding: identity'
TypeError: list items must be string objects

I handle unicode strings in my Python source code and documents, and my Linux 'LANG' variable is 'en_US.UTF-8'. Could these be related?
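Both TypeErrors point at unicode objects reaching pycurl, first as the URL and then as an item of the header list. One way to rule that out is to coerce everything to ASCII byte strings just before the setopt calls. The sketch below assumes that the header list is the one passed to pycurl's HTTPHEADER option; `to_ascii_bytes` and `prepare_curl_args` are hypothetical helpers and the Authorization value is a placeholder.

```python
def to_ascii_bytes(value):
    """Encode text to ASCII bytes; leave byte strings untouched."""
    if isinstance(value, bytes):
        return value
    return value.encode('ascii')

def prepare_curl_args(url, headers):
    """Hypothetical helper: coerce a URL and header lines before handing them to pycurl."""
    return to_ascii_bytes(url), [to_ascii_bytes(line) for line in headers]

url, headers = prepare_curl_args(
    u'http://data.datahub.kr/storage/upload_handle',
    [u'Authorization: not-a-real-key', 'Accept-Encoding: identity'],
)

# With pycurl, these values would then be passed on as
#   c.setopt(c.URL, url)
#   c.setopt(c.HTTPHEADER, headers)
# pycurl is deliberately not imported in this sketch.
```

Encoding with 'ascii' instead of 'utf-8' makes a non-ASCII URL or header fail loudly at this point instead of deep inside pycurl.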

Hoedic commented 10 years ago

Well, we are making progress: we have a new error message!

My code forces the url to ASCII (auth_dict['action'].encode('ascii')), and surprisingly that does not seem to be the issue. However, it really seems that the type of the url is incorrect. Can you print the url value just before it is used at line 502? Or is your code available somewhere so that I can have a look?

dyanos commented 10 years ago

Hi @Hoedic, I'm sorry for the late reply. The printed value of url is:

http://data.datahub.kr/storage/upload_handle

Hoedic commented 10 years ago

Could you share the actual code that calls the CKAN client? I see a sample.py in the traceback; can I see it? Failing that, I would at least like to know what (type, value) is passed to the upload function as resource_info['@file'].
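A plain print shows only the characters of a string, so a unicode object and a byte string with the same content look identical in the output. Reporting the type and the repr makes the difference visible. This is a small debugging sketch; `describe` is a hypothetical helper and the URLs are placeholders.

```python
def describe(value):
    """Return the type name and repr of a value, which reveal what print hides."""
    return type(value).__name__, repr(value)

for candidate in (u'http://example.org/upload', b'http://example.org/upload'):
    kind, shown = describe(candidate)
    print(kind + ': ' + shown)
```

On Python 2 the two lines are labelled unicode and str; on Python 3 they are labelled str and bytes.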

dyanos commented 10 years ago

@Hoedic : I uploaded the source code of 'sample.py' here.

import ckanclient
import os,string,sys,json

# http://chanik.egloos.com/3685653

def usage():
  print "%s <json file of description of package>" % (sys.argv[0])
  print
  print "Reference Site: "
  print "The format to register a package : <http://docs.ckan.org/en/latest/api.html#ckan.logic.action.create.package_create>"
  print "The format to register a package resource : <http://docs.ckan.org/en/latest/api.html#ckan.logic.action.create.resource_create>"
  sys.exit(-1)

def filtering(raw_data):
  data = {}
  for key in filter(lambda x: not x.startswith('@'), raw_data.keys()):
    data[key] = raw_data[key]
  print data
  return data

def main():
  configFilename = sys.argv[1]
  if not os.path.exists(configFilename): 
    print "file not exists : %s" % (configFilename)
    sys.exit(-1)

  data = json.loads(open(configFilename).read())

  ckan = ckanclient.CkanClient(base_location=data['endpoint'], api_key=data['api_key'])
  for package in data['packages']:
    print "-"*80
    properties = filtering(package)
    ckan.package_register_post(properties)

    for resource_info in package['@resource']:
      resource_properties = filtering(resource_info)

      location = ''
      if resource_info.has_key('@file'):
        location, tmp = ckan.upload_file(resource_info['@file'])
      elif resource_info.has_key('@url'):
        location = resource_info['@url']

      print location 
      ckan.add_package_resource(package_name=properties['name'], file_path_or_url='http://data.datahub.kr'+location, **resource_properties)

if __name__ == '__main__':
  if len(sys.argv) == 1:
    usage()

  main()

resource_info['@file'] is Subway-Line-0523.rdf, and at line 199 of __init__.py the url is http://data.datahub.kr/api/storage/auth/form/2013-12-11T100031/Subway-Line-0523.rdf.

And this is the full output, including the error message:

{u'maintainer': u'OKFN Korea', u'name': u'korea-street-name-code2', u'author': u'OKFN Korea', u'author_email': u'okfn.korea@gmail.com', u'notes': u'\ub300\ud55c\ubbfc\uad6d \ub3c4\ub85c\uba85 \ucf54\ub4dc \uc628\ud1a8\ub85c\uc9c0 \ub370\uc774\ud130', u'title': u'\ub300\ud55c\ubbfc\uad6d \ub3c4\ub85c\uba85 \ucf54\ub4dc \ub370\uc774\ud130(test)', u'maintainer_email': u'okfn.korea@gmail.com'}
http://data.datahub.kr/api/rest/package
{u'name': u'\uc804\uccb4 \ub3c4\ub85c\uba85 \ucf54\ub4dc \ub370\uc774\ud130', u'license': u'CC0', u'created': u'2013-11-10 16:00:00', u'format': u'rdf', u'resource_type': u'data', u'description': u'\ub300\ud55c\ubbfc\uad6d \ub3c4\ub85c\uba85 \ucf54\ub4dc \uc628\ud1a8\ub85c\uc9c0 \ub370\uc774\ud130'}
Subway-Line-0523.rdf
http://data.datahub.kr/api/storage/auth/form/2013-12-11T100031/Subway-Line-0523.rdf
http://data.datahub.kr/storage/upload_handle
Traceback (most recent call last):
  File "sample.py", line 52, in <module>
    main()
  File "sample.py", line 41, in main
    location, tmp = ckan.upload_file(resource_info['@file'])
  File "/home/dyanos/rdf/ckanclient/__init__.py", line 572, in upload_file
    errcode, body = self._post_multipart(auth_dict['action'].encode('ascii'), fields, files)
  File "/home/dyanos/rdf/ckanclient/__init__.py", line 504, in _post_multipart
    c.setopt(c.URL, url)
TypeError: invalid arguments to setopt
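A likely explanation, given sample.py: json.loads returns every string as text, which on Python 2 means unicode objects. The base_location and api_key passed to CkanClient are therefore unicode, and joining the ASCII-encoded action with a unicode base_location yields a unicode url again, which pycurl then rejects. The sketch below normalises those two values when the configuration is loaded. It is an assumption about the cause, not a confirmed fix; `native_str` and `load_client_settings` are hypothetical names and the JSON values are placeholders.

```python
import json
import sys

def native_str(value, encoding='ascii'):
    """Return value as the interpreter's native str type.

    On Python 2 this encodes unicode objects to byte strings; on Python 3
    text is already the native type, so the value is returned unchanged.
    """
    if sys.version_info[0] < 3 and not isinstance(value, str):
        return value.encode(encoding)
    return value

def load_client_settings(text):
    """Hypothetical helper: parse the JSON config and normalise the values given to CkanClient."""
    data = json.loads(text)
    return native_str(data['endpoint']), native_str(data['api_key'])

endpoint, api_key = load_client_settings(
    '{"endpoint": "http://example.org/api", "api_key": "not-a-real-key"}'
)
```

In sample.py the two results would take the place of data['endpoint'] and data['api_key'] in the CkanClient constructor call.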