[python] Binary string response encoding with Python 3

jqr commented 8 years ago

When using binary string responses under Python 3, the encoding defaults to UTF8 which is incompatible with binary data. This looks to stem from Python 3 having str default to UTF8 and a new type called bytes which represents binary data. This is a significant departure from functionality in Python 2 and you can read more about it.

This issue does not exist in Python 2.

Swagger Revision: 18262c1b6132a28438e2c3d9426458c413f011f8 (master as of right now)

Response schema:

schema:
  type: string
  format: binary

Example Error:

Traceback (most recent call last):
  File "./dr.py", line 13, in <module>
    "document_type": "pdf",                                         # pdf or xls or xlsx
  File "/usr/local/lib/python3.5/site-packages/docraptor/apis/doc_api.py", line 203, in create_doc
    callback=params.get('callback'))
  File "/usr/local/lib/python3.5/site-packages/docraptor/api_client.py", line 322, in call_api
    response_type, auth_settings, callback)
  File "/usr/local/lib/python3.5/site-packages/docraptor/api_client.py", line 149, in __call_api
    post_params=post_params, body=body)
  File "/usr/local/lib/python3.5/site-packages/docraptor/api_client.py", line 358, in request
    body=body)
  File "/usr/local/lib/python3.5/site-packages/docraptor/rest.py", line 208, in POST
    body=body)
  File "/usr/local/lib/python3.5/site-packages/docraptor/rest.py", line 171, in request
    r.data = r.data.decode('utf8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

I'm not sure what a good general fix would look like but I have a work-around for our use by delaying byte decoding until right before JSON deserialization and short-circuiting str. You can see my fix directly on generated code.

Related Issues and PRs:

wing328 commented 8 years ago

@jqr thanks for the details. May I know if you've cycle to contribute the fix with a test case using this endpoint (the method name is automatically generated by codegen since it's empty)?

jqr commented 8 years ago

I'm not entirely sure the fix is generic enough to be used everywhere, as I only tested our use cases.

If someone else wants to run with this, they should as I won't have the time to address this in a generic fashion for a few weeks at least.

wing328 commented 8 years ago

@jqr please take your time. We've not heard anyone else requesting this feature in Python yet.

JHibbard commented 7 years ago

Binary file responses are also being parsed as utf-8, causing errors. Does anyone know a temporary workaround for this?

schema:
    type: file
    format: binary

wing328 commented 7 years ago

@JHibbard What about simply model the response as just type: file without format: binary?

microo8 commented 7 years ago

The server sends bytes and client wants to decode it with utf8. I've resolved it with removing the line: r.data = r.data.decode('utf8') in the rest.py file.

But then in the api_client.py file, in function __deserialize_file(self, response) it wants to write the response.data in an text file. So I've edited this also, so the function opens the file as "wb":

with open(path, "wb") as f:
    f.write(response.data)

But I don't know that what will now not work, but at least the file download works for me.

g-bon commented 6 years ago

@wing328 I see the same happening in python 2.7, and I think is a more generic problem that binary files are not written to disk correctly because they may be corrupted by the "w" mode instead of "wb" as pointed out by @microo8 .

I think it would be formally more correct to have wb instead since both text and binary file are expected to be deserialized in here.

wing328 commented 6 years ago

@g-bon I wonder if you can submit a PR with the suggested fix so that we can review more easily (remember to update the Python Petstore sample so that CIs can run the integration tests)

g-bon commented 6 years ago

@wing328, switching to python 3 I realized that there are actually two separate issues. The first one is what I fixed with the PR 6936 and it affected both 2 and 3

The second one to my understanding is that in rest.py if PY3 is used and _preload_content is True, the response contents is always decoded as utf8. Which doesn't make sense for non-text files.

if six.PY3:
    r.data = r.data.decode('utf8')

https://github.com/swagger-api/swagger-codegen/blob/62444a7aaf480d0fc12c048ce7c86f8978a16a89/modules/swagger-codegen/src/main/resources/python/rest.mustache#L213

A simple approach could be to catch the UnicodeDecodeError and manage / swallow it. Not great because it would also affect eventual real UnicodeDecodeErrors.
Another way could be to add a check on the content_type and make a whitelist/blacklist of content types that should be decoded to utf8.
The third and hardest approach would be to infer the content type from the actual data but feels a fragile and overkill approach to me.

Happy to work on this when I have some confirmation the problem is here / feedback on the right approach.

g-bon commented 6 years ago

Up cc @kenjones-cisco , @wing328

rddesmond commented 5 years ago

For my case, I'm returning a geojson file. This means that it is able to be decoded to utf8, but this will fail with the __deserialize_file method since that opens the file for binary writing. My solution was to remove the decode from "rest.moustache" (as was suggested), and move it without modification to after the file write conditionally happens in "api_client.moustache" (which was reasonable for my use case -- it's either a file or json/text.

    def deserialize(self, response, response_type):
        """Deserializes response into an object.

        :param response: RESTResponse object to be deserialized.
        :param response_type: class literal for
            deserialized object, or string of class name.

        :return: deserialized object.
        """
        # handle file downloading
        # save response body into a tmp file and return the instance
        if response_type == "file":
            return self.__deserialize_file(response)

        ########### Moved from rest.mustache
        if six.PY3:
            response.data = response.data.decode('utf8')
        logger.debug("response body: %s", response.data)
        ###########

        # fetch data from response object
        try:
            data = json.loads(response.data)
        except ValueError:
            data = response.data

        return self.__deserialize(data, response_type)

upcFrost commented 4 years ago

@rddesmond thanks for the great workaround!

From what I see the issue is not in skipping some particular Content-Type like in #9980 (which will fail for example for application/pdf, images etc), or having some hardcoded 'file' type which will trigger the different response processing logic.

The real issue is that swagger-codegen drops the format: binary from the model for type: string. If you'll check form parameters processing with debugOperations, you'll see the x-is-binary in vendorExtensions, which can be used in the template. But the response parameter doesn't have it, making it impossible to differentiate between strings and binary data.

sknick commented 3 years ago

Just want to say I've just hit on this issue as well.

IAmWebSA commented 3 years ago

Also hit the issue today

aarighi commented 1 year ago

Still blocking in 2023

aarighi commented 1 year ago

This worked for me:

Location: RESTClientObject.request (line 221)

            # Use try/except to handle issue:
            # https://github.com/swagger-api/swagger-codegen/issues/2305
            try:
                # In the python 3, the response.data is bytes.
                # we need to decode it to string.
                if six.PY3:
                    r.data = r.data.decode('utf8')
            except UnicodeDecodeError:
                content_type = r.getheader("Content-Type", None)
                if 'application/zip' in content_type:
                    pass  # allow zip content to be passed as bytes
                else:
                    raise

This could be modified/expanded to match your content type.

aarighi commented 1 year ago

You can also skip the serialization entirely by passing _preload_content=False in the method's kwargs: you will receive the HTTPResponse object with the bytes instead of the serialized data.

        folder_path = "/path/to/output/folder"

        res = api.my_api_endpoint(my_param, [...], _preload_content=False)

        zip_bytes = io.BytesIO(res.data)
        with zipfile.ZipFile(zip_bytes, "r") as zip_file:
            zip_file.extractall(folder_path)  # works!

Still, it's not an ideal solution and hopefully the issue is patched soon.

smothiki commented 3 months ago

I have just hit the issue with latest swager version. May we know when can we have this fix merged or any help needed to get this thing in?

Lenormju commented 3 weeks ago

The issue has not been fixed since 2016. The most recent release (I'm using Docker swaggerapi/swagger-codegen-cli-v3 version 3.0.61) still has the bug. I have a patch similar to some above, before the return of RESTClientObject.request in the rest.py file :

### START OF MANUAL PATCH
# see https://github.com/swagger-api/swagger-codegen/issues/2305
# In Python3, the `response.data` is bytes, so we need to decode it into str.
if six.PY3:
    try:
        r.data = r.data.decode('utf8')
    except:
        # some requests are not strings so we can't decode them
        pass
### END OF MANUAL PATCH

I have put up a Minimal Reproducible Example :

a demo server (to start separately)
an OpenAPI file to generate the client
a python file that do a request on the server
an unit test file that checks the result of ApiClient.deserialize
a patch.diff to apply the fix
and a script to run all of this in a reproducible manner

➡️ swagger_codegen_issue_2305.zip

I think the problem stems from the way ApiClient.deserialize works (cf my unit test).

Click here for an explanation of the problem with doing `str(some_bytes)` in Python3

Sadly, the `str` type applied on `bytes` returns a string representing the "binary string" of the content. ```python3 >>> str(b"example") # bytes "b'example'" >>> str("example") # Unicode string 'example' ``` Notice the leading `b`, and the inner single quotes (`'`) in the first case. Instead, the `std.decode("some-encoding")` method should be used. ```python >>> b"example".decode("utf-8") 'example' ``` This seems contrived, only because these are all ASCII characters. I can provide you real examples with non-ASCII characters.

But we can't deserialize bytes to str without knowing the encoding. And it is not always known, or true. Per this StackOverflow answer, we should assume UTF-8 for application/json, and RFCs for HTTP give some defaults for MIME types of plain/*, but this is very limited.

So I don't know how to cleanly solve this issue, but I know that returning "b'something'" is incorrect. I have opened other issues for Swagger Codegen, but there seem to be a lack of maintainers (at least for Python), so I don't know if there can be a way forward.

swagger-api / swagger-codegen

[python] Binary string response encoding with Python 3 #2305