opensearch-project / opensearch-py

Python Client for OpenSearch
https://opensearch.org/docs/latest/clients/python/
Apache License 2.0

[FEATURE] Add support for byte data in bulk serializer #488

Open roshanjrajan-zip opened 1 year ago

roshanjrajan-zip commented 1 year ago

What is the bug?

I wanted to use an alternative serializer for performance. Looking at the original JSONSerializer, if the data is one of string_types (either str or bytes), it is returned as-is without being serialized. So presumably any serializer that returns str or bytes should also be valid. However, the following serializer using orjson returns bytes, and the bulk API only accepts strings, since it tries to encode the data as UTF-8.

import orjson
from typing import Any

from opensearchpy.compat import string_types
from opensearchpy.exceptions import SerializationError
from opensearchpy.serializer import JSONSerializer


class OrjsonSerializer(JSONSerializer):
    def dumps(self, data: Any):
        if isinstance(data, string_types):
            return data

        try:
            # The .decode("utf-8") was added to convert bytes to string
            return orjson.dumps(data, default=self.default).decode("utf-8")
        except Exception as exc:
            raise SerializationError(data, exc) from exc

    def loads(self, s: Any):
        try:
            return orjson.loads(s)
        except Exception as exc:
            raise SerializationError(s, exc) from exc

How can one reproduce the bug?

Return bytes from the JSONSerializer instead of a string.
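A minimal, dependency-free sketch of the failure mode, using the stdlib json module in place of orjson (bulk_body here is a simplified stand-in for the client's body construction, not its actual code):

```python
import json

def bulk_body(dumps, actions):
    # Simplified stand-in for how the bulk helper joins serialized
    # actions into a newline-delimited string body.
    return "\n".join(dumps(a) for a in actions) + "\n"

def dumps_str(data):
    return json.dumps(data)

def dumps_bytes(data):
    # orjson.dumps returns bytes; json.dumps().encode() mimics that.
    return json.dumps(data).encode("utf-8")

actions = [{"index": {"_index": "test"}}, {"field": 1}]

body = bulk_body(dumps_str, actions)  # works: str chunks join cleanly
try:
    bulk_body(dumps_bytes, actions)
except TypeError as exc:
    # str.join refuses bytes chunks, so a bytes-returning
    # serializer breaks the bulk path
    print("bytes serializer fails:", exc)
```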

What is the expected behavior?

I'm not sure what the correct answer is, but the client should either accept only strings or have a separate code path for when the type is bytes.

What is your host/environment?

Debian 5.10.191-1



saimedhi commented 1 year ago

@roshanjrajan-zip, thanks for spotting the issue. Your proposed solution is appreciated; please feel free to contribute!

roshanjrajan-zip commented 1 year ago

@saimedhi Which is the preferred answer here? To only return a str or to support bytes?

saimedhi commented 1 year ago

@dblock, @wbeckler Please take a look at this issue.

dblock commented 1 year ago

So bulk expects a line-delimited JSON and the rest of the APIs expect proper JSON. Both are strings, aren't they? Isn't the custom serializer the issue, shouldn't it be producing strings in all cases (like the .decode("utf-8") you've added)? What would be a compelling reason to change the default to pass around raw bytes?
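For reference, the bulk body is newline-delimited JSON (NDJSON): an action-metadata line, then an optional source line, per operation, terminated by a trailing newline. A minimal illustration (the index and documents are made up):

```python
import json

pairs = [
    ({"index": {"_index": "movies", "_id": "1"}}, {"title": "Dune"}),
    ({"index": {"_index": "movies", "_id": "2"}}, {"title": "Arrival"}),
]

# One JSON object per line: metadata, then source, for each operation.
body = "".join(
    json.dumps(action) + "\n" + json.dumps(doc) + "\n"
    for action, doc in pairs
)
print(body)
```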

roshanjrajan-zip commented 1 year ago

I don't think there is necessarily a compelling reason to support bytes; it's just that the code is confusing in that if bytes were passed to the serializer, they would be returned immediately without ever being serialized. The bug is more about the type check being inconsistent with its use in the bulk API. I think a good fix could be replacing the check isinstance(data, string_types) with isinstance(data, str) in the built-in JSONSerializer to make the expected type clear.
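A sketch of the suggested change, applied to a simplified stand-in for the built-in JSONSerializer (stdlib json instead of the client's class, and without the SerializationError wrapping):

```python
import json
from typing import Any

class JSONSerializerSketch:
    def default(self, data: Any) -> Any:
        raise TypeError(f"Unable to serialize {data!r} (type: {type(data)})")

    def dumps(self, data: Any) -> str:
        # was: isinstance(data, string_types), which also let bytes
        # through unserialized; restricting to str makes the contract clear
        if isinstance(data, str):
            return data
        return json.dumps(data, default=self.default)
```

With this check, bytes passed in would go through serialization (and be rejected by default) like any other unsupported object, instead of being passed through untouched.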

dblock commented 1 year ago

Ok, I think I understand. I agree on isinstance(data, str), want to give it a try?

roshanjrajan-zip commented 1 year ago

Yeah we can do this! But is string_types supposed to include bytes? Should it just be str?

dblock commented 1 year ago

I think you know as much as I (we) do! ;)

roshanjrajan-zip commented 1 year ago

Well, it looks like the Elasticsearch team does support bytes from serializers (https://github.com/elastic/elastic-transport-python/blob/main/elastic_transport/_serializer.py), so I'm not sure how closely you are keeping parity on the interfaces.
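If parity with that interface were wanted, the other direction is for the transport to accept either str or bytes from the serializer and normalize before joining. A hypothetical sketch of the idea (not the elastic-transport code; bulk_body_bytes is an invented stand-in for the client's body construction):

```python
import json
from typing import Any, Iterable, Union

def to_bytes(chunk: Union[str, bytes]) -> bytes:
    # Normalize serializer output so both str- and bytes-returning
    # serializers are accepted on the bulk path.
    return chunk if isinstance(chunk, bytes) else chunk.encode("utf-8")

def bulk_body_bytes(dumps, actions: Iterable[Any]) -> bytes:
    # Join in bytes space, so a bytes-returning serializer (like
    # orjson.dumps) needs no intermediate decode step.
    return b"\n".join(to_bytes(dumps(a)) for a in actions) + b"\n"

body = bulk_body_bytes(
    lambda d: json.dumps(d).encode("utf-8"),  # mimics orjson returning bytes
    [{"index": {}}, {"field": 1}],
)
```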

dblock commented 10 months ago

@roshanjrajan-zip we aren't following changes in the ES clients, but since those are APLv2, please feel free to port that code in a PR here (make sure to keep their license in the PR).