Open roshanjrajan-zip opened 1 year ago
@roshanjrajan-zip, Thanks for spotting the issue. Your proposed solution is appreciated – please feel free to contribute!
@saimedhi Which is the preferred answer here? To only return a str or to support bytes?
@dblock, @wbeckler Please take a look at this issue.
So bulk
expects a line-delimited JSON and the rest of the APIs expect proper JSON. Both are strings, aren't they? Isn't the custom serializer the issue, shouldn't it be producing strings in all cases (like the .decode("utf-8")
you've added)? What would be a compelling reason to change the default to pass around raw bytes?
I don't think there is necessary a compelling reason to support bytes, just that the code is confusing in that if bytes were passed to the serializer, then it would return bytes immediately without even using the serializer. The bug is more about the typing check being inconsistent with the use case in the bulk api. I think a good answer could be replacing the instance check isinstance(data, string_types)
with isinstance(data, str)
in the built-in JsonSerializer to make it clear the expected type.
Ok, I think I understand. I agree on isinstance(data, str)
, want to give it a try?
Yeah we can do this! But is string_types
supposed to include bytes? Should it just be str?
I think you know as much as I (we) do! ;)
Well it looks like the elasticsearch team supports bytes only for serializers - https://github.com/elastic/elastic-transport-python/blob/main/elastic_transport/_serializer.py so not sure how much you are keeping parity on the interfaces.
@roshanjrajan-zip we aren't following changes in ES clients, but since those are APLv2 please feel free to port that code in a PR here (make sure to keep their license in the PR)
What is the bug?
I wanted to use an alternative serializer for performance. Looking at the original JsonSerializer, if the type is a
string_types
which is either str or bytes, it will return the data without using the serializer. So assumably any serializer that returns str or bytes is also valid. However, the following serializer using orjson returns bytes and bulk API only accepts strings as it tries and encode the data using utf8.How can one reproduce the bug?
Return bytes from the Json Serializer instead of a string.
What is the expected behavior?
I'm not sure what the correct answer is but it should either only accepts strings or have a separate code path if the type is bytes
What is your host/environment?
Debian 5.10.191-1
Do you have any screenshots?
Do you have any additional context?
Add any other context about the problem.