tarantool / tarantool-python

Python client library for Tarantool
https://www.tarantool.io
BSD 2-Clause "Simplified" License
100 stars 48 forks source link

Unpack for binary data #105

Closed copyhold closed 2 years ago

copyhold commented 6 years ago

Hi,

My data in tarantool stored as binary strings, they are not UTF in any manner. It's just a sequence of bytes. When trying to select data with this module , unpacker fails with error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/tarantool/space.py", line 75, in select
    return self.connection.select(self.space_no, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tarantool/connection.py", line 775, in select
    response = self._send_request(request)
  File "/usr/local/lib/python3.6/dist-packages/tarantool/connection.py", line 357, in _send_request
    return self._send_request_wo_reconnect(request)
  File "/usr/local/lib/python3.6/dist-packages/tarantool/connection.py", line 264, in _send_request_wo_reconnect
    response = Response(self, self._read_response())
  File "/usr/local/lib/python3.6/dist-packages/tarantool/response.py", line 62, in __init__
    self._body = unpacker.unpack()
  File "msgpack/_unpacker.pyx", line 519, in msgpack._unpacker.Unpacker.unpack
  File "msgpack/_unpacker.pyx", line 499, in msgpack._unpacker.Unpacker._unpack
msgpack.exceptions.UnpackValueError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Can this be resolved somehow?

31hkim commented 4 years ago

Any success with fixing that? Still have that problem with Tarantool 2.2 and connector 0.6.6 even when I don't select binary data from a space containing it

Totktonada commented 4 years ago

It seems, we cannot just set encoding to None when creating a connection, because at least schema reload does not support this.

However we can temporary set encoding to None:

diff --git a/unit/suites/test_dml.py b/unit/suites/test_dml.py
index 31e821e..1264180 100644
--- a/unit/suites/test_dml.py
+++ b/unit/suites/test_dml.py
@@ -70,6 +70,12 @@ class TestSuite_Request(unittest.TestCase):
                     self.con.insert('space_1', [i, i%5, 'tuple_'+str(i)])[0],
                     [i, i%5, 'tuple_'+str(i)]
             )
+        saved_encoding = self.con.encoding
+        self.con.encoding = None
+        t = [1001, 1, b'\xff']
+        self.assertEqual(self.con.insert('space_1', t)[0], t)
+        self.con.encoding = saved_encoding
+
     def test_00_03_answer_repr(self):
         repr_str = """- [1, 1, 'tuple_1']"""
         self.assertEqual(repr(self.con.select('space_1', 1)), repr_str)
@@ -122,6 +128,12 @@ class TestSuite_Request(unittest.TestCase):
             [[200, 0, 'tuple_200'], [205, 0, 'tuple_205']]
         )

+        saved_encoding = self.con.encoding
+        self.con.encoding = None
+        res = self.con.select('space_1', 1001)
+        self.assertSequenceEqual(res, [[1001, 1, b'\xff']])
+        self.con.encoding = saved_encoding
+
     def test_03_delete(self):
         # Check that delete works fine
         self.assertSequenceEqual(self.con.delete('space_1', 20), [[20, 0, 'tuple_20']])

NB: File another issue for Connection(host, port, encoding=None) support.

31hkim commented 4 years ago

Now I'm getting this:

>> print(result)
- None
- b'Type mismatch: can not convert 'REDACTED BINARY STRING' to varbinary'

And honestly I don't understand what encodings have to do when it comes to raw binary data. Like shouldn't it be just directly converted to bytes Python type?

Totktonada commented 4 years ago

Okay, two different problems are discussed here:

  1. A user store binary (non-utf-8 data) in a 'string' field and don't want data being encoded or decoded when working with this field. May be worked around as shown above.
  2. A user want to send / receive mp_bin to work with a 'varbinary' field. Everything looks okay on decoding side (mp_bin is decoded to bytes), but a bytes value is encoded as mp_str. Maybe we can change it on Python 3 (RFC patch).
31hkim commented 4 years ago

Ah, sorry for confusion, and thank you. Will wait for a complete fix

31hkim commented 4 years ago

Any success?

Totktonada commented 4 years ago

We're at planning of Q3 tasks and I hope this one will be included here. I'll specifically mark it as important.

Totktonada commented 4 years ago

NB: Propose to msgpack-python to provide helpers like msgpack.as_bin() and msgpack.as_str() to encode a particular structure (dict, list, tuple) member as mp_bin or mp_str disregarding use_bin_type option. It may be requirement of an msgpack-based protocol to have a string here and a binary value there (exactly our case).

Usage example:

import msgpack
msgpack.dumps(dict(foo='string', bar=msgpack.as_bin('binary')))
Totktonada commented 3 years ago

I think regarding this once again and it seems that encoding of bytes as mp_bin would be good idea in context of Python 3. We should not break compatibility, though. Changing defaults is nightmare. Let's give a meaningful error message instead, that would direct a user to use appropriate option.

NB: Think regarding a patch of this kind:

diff --git a/tarantool/request.py b/tarantool/request.py
index d1a5a82..c54a47b 100644
--- a/tarantool/request.py
+++ b/tarantool/request.py
@@ -79,26 +79,11 @@ class Request(object):

         packer_kwargs = dict()

-        # use_bin_type=True is default since msgpack-1.0.0.
-        #
-        # The option controls whether to pack binary (non-unicode)
-        # string values as mp_bin or as mp_str.
-        #
-        # The default behaviour of the connector is to pack both
-        # bytes and Unicode strings as mp_str.
-        #
-        # msgpack-0.5.0 (and only this version) warns when the
-        # option is unset:
-        #
-        #  | FutureWarning: use_bin_type option is not specified.
-        #  | Default value of the option will be changed in future
-        #  | version.
-        #
-        # The option is supported since msgpack-0.4.0, so we can
-        # just always set it for all msgpack versions to get rid
-        # of the warning on msgpack-0.5.0 and to keep our
-        # behaviour on msgpack-1.0.0.
-        packer_kwargs['use_bin_type'] = False
+        # XXX: Write an explanation here.
+        # XXX: Pass it from a connection constructor args, but keep
+        # use_bin_type=False behaviour for Python 2 (because string
+        # literals are bytes).
+        packer_kwargs['use_bin_type'] = True

         self.packer = msgpack.Packer(**packer_kwargs)
ObjatieGroba commented 3 years ago

There are no need to send extra arguments (as raw = True) into msgpack.

data = ['a', b'b'] assert msgpack.unpackb(msgpack.packb(data)) == data

Major diff at 1.0.0 version is encoding option is removed. UTF-8 is used always.

That is why it is better not to pass any arguments into Decoder if encoding is 'utf-8' and version of msgpack greater than 1.0.0

serge-name commented 3 years ago

Hello, any progress on this issue?

I would suggest to bump msgpack to 1.0.0 and to remove that use_bin_type=False line. Otherwise we have no chance to store binary data.

Current behavior for tarantool-python 0.7.1:

@>>> c.insert("some", ("foo", pickle.dumps(42)))
Traceback (most recent call last):
…
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
@>>> c.delete("some", "foo")
Traceback (most recent call last):
…
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
@>>> c.delete("some", "foo")      # already deleted

@>>>

The fix makes everything working:

@>>> c.insert("some", ("foo", pickle.dumps(42)))
- ['foo', b'\x80\x04K*.']
@>>> c.delete("some", "foo")
- ['foo', b'\x80\x04K*.']
Totktonada commented 3 years ago

Sorry, no progress yet. I need to dive into the topic again.

serge-name commented 3 years ago

After the posting that PR, I realised that it is incomplete. I found several checks for different msgpack version in the tarantool-python code. I suggest two solutions:

  1. add another version check to set use_bin_type=False only if msgpack is older than 1.0.0, just only one condition line.
  2. transition to msgpack>=1.0.0 and remove all obsolete version checks.

I'm ready to send either PR, just tell me what you prefer :)

Yesterday I tried to use Tarantool in my company's projects and right now I have to use a private fork of python-tarantool with my fix applied. Our projects need to store pickled python objects in Tarantool.

I hope that solution will emerge very soon.