meganz / sdk

MEGA C++ SDK
BSD 2-Clause "Simplified" License

Python: MegaApi.startStreaming() only gives me a fraction of the data #2612

Open Eboreg opened 2 years ago

Eboreg commented 2 years ago

I am trying to use MegaApi.startStreaming() via the Python bindings, but my MegaTransferListener.onTransferUpdate() reports huge differences between the values returned by MegaTransfer.getDeltaSize() and the lengths of the byte arrays I actually get from MegaTransfer.getLastBytes(), and so only a fraction of the file is actually received.

My debug listener:

class TransferListener(MegaTransferListener):
    def __init__(self):
        self.buffers = []
        self.size = 0
        super().__init__()

    def onTransferStart(self, api: "MegaApi", transfer: "MegaTransfer"):
        logger.info(f"onTransferStart: transfer={transfer}")

    def onTransferUpdate(self, api: "MegaApi", transfer: "MegaTransfer"):
        buffer = transfer.getLastBytes().encode("utf-8", errors="surrogateescape")
        size = transfer.getDeltaSize()
        self.buffers.append(buffer)
        self.size += size
        logger.info(
            f"onTransferUpdate: getTotalBytes()={transfer.getTotalBytes()}, "
            f"getTransferredBytes()={transfer.getTransferredBytes()}, "
            f"getDeltaSize()={size}, "
            f"getLastBytes() length={len(buffer)}"
        )

    def onTransferFinish(self, api: "MegaApi", transfer: "MegaTransfer", error: "MegaError"):
        buffers_size = sum([len(b) for b in self.buffers])
        logger.info(
            f"onTransferFinish: transfer={transfer}, "
            f"error={error}, "
            f"reported size from accumulated getDeltaSize()={self.size}, "
            f"actual total size of received data={buffers_size}"
        )

    def onTransferTemporaryError(self, api: "MegaApi", transfer: "MegaTransfer", error: "MegaError"):
        logger.error(f"onTransferTemporaryError: transfer={transfer}, error={error}")

    def onTransferData(self, api: "MegaApi", transfer: "MegaTransfer", buffer: str, size: int) -> bool:
        return True

I got the .encode() thing I do on the received string from the SWIG docs, so I guess it's the correct way to do it?
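One way to rule out the encoding step on its own is to check that it round-trips arbitrary bytes. The following is a minimal sketch, assuming the bindings turn char * results into str by decoding as UTF-8 with the surrogateescape error handler. It uses no MEGA code at all; `raw` is just made-up sample data.

```python
# Simulate the str conversion the bindings are assumed to perform
# (UTF-8 with surrogateescape), then undo it the way the listener does.
raw = bytes(range(256))  # every byte value, including NUL and invalid UTF-8

as_str = raw.decode("utf-8", errors="surrogateescape")
restored = as_str.encode("utf-8", errors="surrogateescape")

# True means the decode/encode pair is lossless for this input.
print(restored == raw)
```

If this prints True, the .encode() call is not what loses data; any loss would have to happen before the str object is created.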

I also tried handling the returned data in onTransferData() instead, but it just had the exact same result.

I am testing this out by using a MegaNode belonging to a known file, and sending it to startStreaming() like so:

node = api.getNodeByHandle(150729868582434)
size = api.getSize(node)
# size = 29458186, which is consistent with the size of the actual file
transfer_listener = TransferListener()
api.startStreaming(node, 0, size, transfer_listener)

However, this is some of what the listener above logs:

onTransferStart: transfer=DOWNLOAD
onTransferUpdate: getTotalBytes()=29458186, getTransferredBytes()=28960, getDeltaSize()=28960, getLastBytes() length=0
onTransferUpdate: getTotalBytes()=29458186, getTransferredBytes()=36160, getDeltaSize()=7200, getLastBytes() length=0
onTransferUpdate: getTotalBytes()=29458186, getTransferredBytes()=50640, getDeltaSize()=14480, getLastBytes() length=10
onTransferUpdate: getTotalBytes()=29458186, getTransferredBytes()=57920, getDeltaSize()=7280, getLastBytes() length=38
onTransferUpdate: getTotalBytes()=29458186, getTransferredBytes()=65120, getDeltaSize()=7200, getLastBytes() length=82
onTransferUpdate: getTotalBytes()=29458186, getTransferredBytes()=72400, getDeltaSize()=7280, getLastBytes() length=192
onTransferUpdate: getTotalBytes()=29458186, getTransferredBytes()=79600, getDeltaSize()=7200, getLastBytes() length=94
[... lots of lines cut ...]
onTransferUpdate: getTotalBytes()=29458186, getTransferredBytes()=29407200, getDeltaSize()=7200, getLastBytes() length=445
onTransferUpdate: getTotalBytes()=29458186, getTransferredBytes()=29414480, getDeltaSize()=7280, getLastBytes() length=427
onTransferUpdate: getTotalBytes()=29458186, getTransferredBytes()=29436160, getDeltaSize()=21680, getLastBytes() length=785
onTransferUpdate: getTotalBytes()=29458186, getTransferredBytes()=29443440, getDeltaSize()=7280, getLastBytes() length=68
onTransferUpdate: getTotalBytes()=29458186, getTransferredBytes()=29450640, getDeltaSize()=7200, getLastBytes() length=409
onTransferUpdate: getTotalBytes()=29458186, getTransferredBytes()=29458186, getDeltaSize()=7546, getLastBytes() length=116
Listener.onTransferFinish: transfer=DOWNLOAD, error=No error, continue_event=True
onTransferFinish: transfer=DOWNLOAD, error=No error, reported size from accumulated getDeltaSize()=29458186, actual total size of received data=770162

So as you see, getLastBytes() consistently returns a much smaller amount of data than what getDeltaSize() reports. And I know for a fact that the actual file is 29458186 bytes.

I compiled with the following configure arguments (but I have tried various other combinations as well):

--disable-silent-rules --enable-python --with-python3 --disable-examples --enable-debug --enable-doxygen-html

Is there something obvious I'm missing here? I really hope someone can help me out a little.

Eboreg commented 2 years ago

Update: I find it works fine for text files for some reason, but only if I take the strings returned by MegaTransfer.getLastBytes(), convert them to bytes, and then crop them to the length given by MegaTransfer.getDeltaSize(). Like so:

buffer_str = transfer.getLastBytes()
delta_size = transfer.getDeltaSize()
buffer_bin = buffer_str.encode("utf-8", errors="surrogateescape")[:delta_size]
self.buffers_bin.append(buffer_bin)

I can then do b"".join(transfer_listener.buffers_bin).decode(), which gives me an exact copy of the original text.

Why it fails so miserably for binary files, though, is still a mystery to me.

Eboreg commented 2 years ago

I would like to try building the Python bindings with SWIG_PYTHON_STRICT_BYTE_CHAR, as per the SWIG documentation: http://swig.org/Doc4.0/Python.html#Python_nn77

How to do that is unfortunately beyond my competence at the moment.

Eboreg commented 2 years ago

Been doing a little more debugging, and it seems that whatever generates the return value of MegaTransfer.getLastBytes() stops as soon as it encounters a null character.

E.g. if I do api.startStreaming(node, 0, 1000, transfer_listener), and there is a null at position 10 in the file, I only get characters 0 through 9 in return, even if character 11 is non-null.

I guess this makes sense, as SWIG assumes that a returned char * value is a null-terminated string (source). But that's not really helpful in this case.
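To illustrate the effect described above, here is a small pure-Python simulation of a strlen()-style conversion. It is only a sketch: `strlen_prefix` is a hypothetical helper, not part of the SDK or of SWIG, and the sample chunks are made up for the example.

```python
def strlen_prefix(chunk: bytes) -> bytes:
    """Return the bytes a strlen()-based conversion would keep:
    everything before the first NUL byte."""
    end = chunk.find(b"\x00")
    return chunk if end == -1 else chunk[:end]


# Text rarely contains NUL bytes, so it passes through untouched.
text_chunk = "plain text, no NUL bytes\n".encode("utf-8")

# Binary formats usually contain NULs early on (here: the start of a PNG file).
binary_chunk = b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"

print(len(text_chunk), len(strlen_prefix(text_chunk)))
print(len(binary_chunk), len(strlen_prefix(binary_chunk)))
```

Note that the simulation only covers chunks that do contain a NUL. If the real buffer has no terminator, strlen() would presumably keep reading past the end of the chunk, which would be consistent with the text chunks needing to be cropped to getDeltaSize().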

Eboreg commented 2 years ago

I managed to build the SDK with #define SWIG_PYTHON_STRICT_BYTE_CHAR. Everything is indeed bytes instead of str now, but unfortunately that didn't solve anything. The returned values still stop at the first null character.

Am I configuring the build wrong? Or is startStreaming() simply not meant to be used for binary files?

From the generated bindings/python/megaapi_wrap.cpp:

SWIGINTERN PyObject *_wrap_MegaTransfer_getLastBytes(PyObject *SWIGUNUSEDPARM(self), PyObject *args) {
  PyObject *resultobj = 0;
  mega::MegaTransfer *arg1 = (mega::MegaTransfer *) 0 ;
  void *argp1 = 0 ;
  int res1 = 0 ;
  PyObject *swig_obj[1] ;
  char *result = 0 ;

  if (!args) SWIG_fail;
  swig_obj[0] = args;
  res1 = SWIG_ConvertPtr(swig_obj[0], &argp1,SWIGTYPE_p_mega__MegaTransfer, 0 |  0 );
  if (!SWIG_IsOK(res1)) {
    SWIG_exception_fail(SWIG_ArgError(res1), "in method '" "MegaTransfer_getLastBytes" "', argument " "1"" of type '" "mega::MegaTransfer const *""'"); 
  }
  arg1 = reinterpret_cast< mega::MegaTransfer * >(argp1);
  {
    SWIG_PYTHON_THREAD_BEGIN_ALLOW;
    result = (char *)((mega::MegaTransfer const *)arg1)->getLastBytes();
    SWIG_PYTHON_THREAD_END_ALLOW;
  }
  resultobj = SWIG_FromCharPtr((const char *)result);
  return resultobj;
fail:
  return NULL;
}

SWIGINTERNINLINE PyObject * 
SWIG_FromCharPtr(const char *cptr)
{ 
  return SWIG_FromCharPtrAndSize(cptr, (cptr ? strlen(cptr) : 0));
}

SWIGINTERNINLINE PyObject *
SWIG_FromCharPtrAndSize(const char* carray, size_t size)
{
  if (carray) {
    if (size > INT_MAX) {
      swig_type_info* pchar_descriptor = SWIG_pchar_descriptor();
      return pchar_descriptor ? 
    SWIG_InternalNewPointerObj(const_cast< char * >(carray), pchar_descriptor, 0) : SWIG_Py_Void();
    } else {
#if PY_VERSION_HEX >= 0x03000000
#if defined(SWIG_PYTHON_STRICT_BYTE_CHAR)
      return PyBytes_FromStringAndSize(carray, static_cast< Py_ssize_t >(size));
#else
      return PyUnicode_DecodeUTF8(carray, static_cast< Py_ssize_t >(size), "surrogateescape");
#endif
#else
      return PyString_FromStringAndSize(carray, static_cast< Py_ssize_t >(size));
#endif
    }
  } else {
    return SWIG_Py_Void();
  }
}

I notice SWIG_FromCharPtrAndSize() is called with a size argument generated by strlen(). And that function of course assumes it's dealing with a null-terminated string. The question is, could and should I do anything differently in order to avoid this? It seems to me like the reasonable thing would be for _wrap_MegaTransfer_getLastBytes() to call SWIG_FromCharPtrAndSize() directly, using the same size as reported by getDeltaSize().
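The difference between the two approaches can be reproduced from Python with ctypes, without touching the SDK. This is only an analogy under stated assumptions: `ctypes.string_at()` without a size behaves like the strlen()-based path, and with an explicit size it behaves like a call that passes the length along. The `payload` value is a made-up stand-in for one streamed chunk.

```python
import ctypes

payload = b"abc\x00def\x00\x00ghi"  # stand-in for one chunk of binary data
buf = ctypes.create_string_buffer(payload)  # C buffer holding the payload
address = ctypes.addressof(buf)             # raw address, like a C char *

# Size omitted: the memory is read as a NUL-terminated C string.
by_strlen = ctypes.string_at(address)

# Explicit size: every byte is copied, embedded NULs included.
by_size = ctypes.string_at(address, len(payload))

print(len(by_strlen), len(by_size))
```

In the real wrapper, the explicit size would have to come from somewhere reliable, such as the value getDeltaSize() reports; whether that is always the right length for the buffer returned by getLastBytes() is an assumption that would need checking against the SDK source.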

jorgeajimenezl commented 2 years ago

I had this problem recently. My solution is based on some changes to the megaapi_wrap.cpp file; I'm leaving you the patch that I applied to version 3.12.0. You can apply it using patch megaapi_wrap.cpp megaapi_wrap.txt (attachment: megaapi_wrap.txt)

Eboreg commented 2 years ago

@jorgeajimenezl Thanks! I was thinking along the same lines myself. Manually patching an auto-generated file is of course not the optimal solution, but it's better than nothing. :)

ShareefshaF commented 2 years ago

Thanks dude


— Reply to this email directly, view it on GitHub https://github.com/meganz/sdk/issues/2612#issuecomment-1135054269, or unsubscribe https://github.com/notifications/unsubscribe-auth/AWAVHGUAQFPVZJD6JXOG4STVLPLOTANCNFSM5P7BXY2Q . You are receiving this because you are subscribed to this thread.Message ID: @.***>