istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License

Why encode the body as bytes? #229

Closed: YanzhongSu closed this issue 4 years ago

YanzhongSu commented 4 years ago

In pipelines.py, why does the body first need to be converted to bytes and then base64-encoded?

Can we not store the body (by default it is a str) directly? What happens if we just leave it as it is? My understanding is that if we transmit the body itself, the data might be corrupted during transmission.

if self.use_base64:
    datum['body'] = base64.b64encode(bytes(datum['body'], 'utf-8'))
    message = ujson.dumps(datum, sort_keys=True)

madisonb commented 4 years ago

The Python 3 docs say the following: https://docs.python.org/3/library/base64.html#base64.b64encode

base64.b64encode(s, altchars=None) Encode the bytes-like object s using Base64 and return the encoded bytes.

A bytes-like object is defined in the glossary: https://docs.python.org/3/glossary.html#term-bytes-like-object

bytes-like object An object that supports the Buffer Protocol and can export a C-contiguous buffer. This includes all bytes, bytearray, and array.array objects, as well as many common memoryview objects.
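To make the two doc excerpts above concrete, here is a minimal sketch of what base64.b64encode accepts in Python 3. The payload value and the helper name try_encode_str are made up for illustration and are not part of scrapy-cluster.

```python
import array
import base64

payload = b"hello"  # illustrative value, not taken from the project

# Every bytes-like object exposes the same underlying buffer,
# so each of these produces the same Base64 output.
candidates = [
    payload,
    bytearray(payload),
    memoryview(payload),
    array.array("B", payload),
]
encoded = [base64.b64encode(c) for c in candidates]


def try_encode_str(text):
    """Hypothetical helper: report what happens when a str is passed directly."""
    try:
        return base64.b64encode(text)
    except TypeError as exc:
        return "TypeError: {}".format(exc)


str_result = try_encode_str("hello")
print(encoded[0])   # b'aGVsbG8='
print(str_result)   # a TypeError message, since str is not bytes-like
```

In other words, a str has to be turned into bytes first, and that step requires choosing a text encoding such as UTF-8.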

I think the docs make it clear why we encode the value into bytes before passing it into the function: b64encode only accepts bytes-like objects, and a Python 3 str is not one. This may also be a holdover from Python 2 and Python 3 string compatibility, where it is just easier to say "everything is always bytes."

If this answers your question, please close the ticket.
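As a rough sketch of the whole round trip (not the project's actual pipeline code), the example below encodes a str body to UTF-8 bytes, Base64-encodes it, and recovers the original text on the consumer side. It assumes the standard-library json module in place of ujson, and it decodes the Base64 result to an ASCII str because json.dumps cannot serialize bytes. The function names and the sample datum are hypothetical.

```python
import base64
import json


def encode_body(body):
    """str -> UTF-8 bytes -> Base64 bytes -> ASCII str (safe to embed in JSON)."""
    raw = body.encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def decode_body(encoded):
    """Reverse of encode_body: Base64 text -> UTF-8 bytes -> original str."""
    return base64.b64decode(encoded).decode("utf-8")


# Hypothetical crawl result containing non-ASCII characters.
original_body = "<html>caf\u00e9 \u2603</html>"
datum = {"url": "http://example.com", "body": original_body}

# Producer side: replace the body with its Base64 form, then serialize.
datum["body"] = encode_body(datum["body"])
message = json.dumps(datum, sort_keys=True)

# Consumer side: parse the message and undo the encoding.
restored = decode_body(json.loads(message)["body"])
print(restored == original_body)
```

The Base64 alphabet is plain ASCII, so the encoded body survives any transport or parser that might otherwise mishandle unusual characters, at the cost of roughly a one-third size increase.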

YanzhongSu commented 4 years ago

@madisonb Thank you for your answer.