scrapinghub / frontera

A scalable frontier for web crawlers
BSD 3-Clause "New" or "Revised" License

Serialize meta/headers/cookies using widely used format #82

Open sibiryakov opened 8 years ago

sibiryakov commented 8 years ago

Currently the SQLiteBackend uses pickle, which limits use of the stored data to Python.

vedantrathore commented 7 years ago

@sibiryakov I was interested in solving this, any hints?

sibiryakov commented 7 years ago

In HBaseBackend and the message bus protocol we use MsgPack, and I would recommend moving in that direction.
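
For readers unfamiliar with MsgPack, here is a minimal round-trip sketch using the `msgpack` Python package. The `record` dictionary and its field values are made up for illustration and are not taken from the Frontera code base.

```python
import msgpack

# Hypothetical sample of the kind of data the issue is about.
record = {
    "meta": {"depth": 1, "origin": "seed"},
    "headers": {"Content-Type": "text/html"},
    "cookies": {"session": "abc123"},
}

# use_bin_type=True keeps bytes and str distinguishable in the output.
packed = msgpack.packb(record, use_bin_type=True)

# raw=False decodes msgpack strings back into Python str objects.
restored = msgpack.unpackb(packed, raw=False)
```

The packed value is a plain byte string, so it can be read by any MsgPack implementation, not only by Python.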

vedantrathore commented 7 years ago

@sibiryakov So do I just have to replace PickleType() with msgpack.pack() here?

sibiryakov commented 7 years ago

@vedantrathore I don't believe that would work. You can check this by running the tests.

vedantrathore commented 7 years ago

@sibiryakov could you please tell me how to tackle this? I read the SQLAlchemy and msgpack documentation but couldn't figure out how to convert the serializer.

sibiryakov commented 7 years ago

Try looking into the implementation of PickleType in SQLAlchemy; it uses Python's native dump/load methods. So you would need to create a similar interface.
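
One way to read "a similar interface" is an object exposing pickle-style `dumps` and `loads` backed by msgpack. The sketch below is only an illustration: `MsgpackPickler` is a hypothetical name, and the `protocol` argument exists only so the signature matches `pickle.dumps`.

```python
import msgpack


class MsgpackPickler:
    """Hypothetical adapter: pickle-style dumps/loads backed by msgpack."""

    @staticmethod
    def dumps(obj, protocol=None):
        # protocol is accepted for signature compatibility and ignored.
        return msgpack.packb(obj, use_bin_type=True)

    @staticmethod
    def loads(data):
        # raw=False returns str rather than bytes for msgpack strings.
        return msgpack.unpackb(data, raw=False)


payload = MsgpackPickler.dumps({"depth": 3, "tags": ["a", "b"]}, 2)
roundtrip = MsgpackPickler.loads(payload)
```

SQLAlchemy's PickleType accepts a `pickler` argument for objects with pickle-compatible `dumps` and `loads`, so an adapter like this could probably be passed as `PickleType(pickler=MsgpackPickler)`. That combination is untested here and should be checked against the test suite.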

A.

On 24 Feb 2017, at 13:15, Vedant Rathore notifications@github.com wrote:

@sibiryakov could you please tell me how to tackle this? I read the SQLAlchemy and msgpack documentation but couldn't figure out how to convert the serializer.

voith commented 7 years ago

Hi @vedantrathore, have a look at this code in SQLAlchemy to see how PickleType is implemented. Try to understand how it works, then implement something similar for msgpack. I did something similar for Cassandra, where I implemented a PickleType column. You can view the code here.

voith commented 7 years ago

I had a similar use case today. This is what I came up with. I haven't fully tested this. But IMO, this is what is needed.

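from msgpack import packb, unpackb
from sqlalchemy import LargeBinary
from sqlalchemy.types import TypeDecorator


class MsgpackType(TypeDecorator):
    """Stores Python objects as msgpack-encoded binary, modelled on PickleType."""
    impl = LargeBinary
    cache_ok = True  # avoids the caching warning on SQLAlchemy 1.4+

    def bind_processor(self, dialect):
        # Serialize on the way in, then hand off to the LargeBinary processor.
        impl_processor = self.impl.bind_processor(dialect)
        if impl_processor:
            def process(value):
                if value is not None:  # keep NULL as NULL, as PickleType does
                    value = packb(value, use_bin_type=True)
                return impl_processor(value)
        else:
            def process(value):
                if value is not None:
                    value = packb(value, use_bin_type=True)
                return value
        return process

    def result_processor(self, dialect, coltype):
        # Let LargeBinary convert the raw value first, then deserialize.
        impl_processor = self.impl.result_processor(dialect, coltype)
        if impl_processor:
            def process(value):
                value = impl_processor(value)
                if value is None:
                    return None
                # raw=False replaces encoding='utf-8', removed in msgpack 1.0
                return unpackb(value, raw=False)
        else:
            def process(value):
                if value is None:
                    return None
                return unpackb(value, raw=False)
        return process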

Sadly, I don't have the time to make a PR. @vedantrathore, maybe you can take it from here.
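
For anyone picking this up, the sketch below shows one way such a column type might be exercised end to end against an in-memory SQLite database. It is not the snippet above: it swaps in the simpler `process_bind_param` / `process_result_value` hooks of `TypeDecorator`, and it assumes SQLAlchemy 1.4 or newer and msgpack 1.0 or newer. `MsgpackColumn`, `Page` and the `pages` table are hypothetical names used only for this example.

```python
import msgpack
from sqlalchemy import Column, Integer, LargeBinary, create_engine
from sqlalchemy.orm import Session, declarative_base
from sqlalchemy.types import TypeDecorator


class MsgpackColumn(TypeDecorator):
    """Hypothetical column type storing values as msgpack bytes."""
    impl = LargeBinary
    cache_ok = True

    def process_bind_param(self, value, dialect):
        # Called before the value is sent to the database.
        if value is None:
            return None
        return msgpack.packb(value, use_bin_type=True)

    def process_result_value(self, value, dialect):
        # Called on each value read back from the database.
        if value is None:
            return None
        return msgpack.unpackb(value, raw=False)


Base = declarative_base()


class Page(Base):  # hypothetical table, not part of Frontera
    __tablename__ = "pages"
    id = Column(Integer, primary_key=True)
    meta = Column(MsgpackColumn)


engine = create_engine("sqlite://")  # in-memory database
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Page(id=1, meta={"depth": 2, "tags": ["a", "b"]}))
    session.add(Page(id=2, meta=None))
    session.commit()

with Session(engine) as session:
    loaded = session.get(Page, 1).meta
    empty = session.get(Page, 2).meta
```

The second row checks that a NULL column comes back as `None` instead of being passed to `unpackb`, which would raise an error.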