RFC celery: msgpack could support custom types

dset0x commented 9 years ago

Intro

In Invenio we use Celery, which uses kombu to serialize messages. kombu supports a number of backends, of which we use msgpack.

Issue

msgpack supports many native types (strings, lists, etc), but has no support for more complex types such as datetime and enum, or, naturally, any custom types we define in our code. This is an issue because business logic should not contain serialization code.

Solution

Fortunately msgpack us to extend its serialization functionality, by providing 128 "slots" for application-specific types. (spec)

Proposal

With a bit of glue (WIP PoC PR) and perhaps the assistance of flask-registry, modules would be able to define type matchers as well as serializers and deserializers for the types that they will want to use, even if they are specific to themselves.

cc. @jalavik @jirikuncar @hachreak

jirikuncar commented 9 years ago

cc @crepererum

crepererum commented 9 years ago

I did that for another project before and it worked fine. There are some things to keep in mind when doing this:

Performance: Avoid nested/hijacked serialization methods. Bad example: Serialize your custom type to a Python dict and that one to JSON and that one to a string (which then is msgpack compatible).
Security: whatever you do, do NOT use (c)Pickle or marshal to serialize your custom type. Please also try to avoid pulling more C libraries as good as you can, because they are one of the most targeted parts when it comes to serialization exploits. Do not load or execute modules or methods just because a message tells you to do so (that's basically why cPickle is insecure).
Stability: There is no way to figure out which module version submitted a specific message part to the broker and I hope that nobody will attach all that metadata to every single message. If you introduce a new type to the serializer, make sure it is stable enough to survive the next 10 year.
Usage: There are only 128 slots, so use this workaround rarely. If you can, put a warning if we exceed 64 slots. Whenever you can, use a JSON-like structure for your object (i.e. dicts+lists+strings+numbers+booleans+...) which then is handled by msgpack without a custom type.
Slot Allocation: How do you plan to solve this without the need that every module knows all slots allocations of every other module AND the satisfaction of 3. Stability? Keep in mind that we plan to separate modules from the core so that they are working independently (except of dependency relations of course)

dset0x commented 9 years ago

@crepererum Thanks for your very useful input.

`5. That's a very good question. In fact it hints to do this instead: Do to not allow modules to define types, but only serialization/de-serialization methods on their classes (which reduce them to relatively basic types that we support in the core of Invenio, as you describe in (4)).

`3. It seems to me that defining only types that are generic enough to last 10 years limits us to types of the core library. Assuming we only allow types to be defined in Invenio (and not modules), would it not be enough to reload the brokers on upgrades, or is this an irrational request for someone doing admin work?

crepererum commented 9 years ago

@dset0x can you provide a number mapping of your answer, because GitHub Markdown does not process numbered lists which do not follow the 1,2,... pattern :disappointed:

dset0x commented 9 years ago

@crepererum Woops, updated!

crepererum commented 9 years ago

+5. What about an invenio/ext/msgpack module which introduces some commonly used types?

+3. 10y was not meant to be a standard. It was more meant to be like "do not alter the format on every update". About the update procedure: As far as I know we only use msgpack for the broker and celery results (not for the session storage, because that one uses pickle which is assumed to be fine in that case). If we alter the format, the admin has to do the following:

block new incoming broker messages (maintenance mode starts here)
flush entire msg queue
wait until all messages and their results are processed
update invenio (format changes at this point)
unlock broker (maintenance mode ends here)

Alternative: Introduce new format at new slot, keep the deserializer for the old format alive and run serializer@new and deserializer@new+old for one/some version(s). Put a clear note into the update to tell the admin what happens when he skips specific versions.

In general we should find a way to flush the queues, so the first option would be a very useful admin tool anyway. Maybe this is already possible? (cc @lnielsen)

lnielsen commented 9 years ago

Other things to consider is Flower support (i.e. can you inspect the message).

Besides datetime and/or enum what else do you need now? And how does Celery know how to deserialise a type?

I would be very careful to introduce new serializers (i.e goes with "3. stability"). Datetime is one of the few I do see could be very useful.

dset0x commented 9 years ago

5. It may belong out of the celery module, although if its scope is solely celery, we could be littering theinvenio/` directory for no good reason).

`3. There is also the problem that there may be items in the queue that should not be ran until a specific point in time. I like the alternative more.

Assuming an upgrade with no type deletions or changes, but only additions:
```
# Before upgrade:
- Slot1 -> Type1
# After upgrade:
- Slot1 -> Type1
- Slot2 -> Type2
```
Version2 of msgpack_append should be able to run both versions. Am I right in thinking that @crepererum?
Now, for more difficult upgrades, where we break a type, could we perhaps version the content_type, or even CELERY_ACCEPT_CONTENT(not sure which one would be the right one) so that incompatible workers will simply not get work they don't understand how to do?

@lnielsen I haven't used Flower; it looks like it could be of some help, although managing upgrades or issues related to them is not something we'd want to do over a web interface.

Right now datetime and enum are generic enough to add without much thought. On top of these we can probably add some generic type that looks for __dumps__ and __loads__ so that we use that instead of hacks like this: https://github.com/dset0x/invenio/blob/checker-port/invenio/modules/checker/rules.py#L297

lnielsen commented 9 years ago

I haven't used Flower; it looks like it could be of some help, although managing upgrades or issues related to them is not something we'd want to do over a web interface.

Sorry, my message was somewhat short :-) I'm not suggesting to manage the upgrades etc via Flower. Flower is basically for monitoring to see what tasks gets executed in the queue. It's critical to have it in a production environment. So my concern was that Flower should be able to inspect the message parameters out of the box.

In general somewhat concerned about changing to use something else that default msgpack. E.g. if you wanted worker and publisher not to share the same code, then you would suddenly need to specify the msgpack_append on the worker as well. I can see arguments for adding datetime, but I definitely wouldn't allow generic dumps and loads.

dset0x commented 9 years ago

In general somewhat concerned about changing to use something else that default msgpack. E.g. if you wanted worker and publisher not to share the same code, then you would suddenly need to specify the msgpack_append on the worker as well. I can see arguments for adding datetime, but I definitely wouldn't allow generic dumps and loads.

@lnielsen

The workers need to be aware of the application anyway so that they know what tasks they support.
What is the problem with dumps and loads, if their result boils down to known types? Edit: Sorry, I think I was misunderstood about dumps and loads. These will not be methods that return json (already mentioned in (1) of https://github.com/inveniosoftware/invenio/issues/3295#issuecomment-114040194)

crepererum commented 9 years ago

@dset0x in case you just add new types, the update is kind of easy*: just ensure that all consumers are updates before the producers. No need to flush the queue.

*) "easy" might be not that easy in the case you have a circular setup, e.g. because the worker nodes results are somehow pushed back to the web nodes. I am not sure when is can be the case. As far as I remember we ignore the celery results in most of the cases, but that should be checked before. If there is a circular design, then you have to update both at the same time, which is not that funny anymore.

lnielsen commented 9 years ago

The workers need to be aware of the application anyway so that they know what tasks they support.

Yes, but you can e.g. have tasks defined in a self-contained package which Invenio imports on your web node, and then have a reduced/simplified worker without Invenio but just the self-contained package. Then you just tell Celery/RabbitMQ that all tasks of this type is routed to that specific worker. An example this constellation could be e.g. used for wrapping a video encoding system.

My concern is that we're changing from using something completely default out-of-the-box to a custom solutions which we will have to maintain and in addition it introduces issues with e.g. upgrades. I.e. I'm trying to figure out if we really need this extra data type serialisation or if we can live fine without it.

Do you perhaps have some examples of celery tasks that would benefit from the RFC?

dset0x commented 9 years ago

@crepererum Correct me if I'm wrong - I'm not that familiar with celery: If Webnode1 pushes work to Consumer1, it should always be the case that the result of Consumer1 ends up in Webnode1 (and not Webnode2), since "reply" in this sense, is Webnode1 polling for an answer. Unless by "pushed back to the web nodes" you mean a purely hypothetical work model.

@lnielsen

Yes, but you can e.g. have tasks defined in a self-contained package which Invenio imports on your web node, and then have a reduced/simplified worker without Invenio but just the self-contained package. Then you just tell Celery/RabbitMQ that all tasks of this type is routed to that specific worker. An example this constellation could be e.g. used for wrapping a video encoding system.

Do you perhaps have some examples of celery tasks that would benefit from the RFC?

This RFC arose from https://github.com/inveniosoftware/invenio/issues/3288, while we were trying to amend the breakage that was introduced by switching the column type to UUID. All we had as an alternative to this was manually casting the UUID to a string in the business logic. I figured this was unacceptable and wrote the PR linked in the first post.

Aforementioned hack here https://github.com/inveniosoftware/invenio/issues/3295#issuecomment-114074237 . It would be nice to:

Not have to manually call these methods (https://github.com/dset0x/invenio/blob/checker-port/invenio/modules/checker/tasks/tasks.py#L219)

Not have to use json I must add in this use-case that it is not yet certain if the checker module will actually go this way, but at least it hopefully underlines why supporting arbitrary types which know how to serialize themselves would be beneficial.

Come to think of it however, the serialized data would have to conform to some standard (so that it can be identified as Foo class or Bar class). This is actually replicating the msgpack's slots, but on our side instead. Solves some issues, creates some others.

crepererum commented 9 years ago

@dset0x That's half true (or let's say incomplete): Web_1 (w.l.o.g.) pushes to Broker (Redis or RabbitMQ or ...), ONE of the Workers/Consumers (Worker_i in the following) gets the payload from there. If Web_1 needs the result, it gets it back. This is load balancing and you do not know who Worker_i will be. So you cannot update half of the workers without any additional admin work (e.g. separating queues for "old" and "new"). (feel free to ask for a proof here

As a result: If Web_x for any x wants a result sometimes (say with a probability significantly over 0), then all Workers needs to be updates in sync with all Web nodes. Or you need to split your queues, which I think is not that easy.

dset0x commented 9 years ago

@crepererum I could be still (!) missing the bigger picture here, but couldn't we potentially solve this by versioning our protocol (point (3), second bullet)? Although, I'm only guessing that the broker would handle this correctly.

crepererum commented 9 years ago

@dset0x The broker only ensures that no messages are lost, that the load is balanced and that some messages can be delayed/scheduled. it does not know about the format+content of the messages. This is handled by celery (celery is NOT the broker btw) and if we just set msgpack as a message format and wildly modify some slots, than even celery is not aware of our problem.

But (here comes the good part): We could introduce a protocol version (which includes setting the message format to something like invenio-msgpack-1.0) and kinda solve the problem IF celery does the following: "do not touch messages with unkown protocols and leave them in the queue". We have to check if this is satisfied.

dset0x commented 9 years ago

@crepererum

"do not touch messages with unkown protocols and leave them in the queue"

Handling dead-letter exchanges may be what fits the bill here(?)

Regardless, I had a chat with @jalavik about the issue: since it appears that we have no direct, unavoidable reason to get this working right now (in fact we are changing the architecture of the checker, which was one of the two examples above), I will direct my attention at other problems for the moment. We can keep the code around for a time when we have time to experiment with it, or a bigger use-case appears. (But feel free to give it a shot if you like!)

jirikuncar commented 8 years ago

Proposed solution https://github.com/inveniosoftware/invenio/issues/3295#issuecomment-114757309
If needed reopen in Flask-CeleryExt.

inveniosoftware / invenio