internetarchive / umbra

A queue-controlled browser automation tool for improving web crawl quality
Apache License 2.0
60 stars 25 forks source link

Adds routing_key to Queue creation #41

Closed ldko closed 9 years ago

ldko commented 9 years ago

Adds routing_key parameter to AmqpBrowserController and sets it on url_queue. Will resolve #40.

nlevitt commented 9 years ago

Thanks @ldko.

So calling kombu.Queue() with a routing_key tells it to create the binding in rabbitmq?

Umbra actually had a --routing-key option before, but I got rid of it. See https://github.com/internetarchive/umbra/commit/025db91dea15bbb8d80b66081e92a5d684d91ce3 My feeling was that the ability to configure the exchange and queue would be sufficient. What do you think about sticking with a hard-coded value, but passing it into the constructor so that the binding gets created?

Also what do you think about creating the binding for the other queue, the outgoing "requests" queue?

ldko commented 9 years ago

Calling kombu.Queue() in itself doesn't actually create the queue or bindings or anything in RabbitMQ, but when the Consumer is instantiated with url_queue, the Consumer's declare() is automatically called which will call declare() for the queue instances passed to it and via that, the queues are declared and bindings created in RabbitMQ.

Per the AMQP protocol, a message sent to an exchange of direct type with a routing-key header set, will only go to a queue if its routing/binding key matches the headers routing key. The reason I was thinking the "urls" queue needs a routing key set is because in Heritrix's AMQPProducer, the exchange was declared as a named direct exchange with a set routing key of "urls", so we are sending the messages to any queues if they are bound to the exchange but only if they have a routing key that matches "urls". Without setting the routing_key on the kombu.Queue, the "urls" queue created in RabbitMQ was bound to the "umbra" exchange but was not receiving the messages Heritrix was sending.

Is the change you are suggesting to take out the routing-key argument option and hardcoding the routing_key here on line 47? In the context of Heritrix, since on the Heritrix side the routing key is hard coded, it would make some sense to hard code the routing key for the umbra queue. However, there is a routing-key option on the queue-url and queue-json scripts packaged with Umbra, so having the option to set the routing key (binding key) when launching umbra makes more sense to me.

Regarding binding for the "requests" queue, as it is now, Umbra doesn't try to create that queue at all--same as how Heritrix currently doesn't create the "urls" queue that the AMQPProducer sends messages to. I don't think this is uncommon for producers to not declare queues since when sending with a routing key, theoretically you could be sending to multiple queues at once. If we just want to be sure at least one queue exists with the correct routing key, I can implement queue creation by the producers on both sides.

If we decide to have Umbra declare a queue to make sure it exists before publishing it, we would need to do it on each request. Something like create a kombu.Queue() object with name and routing_key set to client_id in on_request(), and when we call publish() pass in the queue instance with declare parameter.

So far, I made a couple changes on the Heritrix side of things just to get rid of the error when starting the AMQP receiver before its exchange existed and to set the default clientId passed with the publisher to be the same as the default queue name set for the receiver here. If we want to declare the "urls" queue with Heritrix, should the "urls" queue name be configurable or hard coded? Are there other AMQP related changes you want to see on the Heritrix side?

nlevitt commented 9 years ago

Thanks for the detailed explanation, that's helpful. It seems reasonable to me for the consumers to be responsible for declaring queue bindings, especially if that's an established convention. As for the question of hard-coding the routing key, I probably overlooked queue-json and queue-url when I switched it to be hard-coded. Either way seems ok to me though. So I'll merge this pull request.

I think your heritrix changes look fine. Looks like AMQPUrlReceiver is already declaring the queue binding. I suppose it kind of has to since they're created more on demand, and there can be many in use at the same time. https://github.com/internetarchive/heritrix3/blob/master/contrib/src/main/java/org/archive/crawler/frontier/AMQPUrlReceiver.java#L128