Closed · migurski closed this issue 8 years ago
What do you think about having a semi-secret "timeout" parameter on the source that specifies how long the caching process is allowed to take? Throwing sources that are expected to take a long time onto another queue (or running them less frequently) could be another ticket?
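A rough sketch of what enforcement could look like on the worker side, assuming a hypothetical timeout key in the source JSON and a subprocess-based conversion step (neither is necessarily how the worker runs things today):

```python
# Hypothetical sketch only: honor a per-source "timeout" value around the
# conversion step. The key name and the subprocess call are assumptions,
# not the current worker implementation.
import subprocess

DEFAULT_TIMEOUT = 3 * 60 * 60  # seconds; mirrors the current three-hour limit

def run_conversion(command, source_config):
    ''' Run a conversion command, killing it if the per-source timeout passes. '''
    timeout = source_config.get('timeout', DEFAULT_TIMEOUT)
    try:
        subprocess.check_call(command, timeout=timeout)
    except subprocess.TimeoutExpired:
        # Could re-queue onto a slower queue here instead of failing outright.
        raise
```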
The tricky part is the way it interacts with the timeouts on EC2 machines in an auto-scaling group. Those are outside of the code and defined as three hours, so we need to figure out a smarter way to determine when the queue is empty and machines can be killed off.
Details on workers here: https://github.com/openaddresses/machine/blob/master/docs/components.md#worker
One thing I have considered is using a persistent store to track the number of workers currently working on tasks, which should make it easier to track actual processing time.
That would need to happen someplace inside here, possibly using a thread or signal to send pings: https://github.com/openaddresses/machine/blob/master/openaddr/ci/worker.py#L55-L68
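Something like this could wrap the task-handling code; a rough sketch only, with the actual ping left as a callable since the mechanism (queue write, table insert, Cloudwatch call) is still undecided:

```python
# Sketch of a background heartbeat thread for the worker loop. The `ping`
# callable is a placeholder for whatever actually records the heartbeat.
import threading

def start_heartbeats(ping, interval=60):
    ''' Call ping() every `interval` seconds until the returned stop() is called. '''
    stopped = threading.Event()

    def loop():
        while not stopped.wait(interval):
            ping()

    threading.Thread(target=loop, daemon=True).start()
    return stopped.set

# Usage inside the worker loop might look like:
#   stop = start_heartbeats(lambda: send_heartbeat(worker_id))  # send_heartbeat is hypothetical
#   try:
#       do_work(task)
#   finally:
#       stop()
```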
If we put these tasks on Amazon's SQS we could use the length of the SQS queue to determine how many machines to spool up in the Auto Scaling Group.
We already do that. The hard part is figuring out when it’s safe to spin them down.
FYI, scaling group is defined here: https://console.aws.amazon.com/ec2/autoscaling/home?region=us-east-1#AutoScalingGroups:id=CI+Workers+2.x;view=policies
Gotcha. I thought we were using Postgres for queueing.
There's some interesting tidbits in this article about auto-scaling Jenkins workers within an ASG: http://sysadvent.blogspot.com/2015/12/day-21-dynamically-autoscaling-jenkins.html
Basically, instead of watching external indicators like the SQS length, the worker could push information about the number of running tasks to CloudWatch, which the ASG could use to trigger scaling events.
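For what it's worth, pushing a custom metric is only a few lines with boto3; a rough sketch, with the metric name made up:

```python
# Sketch: publish the number of tasks this worker is currently running
# as a custom CloudWatch metric. The metric name is an assumption; the
# namespace matches the metric filter linked below.
import boto3

def report_running_tasks(count):
    cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')
    cloudwatch.put_metric_data(
        Namespace='openaddr.ci',
        MetricData=[{
            'MetricName': 'active workers',  # hypothetical name
            'Value': count,
            'Unit': 'Count',
        }],
    )
```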
You’re right, we do use Postgres, but we also use the size of the queue to determine when to spin things up. Does SQS offer some advantage here, for example a way to show queued items that are being worked on? I haven't figured out a way to make that work well with the extremely long job times we have; most queuing systems seem to assume fast runs and fast retries.
I think we’re thinking about the same thing though, with workers keeping Cloudwatch informed about the current size of the pool.
…maybe through the addition of a fourth Cloudwatch metric, something like "current work": https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#metrics:metricFilter=Pattern%253Dopenaddr.ci
Yep, as I read more of your doc (which is great, by the way), I'm seeing you're going for the same thing I'm thinking about.
In theory the worker should know that it's still working on something. On SQS, you keep extending the visibility timeout on the task being worked on and look at the "in flight" metrics in CloudWatch. In our current system, we would just throw an "active workers" counter in CloudWatch. (You just posted this as a comment as I was typing :smile:)
In either case, we would want to scale up when "available tasks" is above zero and scale down when "active workers" is zero. And then you also want to leave worker machines up for the full hour (because AWS billing rounds up to the hour).
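If the tasks did move to SQS, the visibility-timeout dance would look roughly like this (boto3 sketch; queue URL and timings are placeholders):

```python
# Sketch: keep an in-flight SQS message invisible to other workers by
# periodically extending its visibility timeout while the job runs.
import boto3

sqs = boto3.client('sqs', region_name='us-east-1')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/tasks'  # placeholder

def extend_visibility(receipt_handle, extra_seconds=600):
    ''' Push the message's visibility timeout out by another ten minutes. '''
    sqs.change_message_visibility(
        QueueUrl=QUEUE_URL,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=extra_seconds,
    )
```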
Might be time to look into SQS again. One reason we avoided it was to make automated testing easier, but this might be a big advantage. Right now, we scale down when “available tasks” has been zero for longer than the default job timeout, which is not great.
…but probably adding a fourth metric that combines queued tasks and current work would be better. Scale down when it’s been at zero for some amount of time.
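In ASG terms, the scale-down half could be a Cloudwatch alarm on that combined metric, along the lines of this sketch (alarm name, metric name, period, and policy ARN are all placeholders, not what's deployed):

```python
# Sketch: alarm when the combined queued-plus-active metric has been zero
# for a full hour, and attach it to a hypothetical scale-down policy.
import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

cloudwatch.put_metric_alarm(
    AlarmName='openaddr-ci-idle',                  # placeholder
    Namespace='openaddr.ci',
    MetricName='expected results',                 # placeholder name for the combined metric
    Statistic='Maximum',
    Period=300,                                    # 5-minute datapoints...
    EvaluationPeriods=12,                          # ...zero for a full hour
    Threshold=0,
    ComparisonOperator='LessThanOrEqualToThreshold',
    AlarmActions=['arn:aws:autoscaling:...:scalingPolicy:...'],  # scale-down policy ARN
)
```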
I’ve added and deployed a heartbeat queue; it doesn’t do anything useful yet.
Verified that heartbeats are working this morning:
2016-03-03 15:47:56,740 INFO: Got heartbeat 169295: {u'worker_id': u'0x12c25ad5a641'}
2016-03-03 15:52:57,751 INFO: Got heartbeat 169296: {u'worker_id': u'0x12c25ad5a641'}
2016-03-03 15:57:09,446 INFO: Got heartbeat 169319: {u'worker_id': u'0x64dc7437469'}
2016-03-03 15:57:56,525 INFO: Got heartbeat 169320: {u'worker_id': u'0x12c25ad5a641'}
2016-03-03 16:02:08,700 INFO: Got heartbeat 169369: {u'worker_id': u'0x64dc7437469'}
2016-03-03 16:07:09,050 INFO: Got heartbeat 169370: {u'worker_id': u'0x64dc7437469'}
Next step is to add a table where we can see those over time, and use that to populate a new Cloudwatch metric.
Making progress: got the new heartbeats table up and deployed. Next step will be to tighten up the timing on the worker loop and start pushing data about the worker count someplace useful.
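The read side of that could be as simple as this sketch; the table and column names here are guesses, not the deployed schema:

```python
# Sketch: count workers with a recent heartbeat, for publishing as an
# "active workers" metric. Table and column names are assumptions.
from datetime import timedelta

def count_active_workers(db, window=timedelta(minutes=10)):
    ''' Count distinct workers heard from within the window. '''
    db.execute('''SELECT COUNT(DISTINCT worker_id)
                  FROM heartbeats WHERE heartbeat_time > NOW() - %s''',
               (window, ))

    (count, ) = db.fetchone()
    return count
```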
Going to let this sit for the weekend to accumulate some data and to see how the end of a batch run looks in the metrics; it might soon be possible to switch to driving auto-scaling behavior from the new "expected results" metric, a simple sum of queued tasks and active workers.
The Cloudwatch "active workers" metric can now determine the end of a batch set, so the timeout value can be decoupled from the autoscale shrink configuration.
Certain country- and state-wide sources routinely take longer than three hours to process. There should be a way to run these, either manually or on some other, slower queue.