Closed · migurski closed this issue 8 years ago
What do you think about having a semi-secret "timeout" parameter on the source that specifies how long the caching process is allowed to take? Throwing sources that are expected to take a long time onto another queue (or running them less frequently) could be another ticket?
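A rough sketch of what enforcement could look like on the worker side, assuming a hypothetical timeout key in the source JSON and a subprocess-based conversion step (neither is necessarily how the worker runs things today):

```python
# Hypothetical sketch only: honor a per-source "timeout" value around the
# conversion step. The key name and the subprocess call are assumptions,
# not the current worker implementation.
import subprocess

DEFAULT_TIMEOUT = 3 * 60 * 60  # seconds; mirrors the current three-hour limit

def run_conversion(command, source_config):
    ''' Run a conversion command, killing it if the per-source timeout passes. '''
    timeout = source_config.get('timeout', DEFAULT_TIMEOUT)
    try:
        subprocess.check_call(command, timeout=timeout)
    except subprocess.TimeoutExpired:
        # Could re-queue onto a slower queue here instead of failing outright.
        raise
```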
The tricky part is the way it interacts with the timeouts on EC2 machines in an auto-scaling group. Those are outside of the code and defined as three hours, so we need to figure out a smarter way to determine when the queue is empty and machines can be killed off.
Details on workers here: https://github.com/openaddresses/machine/blob/master/docs/components.md#worker
One thing I have considered is using a persistent store to track the number of workers currently working on tasks, which should make it easier to track actual processing time.
That would need to happen someplace inside here, possibly using a thread or signal to send pings: https://github.com/openaddresses/machine/blob/master/openaddr/ci/worker.py#L55-L68
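Something like this could wrap the task-handling code; a rough sketch only, with the actual ping left as a callable since the mechanism (queue write, table insert, Cloudwatch call) is still undecided:

```python
# Sketch of a background heartbeat thread for the worker loop. The `ping`
# callable is a placeholder for whatever actually records the heartbeat.
import threading

def start_heartbeats(ping, interval=60):
    ''' Call ping() every `interval` seconds until the returned stop() is called. '''
    stopped = threading.Event()

    def loop():
        while not stopped.wait(interval):
            ping()

    threading.Thread(target=loop, daemon=True).start()
    return stopped.set

# Usage inside the worker loop might look like:
#   stop = start_heartbeats(lambda: send_heartbeat(worker_id))  # send_heartbeat is hypothetical
#   try:
#       do_work(task)
#   finally:
#       stop()
```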
If we put these tasks on Amazon's SQS we could use the length of the SQS queue to determine how many machines to spool up in the Auto Scaling Group.
We already do that. The hard part is figuring out when it’s safe to spin them down.
FYI, scaling group is defined here: https://console.aws.amazon.com/ec2/autoscaling/home?region=us-east-1#AutoScalingGroups:id=CI+Workers+2.x;view=policies
Gotcha. I thought we were using Postgres for queueing.
There's some interesting tidbits in this article about auto-scaling Jenkins workers within an ASG: http://sysadvent.blogspot.com/2015/12/day-21-dynamically-autoscaling-jenkins.html
Basically, instead of watching external indicators like the SQS length, the worker could push information about the number of running tasks to CloudWatch, which the ASG could use to trigger scaling events.
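For what it's worth, pushing a custom metric is only a few lines with boto3; a rough sketch, with the metric name made up:

```python
# Sketch: publish the number of tasks this worker is currently running
# as a custom CloudWatch metric. The metric name is an assumption; the
# namespace matches the metric filter linked below.
import boto3

def report_running_tasks(count):
    cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')
    cloudwatch.put_metric_data(
        Namespace='openaddr.ci',
        MetricData=[{
            'MetricName': 'active workers',  # hypothetical name
            'Value': count,
            'Unit': 'Count',
        }],
    )
```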
You’re right, we do use Postgres, but we also use the size of the queue to determine when to spin things up. Does SQS offer some advantage here, for example a way to show queued items that are being worked on? I haven't figured out a way to make that work well with the extremely long job times we have; most queuing systems seem to assume fast runs and fast retries.
I think we’re thinking about the same thing though, with workers keeping Cloudwatch informed about the current size of the pool.
…maybe through the addition of a fourth Cloudwatch metric, something like "current work": https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#metrics:metricFilter=Pattern%253Dopenaddr.ci
Yep, as I read more of your doc (which is great, by the way), I'm seeing you're going for the same thing I'm thinking about.
In theory the worker should know that it's still working on something. On SQS, you keep extending the visibility timeout on the task being worked on and look at the "in flight" metrics in CloudWatch. In our current system, we would just throw an "active workers" counter in CloudWatch. (You just posted this as a comment as I was typing :smile:)
In either case, we would want to scale up when "available tasks" is above zero and scale down when "active workers" is zero. And then you also want to leave worker machines up for the full hour (because AWS billing rounds up to the hour).
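If the tasks did move to SQS, the visibility-timeout dance would look roughly like this (boto3 sketch; queue URL and timings are placeholders):

```python
# Sketch: keep an in-flight SQS message invisible to other workers by
# periodically extending its visibility timeout while the job runs.
import boto3

sqs = boto3.client('sqs', region_name='us-east-1')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/tasks'  # placeholder

def extend_visibility(receipt_handle, extra_seconds=600):
    ''' Push the message's visibility timeout out by another ten minutes. '''
    sqs.change_message_visibility(
        QueueUrl=QUEUE_URL,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=extra_seconds,
    )
```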
Might be time to look into SQS again. One reason we avoided it was to make automated testing easier, but this might be a big advantage. Right now, we scale down when “available tasks” has been zero for longer than the default job timeout, which is not great.
…but probably adding a fourth metric that combines queued tasks and current work would be better. Scale down when it’s been at zero for some amount of time.
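In ASG terms, the scale-down half could be a Cloudwatch alarm on that combined metric, along the lines of this sketch (alarm name, metric name, period, and policy ARN are all placeholders, not what's deployed):

```python
# Sketch: alarm when the combined queued-plus-active metric has been zero
# for a full hour, and attach it to a hypothetical scale-down policy.
import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

cloudwatch.put_metric_alarm(
    AlarmName='openaddr-ci-idle',                  # placeholder
    Namespace='openaddr.ci',
    MetricName='expected results',                 # placeholder name for the combined metric
    Statistic='Maximum',
    Period=300,                                    # 5-minute datapoints...
    EvaluationPeriods=12,                          # ...zero for a full hour
    Threshold=0,
    ComparisonOperator='LessThanOrEqualToThreshold',
    AlarmActions=['arn:aws:autoscaling:...:scalingPolicy:...'],  # scale-down policy ARN
)
```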
I’ve added and deployed a heartbeat queue; it doesn’t do anything useful yet.
Verified that heartbeats are working this morning:
2016-03-03 15:47:56,740 INFO: Got heartbeat 169295: {u'worker_id': u'0x12c25ad5a641'}
2016-03-03 15:52:57,751 INFO: Got heartbeat 169296: {u'worker_id': u'0x12c25ad5a641'}
2016-03-03 15:57:09,446 INFO: Got heartbeat 169319: {u'worker_id': u'0x64dc7437469'}
2016-03-03 15:57:56,525 INFO: Got heartbeat 169320: {u'worker_id': u'0x12c25ad5a641'}
2016-03-03 16:02:08,700 INFO: Got heartbeat 169369: {u'worker_id': u'0x64dc7437469'}
2016-03-03 16:07:09,050 INFO: Got heartbeat 169370: {u'worker_id': u'0x64dc7437469'}
Next step is to add a table where we can see those over time, and use that to populate a new Cloudwatch metric.
Making progress: got the new heartbeats table up and deployed. Next step will be to tighten up the timing on the worker loop and start pushing data about the worker count someplace useful.
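The read side of that could be as simple as this sketch; the table and column names here are guesses, not the deployed schema:

```python
# Sketch: count workers with a recent heartbeat, for publishing as an
# "active workers" metric. Table and column names are assumptions.
from datetime import timedelta

def count_active_workers(db, window=timedelta(minutes=10)):
    ''' Count distinct workers heard from within the window. '''
    db.execute('''SELECT COUNT(DISTINCT worker_id)
                  FROM heartbeats WHERE heartbeat_time > NOW() - %s''',
               (window, ))

    (count, ) = db.fetchone()
    return count
```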
Going to let this sit for the weekend to accumulate some data and to see how the end of a batch run looks in the metrics; it might soon be possible to switch to driving auto-scaling behavior from the new "expected results" metric, a simple sum of queued tasks and active workers.
The Cloudwatch "active workers" metric can now determine the end of a batch set, so the timeout value can be decoupled from the autoscale shrink configuration.
Certain country- and state-wide sources routinely take longer than three hours to process. There should be a way to run these, either manually or on some other, slower queue.