joehoyle opened 7 years ago
I'd actually be more of a fan to continue logging to disk and then use a log collector that forwards logs on to CloudWatch. Logging in the middle of the job introduces tight coupling between AWS services and Cavalcade (which I know we're not :100: against, but in general as a design principle, I'm not a huge fan). It also introduces an additional network connection in the middle of a job (multiple times) which can unnecessarily prolong the length of the job, and if for whatever reason CloudWatch is down, that can cause unexpected issues. This is why I'm generally a fan of using a log collector agent vs logging directly to the service itself inside the application. It allows us to compose tools instead of tightly couple them together.
Thoughts?
Agent makes sense to me personally, although it may be tougher to write than just writing an extension for Cavalcade Runner.
So, there are a couple of issues with the log collector approach:
I see disk log collection as a good starting point to at least get those text logs off the machine and into CloudWatch, but I see specific, more application-level logging as a way to get richer, easier-to-understand data into CloudWatch. If we implement it at the Job level like this, we can potentially show jobs in Vantage, with the output and log entry for each one. That would be incredibly difficult to do if it's plain text logs shipped agnostically to CloudWatch.
For the network latency, we are already doing non-blocking, UDP-style requests for Cavalcade Metrics, which isn't so bad either. We could also look at deferring / batching the sends in the Cavalcade Runner. After all, the process stays open for the life of Cavalcade, so we could batch quite easily while keeping the rich data.
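To make the batching idea concrete, here's a minimal sketch of what a deferred shipper inside the long-lived runner process could look like. The class and method names (`BatchedLogShipper`, `send`) and the thresholds are hypothetical, not actual Cavalcade Runner APIs:

```php
<?php
// Hypothetical sketch: buffer log entries in the long-lived runner
// process and flush them as a batch. Names and thresholds are
// illustrative, not real Cavalcade Runner APIs.
class BatchedLogShipper {
	private $buffer = [];
	private $max_batch;
	private $flush_interval;
	private $last_flush;

	public function __construct( $max_batch = 50, $flush_interval = 10 ) {
		$this->max_batch      = $max_batch;
		$this->flush_interval = $flush_interval;
		$this->last_flush     = time();
	}

	public function add( array $entry ) {
		$this->buffer[] = $entry;

		// Flush when the batch is full or enough time has passed.
		if ( count( $this->buffer ) >= $this->max_batch
			|| ( time() - $this->last_flush ) >= $this->flush_interval ) {
			$this->flush();
		}
	}

	public function flush() {
		if ( ! $this->buffer ) {
			return;
		}

		// In the real runner this would be one network call (e.g. a
		// single PutLogEvents or UDP send) carrying the whole batch.
		$this->send( $this->buffer );
		$this->buffer     = [];
		$this->last_flush = time();
	}

	protected function send( array $batch ) {
		/* network call goes here */
	}
}
```

Because the runner process never exits, a `flush()` on shutdown plus the size/time thresholds above would keep the job path free of blocking network calls.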
Last note too: logs have to be pushed to CloudWatch streams with a sequence token from the previous put, which makes it very difficult for multiple processes to push to a single log stream. We'd likely need to split the Cavalcade collector shipping to be per-instance, which might make the data even messier.
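For reference, this is roughly what the token chaining looks like with the AWS SDK for PHP — each `putLogEvents` call has to pass the `nextSequenceToken` returned by the previous call for that stream, which is why only one writer can safely own a stream (the log group / stream names here are illustrative):

```php
<?php
require 'vendor/autoload.php'; // aws/aws-sdk-php

use Aws\CloudWatchLogs\CloudWatchLogsClient;

$client = new CloudWatchLogsClient( [
	'region'  => 'us-east-1',
	'version' => '2014-03-28',
] );

// Each PutLogEvents call must pass the token returned by the previous
// call for the same stream, so concurrent writers will conflict.
$token = null;
foreach ( $batches as $events ) {
	$args = [
		'logGroupName'  => '/humanmade-production/cavalcade',
		'logStreamName' => gethostname(), // one stream per instance
		'logEvents'     => $events,       // [ [ 'timestamp' => ..., 'message' => ... ], ... ]
	];
	if ( null !== $token ) {
		$args['sequenceToken'] = $token;
	}
	$result = $client->putLogEvents( $args );
	$token  = $result['nextSequenceToken'];
}
```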
@joehoyle we encountered a similar issue at my old job regarding multiple workers writing to the same file. We used `supervisord` to spawn multiple worker processes, had the workers write their logs to `STDOUT`, and then `supervisord` redirected the logs to an indexed log file (eg `worker-1.log`, `worker-2.log`). If we follow the 12 Factor App methodology, we would log to `STDOUT` and let the machine take care of the rest.
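For illustration, the `supervisord` setup we had looked roughly like this — one `[program:x]` section spawning N processes, each with its own indexed log file (the command and paths here are made up for the Cavalcade case):

```ini
; Sketch of the supervisord approach: one program definition spawns N
; workers, each logging to its own indexed file. Command and paths are
; illustrative.
[program:cavalcade-worker]
command=php /srv/cavalcade/runner.php
numprocs=4
process_name=%(program_name)s-%(process_num)d
; Workers write to STDOUT; supervisord redirects to per-worker files.
stdout_logfile=/var/log/cavalcade/worker-%(process_num)d.log
redirect_stderr=true
autorestart=true
```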
If we used a logging library like https://github.com/Seldaek/monolog, we'd be able to write out structured logs (like we would with the CloudWatch logger). That'd be my vote.
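With Monolog that would be something like the following — a `StreamHandler` on `php://stdout` with a `JsonFormatter`, so each record comes out as one JSON line for the collector to pick up (the job fields in the context array are hypothetical):

```php
<?php
require 'vendor/autoload.php'; // monolog/monolog

use Monolog\Logger;
use Monolog\Handler\StreamHandler;
use Monolog\Formatter\JsonFormatter;

// One JSON object per line on STDOUT; supervisord / the collector
// agent takes it from there.
$handler = new StreamHandler( 'php://stdout' );
$handler->setFormatter( new JsonFormatter() );

$log = new Logger( 'cavalcade' );
$log->pushHandler( $handler );

// Hypothetical job fields, just to show the structured context.
$log->info( 'Job completed', [
	'job_id' => 123,
	'hook'   => 'my_cron_hook',
	'status' => 'completed',
] );
```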
Note: we could implement a similar logging process using just upstart/systemd like Sidekiq does: https://github.com/mperham/sidekiq/tree/master/examples/upstart
They have one upstart job that launches N worker jobs.
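The systemd equivalent of that pattern would be a template unit, so `cavalcade@1` through `cavalcade@N` each launch one worker (unit name and paths are illustrative):

```ini
# Sketch: /etc/systemd/system/cavalcade@.service
# Enable cavalcade@1 .. cavalcade@N to run N workers.
[Unit]
Description=Cavalcade worker %i
After=network.target

[Service]
ExecStart=/usr/bin/php /srv/cavalcade/runner.php
# journald captures STDOUT per instance; a collector can read from there.
StandardOutput=journal
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Starting four workers would then be `systemctl enable --now cavalcade@{1..4}.service`.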
> We could also look at deferring / batching the sends in the cavalcade runner.
I'd feel much better about using CloudWatch logs directly if we did it that way 👍
Also of value (for getting inspiration on how to defer/batch) would be to look at the options Segment provides for sending analytics data to their service: https://segment.com/docs/sources/server/php/#configuration
(scroll down a bit to look at the "Consumers")
I realized I didn't finish an earlier thought. Regarding writing to multiple log files: by writing to `worker-1.log`, `worker-2.log`, etc, we could point the collector agent at each of those logs. Also, by using structured logging, we could send JSON objects (which would effectively be the same as using CloudWatch Logs directly, as long as I'm understanding how it behaves correctly).
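Pointing the (classic) CloudWatch Logs agent at the indexed files would look roughly like this in `awslogs.conf` — one section per file, with a per-instance, per-worker stream name (the group/stream names and paths are illustrative):

```ini
; Sketch: one awslogs.conf section per indexed worker log.
; Group/stream names and paths are illustrative.
[/var/log/cavalcade/worker-1.log]
file = /var/log/cavalcade/worker-1.log
log_group_name = /humanmade-production/cavalcade
log_stream_name = {instance_id}-worker-1
datetime_format = %Y-%m-%d %H:%M:%S

[/var/log/cavalcade/worker-2.log]
file = /var/log/cavalcade/worker-2.log
log_group_name = /humanmade-production/cavalcade
log_stream_name = {instance_id}-worker-2
datetime_format = %Y-%m-%d %H:%M:%S
```

Since each file gets its own stream, this also sidesteps the sequence-token problem mentioned above: only one agent writes to each stream.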
(sorry for the multiple messages)
Building on https://github.com/humanmade/hm-platform/pull/9, it would be good to send the log data to CloudWatch. Rather than shipping the Cavalcade log file (or maybe as well as), it would be good to implement a custom Cavalcade Runner Logger that can send to `/humanmade-production/cavalcade` with a stream ID of the job ID. We'd then output all the workers' output and so on there. Currently Cavalcade doesn't allow the logger to capture the worker stdout / stderr, but I'm sure this could be added.
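A rough sketch of what that per-job logger could look like — the class and its `log_job_output` method are hypothetical (Cavalcade Runner would first need hooks that pass through the worker's stdout/stderr), and only the AWS SDK calls (`createLogStream`, `putLogEvents`) are real APIs:

```php
<?php
require 'vendor/autoload.php'; // aws/aws-sdk-php

use Aws\CloudWatchLogs\CloudWatchLogsClient;
use Aws\Exception\AwsException;

// Hypothetical per-job logger: one CloudWatch stream per job ID.
// The interface is illustrative, not an existing Cavalcade Runner API.
class CloudWatchJobLogger {
	private $client;
	private $group = '/humanmade-production/cavalcade';

	public function __construct( CloudWatchLogsClient $client ) {
		$this->client = $client;
	}

	public function log_job_output( $job_id, $message ) {
		$stream = (string) $job_id; // stream ID = job ID

		// Create the stream on first write; ignore "already exists".
		try {
			$this->client->createLogStream( [
				'logGroupName'  => $this->group,
				'logStreamName' => $stream,
			] );
		} catch ( AwsException $e ) {
			// Stream probably exists already.
		}

		$this->client->putLogEvents( [
			'logGroupName'  => $this->group,
			'logStreamName' => $stream,
			'logEvents'     => [
				[
					'timestamp' => (int) ( microtime( true ) * 1000 ),
					'message'   => $message,
				],
			],
		] );
	}
}
```

Because each job writes to its own fresh stream, the first `putLogEvents` needs no sequence token, and concurrent jobs never contend for one.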