joehoyle opened 7 years ago
I'd actually be more of a fan to continue logging to disk and then use a log collector that forwards logs on to CloudWatch. Logging in the middle of the job introduces tight coupling between AWS services and Cavalcade (which I know we're not :100: against, but in general as a design principle, I'm not a huge fan). It also introduces an additional network connection in the middle of a job (multiple times) which can unnecessarily prolong the length of the job, and if for whatever reason CloudWatch is down, that can cause unexpected issues. This is why I'm generally a fan of using a log collector agent vs logging directly to the service itself inside the application. It allows us to compose tools instead of tightly couple them together.
Thoughts?
Agent makes sense to me personally, although it may be tougher to write than just writing an extension for Cavalcade Runner.
So, there are a couple of issues with the log collector approach:
I see disk log collection as a good starting point to at least get those text logs off the machine and into CloudWatch, but I see specific, more application-level logging as a way to get richer, easier-to-understand data into CloudWatch. If we implement it at the Job level like this, we can potentially show jobs in Vantage, with the output and log entry for each one. That would be incredibly difficult to do if it's plain text logs shipped agnostically to CloudWatch.
For the network latency, we are already doing non-blocking, UDP-style requests for Cavalcade Metrics, which isn't so bad either. We could also look at deferring / batching the sends in the Cavalcade Runner. After all, the process stays open for the life of Cavalcade, so we could batch quite easily while keeping the rich data.
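To make the batching idea concrete, here's a minimal sketch of what a deferred shipper inside the long-lived runner process could look like. The class and method names (`BatchedLogShipper`, `send`) and the thresholds are hypothetical, not actual Cavalcade Runner APIs:

```php
<?php
// Hypothetical sketch: buffer log entries in the long-lived runner
// process and flush them as a batch. Names and thresholds are
// illustrative, not real Cavalcade Runner APIs.
class BatchedLogShipper {
	private $buffer = [];
	private $max_batch;
	private $flush_interval;
	private $last_flush;

	public function __construct( $max_batch = 50, $flush_interval = 10 ) {
		$this->max_batch      = $max_batch;
		$this->flush_interval = $flush_interval;
		$this->last_flush     = time();
	}

	public function add( array $entry ) {
		$this->buffer[] = $entry;

		// Flush when the batch is full or enough time has passed.
		if ( count( $this->buffer ) >= $this->max_batch
			|| ( time() - $this->last_flush ) >= $this->flush_interval ) {
			$this->flush();
		}
	}

	public function flush() {
		if ( ! $this->buffer ) {
			return;
		}

		// In the real runner this would be one network call (e.g. a
		// single PutLogEvents or UDP send) carrying the whole batch.
		$this->send( $this->buffer );
		$this->buffer     = [];
		$this->last_flush = time();
	}

	protected function send( array $batch ) {
		/* network call goes here */
	}
}
```

Because the runner process never exits, a `flush()` on shutdown plus the size/time thresholds above would keep the job path free of blocking network calls.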
Last note too: logs have to be pushed to CloudWatch streams with a sequence token from the previous put, which makes it very difficult for multiple processes to push to a single log stream. We'd likely need to split the Cavalcade collector shipping to be per-instance, which might make the data even messier.
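For reference, this is roughly what the token chaining looks like with the AWS SDK for PHP — each `putLogEvents` call has to pass the `nextSequenceToken` returned by the previous call for that stream, which is why only one writer can safely own a stream (the log group / stream names here are illustrative):

```php
<?php
require 'vendor/autoload.php'; // aws/aws-sdk-php

use Aws\CloudWatchLogs\CloudWatchLogsClient;

$client = new CloudWatchLogsClient( [
	'region'  => 'us-east-1',
	'version' => '2014-03-28',
] );

// Each PutLogEvents call must pass the token returned by the previous
// call for the same stream, so concurrent writers will conflict.
$token = null;
foreach ( $batches as $events ) {
	$args = [
		'logGroupName'  => '/humanmade-production/cavalcade',
		'logStreamName' => gethostname(), // one stream per instance
		'logEvents'     => $events,       // [ [ 'timestamp' => ..., 'message' => ... ], ... ]
	];
	if ( null !== $token ) {
		$args['sequenceToken'] = $token;
	}
	$result = $client->putLogEvents( $args );
	$token  = $result['nextSequenceToken'];
}
```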
@joehoyle we encountered a similar issue at my old job regarding multiple workers writing to the same file. We used `supervisord` to spawn multiple worker processes, had the workers write their logs to `STDOUT`, and then `supervisord` redirected the logs to an indexed log file (eg `worker-1.log`, `worker-2.log`). If we follow the 12 Factor App methodology, we would log to `STDOUT` and let the machine take care of the rest.
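For illustration, the `supervisord` setup we had looked roughly like this — one `[program:x]` section spawning N processes, each with its own indexed log file (the command and paths here are made up for the Cavalcade case):

```ini
; Sketch of the supervisord approach: one program definition spawns N
; workers, each logging to its own indexed file. Command and paths are
; illustrative.
[program:cavalcade-worker]
command=php /srv/cavalcade/runner.php
numprocs=4
process_name=%(program_name)s-%(process_num)d
; Workers write to STDOUT; supervisord redirects to per-worker files.
stdout_logfile=/var/log/cavalcade/worker-%(process_num)d.log
redirect_stderr=true
autorestart=true
```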
If we used a logging library like https://github.com/Seldaek/monolog, we'd be able to write out structured logs (like we would with the CloudWatch logger). That'd be my vote.
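With Monolog that would be something like the following — a `StreamHandler` on `php://stdout` with a `JsonFormatter`, so each record comes out as one JSON line for the collector to pick up (the job fields in the context array are hypothetical):

```php
<?php
require 'vendor/autoload.php'; // monolog/monolog

use Monolog\Logger;
use Monolog\Handler\StreamHandler;
use Monolog\Formatter\JsonFormatter;

// One JSON object per line on STDOUT; supervisord / the collector
// agent takes it from there.
$handler = new StreamHandler( 'php://stdout' );
$handler->setFormatter( new JsonFormatter() );

$log = new Logger( 'cavalcade' );
$log->pushHandler( $handler );

// Hypothetical job fields, just to show the structured context.
$log->info( 'Job completed', [
	'job_id' => 123,
	'hook'   => 'my_cron_hook',
	'status' => 'completed',
] );
```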
Note: we could implement a similar logging process using just upstart/systemd like Sidekiq does: https://github.com/mperham/sidekiq/tree/master/examples/upstart
They have one upstart job that launches N worker jobs.
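The systemd equivalent of that pattern would be a template unit, so `cavalcade@1` through `cavalcade@N` each launch one worker (unit name and paths are illustrative):

```ini
# Sketch: /etc/systemd/system/cavalcade@.service
# Enable cavalcade@1 .. cavalcade@N to run N workers.
[Unit]
Description=Cavalcade worker %i
After=network.target

[Service]
ExecStart=/usr/bin/php /srv/cavalcade/runner.php
# journald captures STDOUT per instance; a collector can read from there.
StandardOutput=journal
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Starting four workers would then be `systemctl enable --now cavalcade@{1..4}.service`.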
> We could also look at deferring / batching the sends in the cavalcade runner.
I'd feel much better about using CloudWatch logs directly if we did it that way 👍
Also of value (for getting inspiration on how to defer/batch) would be to look at the options Segment provides for sending analytics data to their service: https://segment.com/docs/sources/server/php/#configuration
(scroll down a bit to look at the "Consumers")
I realized I didn't finish an earlier thought. Regarding writing to multiple log files: by writing to `worker-1.log`, `worker-2.log`, etc, we could point the collector agent at each of those logs. Also, by using structured logging, we could send JSON objects (which would effectively be the same as using CloudWatch Logs directly, as long as I'm understanding how it behaves correctly).
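Pointing the (classic) CloudWatch Logs agent at the indexed files would look roughly like this in `awslogs.conf` — one section per file, with a per-instance, per-worker stream name (the group/stream names and paths are illustrative):

```ini
; Sketch: one awslogs.conf section per indexed worker log.
; Group/stream names and paths are illustrative.
[/var/log/cavalcade/worker-1.log]
file = /var/log/cavalcade/worker-1.log
log_group_name = /humanmade-production/cavalcade
log_stream_name = {instance_id}-worker-1
datetime_format = %Y-%m-%d %H:%M:%S

[/var/log/cavalcade/worker-2.log]
file = /var/log/cavalcade/worker-2.log
log_group_name = /humanmade-production/cavalcade
log_stream_name = {instance_id}-worker-2
datetime_format = %Y-%m-%d %H:%M:%S
```

Since each file gets its own stream, this also sidesteps the sequence-token problem mentioned above: only one agent writes to each stream.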
(sorry for the multiple messages)
Building on https://github.com/humanmade/hm-platform/pull/9, it would be good to send the log data to CloudWatch. Rather than shipping the Cavalcade log file (or maybe as well as), it would be good to implement a custom Cavalcade Runner Logger that can send to `/humanmade-production/cavalcade` with a stream ID of the job ID. We'd then output all the workers' output and so on there. Currently Cavalcade doesn't allow the logger to capture the worker stdout / stderr, but I'm sure this could be added.
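A rough sketch of what that per-job logger could look like — the class and its `log_job_output` method are hypothetical (Cavalcade Runner would first need hooks that pass through the worker's stdout/stderr), and only the AWS SDK calls (`createLogStream`, `putLogEvents`) are real APIs:

```php
<?php
require 'vendor/autoload.php'; // aws/aws-sdk-php

use Aws\CloudWatchLogs\CloudWatchLogsClient;
use Aws\Exception\AwsException;

// Hypothetical per-job logger: one CloudWatch stream per job ID.
// The interface is illustrative, not an existing Cavalcade Runner API.
class CloudWatchJobLogger {
	private $client;
	private $group = '/humanmade-production/cavalcade';

	public function __construct( CloudWatchLogsClient $client ) {
		$this->client = $client;
	}

	public function log_job_output( $job_id, $message ) {
		$stream = (string) $job_id; // stream ID = job ID

		// Create the stream on first write; ignore "already exists".
		try {
			$this->client->createLogStream( [
				'logGroupName'  => $this->group,
				'logStreamName' => $stream,
			] );
		} catch ( AwsException $e ) {
			// Stream probably exists already.
		}

		$this->client->putLogEvents( [
			'logGroupName'  => $this->group,
			'logStreamName' => $stream,
			'logEvents'     => [
				[
					'timestamp' => (int) ( microtime( true ) * 1000 ),
					'message'   => $message,
				],
			],
		] );
	}
}
```

Because each job writes to its own fresh stream, the first `putLogEvents` needs no sequence token, and concurrent jobs never contend for one.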