opt-elixir / faktory_worker

Elixir Faktory worker https://hexdocs.pm/faktory_worker
MIT License

Add warn/error level logging in telemetry to appropriate events #166

Closed notactuallytreyanastasio closed 2 years ago

notactuallytreyanastasio commented 2 years ago

This comes in response to some problems we had in production. Ideally, we should have warning/error level logs for certain events.

This does the following:

It keeps the same formatting as the existing info-level helper, so as not to break any saved searches folks may have, but adds this granularity for easier inspection in situations like mass job failure.
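To illustrate the idea of keeping one message format while varying only the level, here is a minimal sketch. It is not the code in this PR: the module name, the event names (`:ack`, `:failed_ack`, `:slow_fetch`), and the `[worker]` prefix are all hypothetical stand-ins, and the real set of events lives in the library's telemetry module.

```elixir
defmodule LogLevelSketch do
  require Logger

  # Hypothetical event-to-level mapping; these event names are
  # illustrative assumptions, not the library's actual events.
  @levels %{
    ack: :info,
    slow_fetch: :warning,
    failed_ack: :error
  }

  # Events without an explicit entry keep logging at :info.
  def level_for(event), do: Map.get(@levels, event, :info)

  # A single formatter shared by every level, so text-based
  # saved searches keep matching regardless of severity.
  # The "[worker]" prefix is a placeholder.
  def format(message), do: "[worker] #{message}"

  # Logs the message at the level mapped to the event and
  # returns that level.
  def log(event, message) do
    level = level_for(event)
    Logger.log(level, format(message))
    level
  end
end
```

With this shape, `LogLevelSketch.log(:failed_ack, "could not ack job")` emits the same text an info log would, only at error level, so filtering by level becomes possible without changing what existing searches match on.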

Aside

The problems this stems from largely hinged on failures to ACK jobs upstream. Because those events were logged at info level, it was a bit harder to dig into things.

As a next step, we may want to consider adding backpressure handling for sending all these messages when putting jobs onto the queue. If we kick off a couple thousand jobs, the failures pile up, and in one instance this led to a partial outage for us. It's reasonable to assume this can happen when flooding the connection. #157 made it so that we don't raise and cause a ton of noise/failures in parent applications, but ideally a user shoving thousands of jobs up to the queue would be able to do so without writing preventative code or a band-aid atop this library's API.
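For discussion, one shape this could take is capping the number of pushes in flight at once. The sketch below uses `Task.async_stream/3` from the standard library to do that. It is only an assumption about a possible direction, not an existing API of this library: `BoundedEnqueueSketch`, `enqueue_all/3`, and `push_fun` are hypothetical, with `push_fun` standing in for whatever call enqueues a single job and assumed to return `:ok` or `{:error, reason}`.

```elixir
defmodule BoundedEnqueueSketch do
  # Pushes every job through push_fun with at most max_in_flight
  # pushes running concurrently. Returns {ok_count, failures},
  # where failures is a list of {job, reason} in input order.
  #
  # push_fun is a hypothetical stand-in for the real enqueue call.
  def enqueue_all(jobs, push_fun, max_in_flight \\ 10) do
    {ok_count, failed} =
      jobs
      |> Task.async_stream(push_fun,
        max_concurrency: max_in_flight,
        timeout: 5_000,
        # A push that exceeds the timeout is killed and reported
        # as {:exit, :timeout} instead of crashing the caller.
        on_timeout: :kill_task
      )
      # Results come back in input order, so they can be paired
      # with the jobs that produced them.
      |> Enum.zip(jobs)
      |> Enum.reduce({0, []}, fn
        {{:ok, :ok}, _job}, {ok, failed} ->
          {ok + 1, failed}

        {{:ok, {:error, reason}}, job}, {ok, failed} ->
          {ok, [{job, reason} | failed]}

        {{:ok, other}, job}, {ok, failed} ->
          {ok, [{job, {:unexpected, other}} | failed]}

        {{:exit, reason}, job}, {ok, failed} ->
          {ok, [{job, reason} | failed]}
      end)

    {ok_count, Enum.reverse(failed)}
  end
end
```

A caller would pass its list of jobs and a one-argument function, then retry or report whatever comes back in the failures list. Having something like this inside the library, rather than in each parent application, is the point of the suggestion above.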