This comes in response to some problems we had in production. Ideally,
we should have warning/error level logs for certain events.
This does the following:
- log warn when `NOTUNIQUE`
- log warn on heartbeat failure
- log error on job failure
- log warn on failed `ACK` with `:ok` status
- log error on failed `ACK` with `:error` status
It maintains the same formatting as the `info`-level helper already set
up, so as not to mess up any saved searches folks may have, but gives this
granularity for easier inspection in situations like mass job failure.
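A minimal sketch of the mapping, assuming hypothetical event atoms and a `log_event/2` helper (the names here are illustrative, not this library's actual API; only the level choices mirror the change described above):

```elixir
defmodule LogLevels do
  require Logger

  # NOTUNIQUE, heartbeat failures, and failed ACKs that still returned
  # :ok log at warn; job failures and :error ACKs log at error.
  def level_for(:not_unique), do: :warn
  def level_for(:heartbeat_failure), do: :warn
  def level_for({:ack_failure, :ok}), do: :warn
  def level_for(:job_failure), do: :error
  def level_for({:ack_failure, :error}), do: :error
  def level_for(_event), do: :info

  # Same message formatting as the existing info-level helper,
  # just emitted at the event-appropriate level.
  def log_event(event, message) do
    Logger.log(level_for(event), message)
  end
end
```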
Aside
The problems this stems from largely hinge on failures to `ACK` jobs upstream, and those logs being at the `info` level made it a bit harder to dig into things.
As a next step, we may want to consider adding backpressure handling to the sending of all these messages when putting jobs onto the queue. If we kick off a couple thousand, the failures pile up, and in one instance this led to a partial outage for us. It's reasonable to assume this is possible when flooding the connection. #157 made it such that we don't raise and cause a ton of noise/failures in parent applications, but ideally we would have a means for a user shoving thousands of jobs onto the queue to do so without writing preventative code or a band-aid atop this library's API.
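One possible shape for that backpressure, sketched with `Task.async_stream` bounding the number of in-flight pushes (`push_job/1` is a stub standing in for this library's real enqueue call, and the concurrency limit is an arbitrary example):

```elixir
defmodule BoundedPush do
  require Logger

  # Stub standing in for the library's actual enqueue call (assumption).
  def push_job(job), do: {:ok, job}

  # Bound in-flight pushes so a burst of thousands of jobs doesn't flood
  # the connection; failures surface as error logs rather than raises.
  # Returns {ok_count, error_count}.
  def push_all(jobs, concurrency \\ 50) do
    jobs
    |> Task.async_stream(&push_job/1, max_concurrency: concurrency)
    |> Enum.reduce({0, 0}, fn
      {:ok, {:ok, _job}}, {ok, err} ->
        {ok + 1, err}

      other, {ok, err} ->
        Logger.error("push failed: #{inspect(other)}")
        {ok, err + 1}
    end)
  end
end
```

Something like this would let callers fire off large batches without writing their own throttling atop the API.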