Introduce error handling recommendations to the spec and improve error handling in existing taps and targets

anthonyp commented 7 years ago

While using taps and targets, there are different types of errors that may occur, some transient and some fatal.

Many errors in Singer-compatible taps and targets are treated as fatal, and this makes for a sometimes frustrating user experience because it means that one random (and possibly transient) failure in the middle of a very long data dump can cause the entire dump to quit unexpectedly.

As part of the spec and in general communication about Singer, it seems crucial to communicate to developers the importance of taps and targets being patient and resilient. When a tap or a target encounters an error that might resolve itself on a retry - like an API throwing a random 500, or being temporarily down - Singer-compatible code should retry as persistently as is reasonable rather than give up.

Of course, this won't always be possible. Sometimes errors really are fatal. In that case, the state functionality seems like a decent way to re-run a job that failed partway through without re-loading all of the data. That said, if this is Singer's expectation (that state is useful not only for normal delta data loads, but also for resuming failed loads), then it should be communicated as such in the spec so that tap and target developers understand the use cases for which they are developing. It would also be helpful to advise end users to consider using state functionality even if they are performing only one-off loads.
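
To make the resume-from-state idea concrete, here is a minimal sketch of a tap that skips rows already covered by a bookmark and emits a STATE message after each record. The stream name `users`, the `last_id` bookmark key, and the `sync` and `write_messages` helpers are all hypothetical, and the sketch assumes rows arrive ordered by ascending `id`. A real tap would also emit SCHEMA messages, which are left out here.

```python
import json
import sys

STREAM = "users"  # hypothetical stream name, for illustration only


def sync(rows, state):
    """Yield Singer RECORD and STATE messages for rows past the bookmark.

    Assumes `rows` is ordered by ascending "id" and that `state` is the
    value of the last STATE message saved from a previous run (or {}).
    """
    bookmark = state.get("last_id", 0)  # "last_id" is an assumed bookmark key
    for row in rows:
        if row["id"] <= bookmark:
            continue  # already loaded by an earlier, possibly failed, run
        yield {"type": "RECORD", "stream": STREAM, "record": row}
        bookmark = row["id"]
        # Emitting state after each record lets a rerun resume from here.
        yield {"type": "STATE", "value": {"last_id": bookmark}}


def write_messages(messages, out=sys.stdout):
    """Write each message as one JSON line, as Singer taps do on stdout."""
    for message in messages:
        out.write(json.dumps(message) + "\n")
        out.flush()


if __name__ == "__main__":
    rows = [{"id": 1}, {"id": 2}, {"id": 3}]
    # Pretend an earlier run failed after loading id 1.
    write_messages(sync(rows, {"last_id": 1}))
```

If the run dies halfway, the last STATE value that was saved is passed back in on the next run, so even a one-off load can pick up where it stopped instead of starting over.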

mdelaurentis commented 7 years ago

Thanks for the suggestion. We could definitely enhance the docs to mention that Taps should be robust against transient API failures. Most of the Taps we've written use the backoff library to retry failed HTTP requests a small number of times. Does that seem like an acceptable solution for most errors? Are there any specific examples you found where a Tap or Target fails when it probably would have succeeded after retrying? I think you're mostly suggesting that we enhance the docs, but I'm wondering if there are any specific Taps or Targets that prompted you to mention this.
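
For readers unfamiliar with the pattern, retrying a small number of times with exponential backoff looks roughly like the sketch below. This is a hand-rolled, standard-library-only illustration, not the backoff library's API and not the code used in any existing Tap. The `TransientError` class, the `retry` decorator, and its parameters are hypothetical names chosen for the example.

```python
import functools
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. HTTP 500 or 429)."""


def retry(max_tries=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Retry the wrapped function on TransientError with exponential backoff.

    Any other exception is treated as fatal and propagates immediately.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_tries + 1):
                try:
                    return func(*args, **kwargs)
                except TransientError:
                    if attempt == max_tries:
                        raise  # out of attempts: surface the error as fatal
                    # Cap doubles each attempt: base, 2*base, 4*base, ...
                    cap = min(max_delay, base_delay * 2 ** (attempt - 1))
                    # Random jitter avoids many clients retrying in lockstep.
                    sleep(random.uniform(0, cap))
        return wrapper
    return decorator
```

A request function decorated with `@retry(max_tries=5)` would raise `TransientError` for responses it considers retryable and let everything else fail fast. Deciding which errors count as transient is the judgment call that the docs could give guidance on.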

anthonyp commented 7 years ago

@mdelaurentis Specifically, I ran into rate limiting errors with the gsheet target, and also some errors with HubSpot (bad code causing one stream to fail while other streams worked fine).

I suppose the level of resiliency required is subjective, but the experience did - at a higher level - prompt me to consider how important it is for tap and target developers to understand error/exception mitigation. So yes, this is mostly just a suggestion in regard to documentation and specs. Thanks!

singer-io / getting-started

Introduce error handling recommendations to the spec and improve error handling in existing taps and targets #20