meltano / sdk

Write 70% less code by using the SDK to build custom extractors and loaders that adhere to the Singer standard: https://sdk.meltano.com
Apache License 2.0

Error Handling and dead letter queues for targets #133

Open MeltyBot opened 3 years ago

MeltyBot commented 3 years ago

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/134

Originally created by @vischous on 2021-05-26 17:40:34


Following up on our office hours today. Not sure whether we want this to be target-only or not; your call, @aaronsteers.

Error handling, especially with SaaS-style targets, gets pretty interesting. Here are errors you'll hit at some point (the ones I can think of off the top of my head; there are tons more, everything you can imagine when you run this stuff at scale):

  1. Connection issues. For HTTP requests: 500 errors and timeouts in every way you can imagine: "Server Busy", "Internal Error", etc. Hopefully your libraries have sane defaults for connection timeouts and read timeouts; targets will need to change these at times.
  2. Data issues. For HTTP you'll get response codes all over the place depending on the API, but generally something like 406, 403, 404, 400, etc.: "User already exists", "Name is invalid (over char limit)", "Unknown Error occured", "Cannot disable user due to them having xyz permissions".

Each of these errors needs to be handled slightly differently. For some, a simple retry with exponential backoff fixes your problem.
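To make the distinction concrete, here is a minimal sketch of retrying only the transient class of errors with exponential backoff. The names `RetriableError`, `DataError`, and `send_with_backoff` are hypothetical and not part of the SDK; how a target maps a given response to one class or the other is assumed to happen inside `send`.

```python
import random
import time


class RetriableError(Exception):
    """Transient failure (timeout, 5xx) that may succeed on retry. Hypothetical name."""


class DataError(Exception):
    """Record-level failure (e.g. 'Name is invalid') that a retry won't fix. Hypothetical name."""


def send_with_backoff(send, record, max_tries=5, base_delay=0.5, sleep=time.sleep):
    """Call send(record), retrying only RetriableError with exponential backoff.

    Any other exception, including DataError, propagates on the first attempt.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return send(record)
        except RetriableError:
            if attempt == max_tries:
                raise  # out of attempts: let the caller decide what to do
            # Delay doubles on each attempt; a little jitter avoids synchronized retries.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))
```

The `sleep` parameter is only there so the waiting can be swapped out (for example in tests); a data error masked as a 500 would still be retried here, which is one reason the retry count needs a cap.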

Data issues are something you can't get away from, and for a lot of SaaS APIs (many are not HTTP-based, by the way; see Active Directory and others) you'll get data errors that are masked as things like 500 errors.

Functionality that's probably needed:

  1. An error handling strategy for "hard" or "soft" errors. One record failing out of 1000 should still output something to stderr/stdout, and the target process should return an exit code other than 0, but it's nowhere near as critical as all 1000 records failing, which would need an exit code of 1.
  2. Configuration that lets users of targets change the thresholds. Everyone has different use cases. Thresholds could be percentage-based or a fixed number of rows, e.g. >10 failed rows is a "hard" failure.
  3. Retry logic
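
A rough sketch of how items 1 and 2 could fit together, assuming hypothetical names throughout: `classify_run`, `max_failed_rows`, and `max_failed_pct` are made up for illustration, and the value 2 for a "soft" failure is an assumption (the proposal only asks for something other than 0).

```python
EXIT_OK = 0
EXIT_HARD = 1  # "hard" failure, as proposed above
EXIT_SOFT = 2  # assumption: any non-zero code other than 1 would do


def classify_run(total, failed, max_failed_rows=10, max_failed_pct=None):
    """Map failure counts to an exit code using user-configurable thresholds.

    max_failed_rows: more failed rows than this is a hard failure.
    max_failed_pct: optional; a higher failure percentage is also a hard failure.
    """
    if failed == 0:
        return EXIT_OK
    hard = failed > max_failed_rows
    if max_failed_pct is not None and total > 0:
        hard = hard or (100.0 * failed / total) > max_failed_pct
    return EXIT_HARD if hard else EXIT_SOFT
```

With the defaults, `classify_run(1000, 1)` is a soft failure and `classify_run(1000, 1000)` is a hard one; a target would pass the result to `sys.exit()` after logging the failed records.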

Some of this (maybe all?) could be handled by a dead letter queue of some sort.
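
For illustration only, a dead letter queue in its simplest form could be a JSON Lines stream that failed records get appended to so the rest of the batch can continue. This is a sketch under assumptions, not an SDK design: `DeadLetterQueue`, `load_records`, and the entry fields are all hypothetical.

```python
import json
from datetime import datetime, timezone


class DeadLetterQueue:
    """Appends failed records, with their error, to a text stream as JSON Lines."""

    def __init__(self, stream):
        self.stream = stream  # any object with write(), e.g. an open file
        self.count = 0

    def put(self, record, error):
        entry = {
            "record": record,
            "error": str(error),
            "failed_at": datetime.now(timezone.utc).isoformat(),
        }
        self.stream.write(json.dumps(entry) + "\n")
        self.count += 1


def load_records(records, send, dlq):
    """Send each record; park failures in the DLQ instead of aborting the run."""
    loaded = 0
    for record in records:
        try:
            send(record)
            loaded += 1
        except Exception as exc:  # broad on purpose in this sketch
            dlq.put(record, exc)
    return loaded
```

After the run, `dlq.count` gives the number of failed records, which is the input a threshold check would need, and the file can be inspected or replayed later.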

Use cases that I know about today:

MeltyBot commented 2 years ago

View 13 previous comments from the original issue on GitLab

louis-vines commented 1 year ago

What is the status of this feature? Seems like a pretty useful use case.

WillDaSilva commented 1 year ago

CC @visch @tayloramurphy @aaronsteers

tayloramurphy commented 1 year ago

@louis-vines a first pass for us would likely be this issue:

With better exit codes for SDK-based connectors we can start to handle each error better overall. Likely we need to break this issue up into specific proposals and make progress on those. cc @aaronsteers

stale[bot] commented 1 year ago

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

tayloramurphy commented 1 year ago

Still relevant

stale[bot] commented 1 month ago

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.