Feature: Returning informative exit codes

aaronsteers commented 1 year ago

As a way of communicating back to the orchestrator, it would be helpful to have 10-15 predefined exit codes for common failure scenarios.

Guidelines and best practices for exit codes

From https://tldp.org/LDP/abs/html/exitcodes.html:

exit codes 1 - 2, 126 - 165, and 255 [1] have special meanings, and should therefore be avoided for user-specified exit parameters.

From https://unix.stackexchange.com/a/604262/487180:

If you are making something that could be turned into a service, it's good to avoid conflicts with (or reuse meaning from) systemd's exit codes which defines code 2-7,200-242. This link also references BSD codes 64-78.

Therefore, a foundational strategy could be:

Use existing codes when there is an exact match with existing convention. (E.g. Keyboard interrupt)
Use custom Exit Codes between 3 and 125.
For each category, if we think the detailed codes are not fully inclusive, then reserve an "other/general" integer for that category.
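As a rough illustration of the strategy above, this Python sketch checks a candidate exit code against the reserved ranges quoted from the two sources. The function and constant names are hypothetical, and the ranges are copied from the quotes above rather than verified against each upstream specification.

```python
from typing import List

# Reserved ranges as quoted above (inclusive). Labels are illustrative only.
RESERVED_RANGES = {
    "shell (TLDP)": [(1, 2), (126, 165), (255, 255)],
    "systemd": [(2, 7), (200, 242)],
    "BSD": [(64, 78)],
}

# The proposal places custom codes between 3 and 125.
CUSTOM_MIN, CUSTOM_MAX = 3, 125


def overlapping_conventions(code: int) -> List[str]:
    """Return the labels of the quoted conventions whose ranges include `code`."""
    return [
        name
        for name, ranges in RESERVED_RANGES.items()
        if any(low <= code <= high for low, high in ranges)
    ]


def in_custom_range(code: int) -> bool:
    """True if `code` falls in the proposed custom range (3 to 125)."""
    return CUSTOM_MIN <= code <= CUSTOM_MAX
```

A check like this is advisory only: the custom range 3 to 125 itself overlaps the systemd range (3-7) and the BSD range (64-78), so `overlapping_conventions(4)` reports `["systemd"]` even though `in_custom_range(4)` is true.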

Grouping of suggested return types, by category and remediation path

Each of these categories and sub-items could have a distinct return code so that the caller can understand what happened during the requested operation:

Success. (0)
- Return Code: 0
- Orchestrator action: Nothing to do, job was successful.
No-op Warning. (3)
- Proposed Exit Code: 3
- Orchestrator action: Tell the user that zero records were synced. The orchestrator can opt in to this behavior without user intervention, treating the return code as a non-fatal warning and optionally recording a warning in the logs to say that zero records were synced.
- Configuration option:
  - enable_noop_exit_code: True to return non-0 exit code for no-op sync operations. False to return 0. Default is False (no-op sync tracked as success.)
Aborted with Partial Success. (Future) (4-9)
- Description: Signifies that incremental progress was made, despite the connector receiving an abort request. The orchestrator should expect that continually retrying will eventually result in a full sync operation. The connector should not return partial success if retrying will result in an infinite loop. Instead, a "Process Abort" exit code (defined in the section below) must be returned. To qualify as partial success, and so rule out an infinite retry loop, both of the following must hold:
  1. The sync operation must not be a no-op. Meaning, either one or more FULL_TABLE syncs were fully completed, or one or more INCREMENTAL state messages were successfully delivered as resumable bookmarks.
  2. If repeated, the process must be designed to eventually catch up or report a proper failure message. Meaning one or both of these is true:
    - The tap successfully completed all FULL_TABLE streams and reached at least one resumable bookmark for an INCREMENTAL stream. (Full table syncs may need to be ordered first by the tap to prioritize a partial success status. Not yet implemented in the SDK.)
    - OR: The tap uses STATE to resume sync on the same stream where the previous sync left off. (Not yet implemented in the SDK.)
- Orchestrator action: Nothing to do, job was partially successful. Report that at least some progress was made, and that the user should run again to get more records.
- Proposed Exit Codes:
  - 4 SIGTERM / KeyboardInterrupt received and sync operation was wrapped up successfully; sync is resumable.
  - 5 Max record volume limit reached; sync is resumable. (Additional records available on source.)
  - 6 Max elapsed time limit reached; sync is resumable. (Additional records available on source.)
  - 7-8 Reserved for future use.
  - 9 General/Other. (Additional records available on source.)
Process Abort. (10-19, 130, 137)
- Orchestrator action: Nothing to do, process was aborted by user or by user's config parameters.
- Proposed Return Codes: 10-19, 130, 137
  - 10 Operation aborted due to elapsed time restriction.
  - 11 Operation aborted due to record count restriction.
  - 12-18 Reserved for future use.
  - 19 General/Other.
  - 130 Operation aborted by SIGINT or KeyboardInterrupt (Control+C).
  - 137 Operation aborted by SIGKILL.
Configuration Error. (20-29)
- Orchestrator action: Inform the user to double-check their config. Provide documentation links to the end user to help them resolve.
- Remediation: This category generally requires user action. This does not necessarily imply there's a problem in the tap or target. More likely, the user just needs to take another pass at reviewing the config documentation, and/or double-check their credentials. Worst case scenario, this could indicate stale or incomplete documentation.
- Proposed Return Codes: 20-29
  - 20 Config validation error: missing required value.
  - 21 Config validation error: data type mismatch.
  - 22 Config validation error: validation failed (other).
  - 23 Authentication or authorization error. (Permission denied, password incorrect, etc.)
  - 24 Invalid input file paths. (For instance, the config.json or catalog.json do not exist or cannot be reached.)
  - 25-28 Reserved for future use.
  - 29 General/Other.
Environment Error. (Network, Files, or other Resources) (30-39)
- Orchestrator action: Nothing to do, tell the user what happened so they can take action re: RAM, storage, or networking.
- Remediation: Tell the user the issue: out of memory, out of storage space, or unreachable server.
- Proposed Return Codes: 30-39
  - 30 Out of memory.
  - 31 Out of disk space.
  - 32 Network issue or host-not-found.
  - 33 File not found.
  - 34 File not writeable.
  - 35-38 Reserved for future use.
  - 39 General/Other.
Connector Failure. (1, 40-59, 141)
- Orchestrator action: Tell the user there appears to be a bug in the tap or target.
- Remediation: This class of issues indicates a bug in the connector or in the backend API.
- Proposed Return Codes: 1, 40-59, 141
  - Shared failure codes (taps, targets, and mappers):
  - 40 Singer Spec error in STDIN stream or input files.
  - 41-48 Reserved for future use.
  - 49 or 1 General/Other. Connector experienced unhandled exception.
  - Tap-specific failures:
  - 50 Misshapen data from source system or source data failed validation.
  - 51 Source data processing error.
  - 141 Target stopped listening ("Broken pipe")
    - Orchestrator action: Tell the user that the target appears to have failed. (Check target's exit code for more info.)
  - 52-54 Reserved for future use.
  - Target-specific errors:
  - 55 Data validation error in input stream.
    - Orchestrator action: Tell the user (of the target) that there appears to be a bug in the upstream tap.
    - Remediation: This class of error, raised only by targets, indicates a failure that actually occurred upstream in the tap.
  - 56 Data processing error.
  - 57-59 Reserved for future use.
Application or API Failure (Custom). (60-79)
- These 20 codes can be used for any custom exit codes that a connector would like to report. Each connector can emit codes that are specific to its use case, and these do not have to be aligned across applications.
- Proposed to define 2 groups:
  - 60-69 Application Failures (Retriable) - These are likely to succeed if retried later.
  - 70-79 Application Failures (Non-Retriable) - These are unlikely to succeed unless action is taken by the user.
- E.g. Redshift could report "S3 bucket in wrong region" (non-retriable) and "cannot query table due to vacuum operation already in progress" (retriable), while SQLite could report "cannot obtain write access lock (WAL) on filesystem" (retriable).
- Orchestrator action: Treat this as a handled exception from the developer: inform the user what the error code is and ask them to check the logs for more information. Optionally print the tail of the log because this is likely to contain the specific error description as authored by the developer.
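To make the grouping above concrete, here is a minimal Python sketch of how the proposed codes might be declared and mapped back to their categories. Nothing here exists in the SDK: the `ExitCode` member names, `CATEGORY_RANGES`, and `category_for` are hypothetical, and only a subset of the proposed codes is listed.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """A subset of the proposed codes; member names are hypothetical."""

    SUCCESS = 0
    UNHANDLED_EXCEPTION = 1
    NOOP_WARNING = 3
    PARTIAL_SIGTERM = 4
    PARTIAL_MAX_RECORDS = 5
    PARTIAL_MAX_TIME = 6
    ABORT_TIME_LIMIT = 10
    ABORT_RECORD_LIMIT = 11
    CONFIG_MISSING_VALUE = 20
    CONFIG_AUTH = 23
    ENV_OUT_OF_MEMORY = 30
    ENV_NETWORK = 32
    SINGER_SPEC_ERROR = 40
    ABORT_SIGINT = 130
    ABORT_SIGKILL = 137
    BROKEN_PIPE = 141


# (low, high, category), inclusive, using the ranges from the list above.
CATEGORY_RANGES = [
    (0, 0, "Success"),
    (1, 1, "Connector Failure"),
    (3, 3, "No-op Warning"),
    (4, 9, "Aborted with Partial Success"),
    (10, 19, "Process Abort"),
    (20, 29, "Configuration Error"),
    (30, 39, "Environment Error"),
    (40, 59, "Connector Failure"),
    (60, 69, "Application Failure (Retriable)"),
    (70, 79, "Application Failure (Non-Retriable)"),
    (130, 130, "Process Abort"),
    (137, 137, "Process Abort"),
    (141, 141, "Connector Failure"),
]


def category_for(code: int) -> str:
    """Map an exit code to its proposed category, or "Unknown" if unassigned."""
    for low, high, category in CATEGORY_RANGES:
        if low <= code <= high:
            return category
    return "Unknown"
```

Keeping the ranges in one table means an orchestrator can still categorize a code from a reserved slot (for example 45) even before that slot has a specific meaning.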

For purposes of monitoring and reporting the quality and stability of taps and targets, really only the "Connector Failure" codes are relevant here. The "Configuration Error" category might also be a sign of poor or outdated docs. Assuming the other errors are correctly raised, all other issue groups are user errors, OS/container issues, or networking issues.
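One way a monitoring job might apply this distinction is sketched below, assuming it has a list of exit codes collected from past runs. The bucket names and both functions are hypothetical.

```python
from collections import Counter
from typing import Iterable


def monitoring_bucket(code: int) -> str:
    """Classify an exit code by what it says about connector quality."""
    # Connector Failure: 1, 40-59, and 141 in the proposal above.
    if code == 1 or 40 <= code <= 59 or code == 141:
        return "connector_failure"
    # Configuration Error: 20-29, a possible sign of poor or outdated docs.
    if 20 <= code <= 29:
        return "configuration_error"
    # Everything else (success, user aborts, environment issues, and so on).
    return "other"


def summarize(exit_codes: Iterable[int]) -> Counter:
    """Count how many runs fall into each monitoring bucket."""
    return Counter(monitoring_bucket(code) for code in exit_codes)
```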

Why do we need this?

Today orchestrators like Meltano have no way to distinguish what actually happened if a subprocess fails - except for a human to manually read over the detailed log files. By adding this into the SDK, the return code of the subprocess would immediately tell Meltano how to advise the user on next steps. Other orchestrators like Airflow could also incorporate these return codes when deciding whether to attempt a retry, and how to message back to users on next steps.
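As a sketch of the orchestrator side, the snippet below runs a connector as a subprocess and decides whether a retry is worthwhile based only on the return code. The retry ranges follow the proposal above (4-9 are resumable partial successes, 60-69 are retriable application failures); `should_retry` and `run_connector` are hypothetical names, and a child Python process stands in for a real tap.

```python
import subprocess
import sys
from typing import List, Tuple

# Inclusive ranges where the proposal suggests a later retry can succeed.
RETRY_RANGES = [(4, 9), (60, 69)]


def should_retry(returncode: int) -> bool:
    """True if the proposed meaning of `returncode` suggests retrying."""
    return any(low <= returncode <= high for low, high in RETRY_RANGES)


def run_connector(cmd: List[str]) -> Tuple[int, bool]:
    """Run the connector and return (exit code, whether to retry)."""
    completed = subprocess.run(cmd, check=False)
    return completed.returncode, should_retry(completed.returncode)


if __name__ == "__main__":
    # Stand-in for a real tap: a child process that exits with code 6
    # ("max elapsed time limit reached; sync is resumable").
    fake_tap = [sys.executable, "-c", "import sys; sys.exit(6)"]
    print(run_connector(fake_tap))  # prints (6, True)
```

With this in place, no log parsing is needed for the retry decision; the logs are only consulted when a human needs the details.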

Regarding "partial success" codes


All of the partial-success codes discussed here should probably have a config option that lets them return `0` if the caller doesn't care about one or all of the detailed status codes. There are use cases where we want to open up the idea of "partial" success, while still telling the caller what actually happened that made the sync less than a "full" success. For instance, when running in AWS Lambda, we will need an execution time limit. At the end of that time limit (provided in `config.json`, most likely), we'll expect the tap to try to wrap things up and close out its processes. Its return value in these cases should be `0` if all upstream records were successfully received within the window, or non-zero if more records were available that were not synced.

An orchestrator like Meltano will also want to know the difference between "Sync complete" and "Sync complete (no data found)". By providing a non-zero return code for the "no data found" case, we let Meltano message this properly to the user, rather than only being able to provide a simple "sync completed" message.
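The decision described here could look roughly like the following on the tap side. This is only a sketch under stated assumptions: `SyncOutcome` and `resolve_exit_code` are hypothetical, `enable_partial_success_exit_codes` is an invented option name (with an assumed default of `True`) standing in for the config option suggested above, and `enable_noop_exit_code` is the option from the issue description.

```python
from dataclasses import dataclass


@dataclass
class SyncOutcome:
    records_synced: int           # records emitted during this run
    more_records_available: bool  # source still has unsynced records
    time_limit_reached: bool      # the configured time limit ended the run
    resumable_progress: bool      # a FULL_TABLE stream completed or an
                                  # INCREMENTAL bookmark was delivered


def resolve_exit_code(outcome: SyncOutcome, config: dict) -> int:
    """Pick an exit code from the proposed table for a finished run."""
    if outcome.time_limit_reached and outcome.more_records_available:
        if not outcome.resumable_progress:
            # Retrying would not get further, so report a Process Abort (10).
            return 10
        # Hypothetical option: callers may collapse partial success to 0.
        if config.get("enable_partial_success_exit_codes", True):
            return 6  # max elapsed time limit reached; sync is resumable
        return 0
    if outcome.records_synced == 0:
        # No-op sync: non-zero only when the caller opted in.
        return 3 if config.get("enable_noop_exit_code", False) else 0
    return 0
```

Note that a run that hits the time limit but has already received every upstream record still returns `0`, matching the behavior described above.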

Precedent and existing return code conventions


aaronsteers commented 1 year ago

I've updated the above so that each error type is grouped together with similar errors. And I have added specific notes about the actions Meltano (or another orchestrator) might take on seeing a given code.

In regards to monitoring/reporting connector quality, "Group F" and "Group G" are the ones we'd watch for.

We'd generally also want to watch for Group D (configuration errors) as a sign of poor or outdated documentation.

cc @tayloramurphy, @DouweM, @pnadolny13

tayloramurphy commented 1 year ago

@aaronsteers I made https://github.com/meltano/internal-product/issues/187 to track the meta requirements

aaronsteers commented 1 year ago

@tayloramurphy - sounds good! I've added to the office hours board to collect ideas.

Spec-wise, and in terms of defining the path forward from here:

The last piece I'm not sure of would be which specific codes to use. We could try to find precedent of integer codes used already in prior art, or we could just start fresh and declare a new domain of custom return code integer values.

aaronsteers commented 1 year ago

I've updated the issue description to include a set of proposed (and tentative) exit code integers and reserved ranges.

Feedback and counter-suggestions much appreciated.

laurentS commented 1 year ago

Short comment to say I like this a lot! One situation we would like to report back is when the tap runs out of quota. I'm not sure which code above I'd use for this specific situation (maybe a custom one for the specific tap, though I feel like this use-case might be general enough to consider adding it to the sdk itself?).
