Feature request: Timeouts / Retry

mrzor commented 4 years ago

Greetings!

I'm facing an annoying race condition upon creating my Google Compute Engine project. Even though I activate the compute APIs and the dependency from the shell resource is clearly stated, it requires some extra time to propagate. I'm going to add a sleep as a workaround, but that's unsatisfactory for two reasons : first, it introduces extra delay (could be ready before sleep ends), second, it might just not be enough (could be still pending by the time sleep ends).

Resource > Operation timeouts indicates that some resources support custom timeouts. I'm not sure what the defaults are - but it's certainly something I'm interested in tweaking when adding sleep to my script.

SDK Docs > Resources > Retry indicates that some better support is available to providers for eventually consistent resources. I would love to be able to indicate that exit code N of my script means retry after M seconds, like so:

resource "shell_script" "eventually_consistent_thing" {
  lifecycle_commands {
    # ...
  }

  retry {
    on_exit_code = 30
    retry_delay = 5 # in seconds
  }
}

Such retry logic would allow me to speed things up at the expense of extra API calls. It would also make my config tolerant of edge cases where the propagation delay might just be greater than my hardcoded sleep workaround. I suppose that other eventually-consistent resources would benefit in just the same way.

Thanks for the great plugin, it's a life saver and a better alternative to local-exec when I need state, and a much better alternative to external data source because it .. keeps state. Good work!

scottwinkler commented 4 years ago

resource timeouts are definitely a good idea and not too difficult to implement. retry also sounds like a good idea, except i would probably add a max_retries attribute or something

mrzor commented 4 years ago

Thanks for the speedy answer :)

max_retries seem sensible indeed as a failsafe. I'm not sure how the retry logic intersects with timeouts. IF timeouts are counted against the whole operation - retries included - then max_retries doesn't add much to safety ; just to expressiveness. I'm not sure the corner cases are worth it. On the other hand, IF timeouts are per-attempt, then max_retries seems pretty must-have for safety.

In the context of retrying, passing some additional context through the environment could help the script make a better decision on "should we keep trying or not":

TF_SHELL_RETRY_ATTEMPT : 1 means first attempt, then 2, 3, 4. Conveniently this would allow to implement some flavor of exponential backoff script-side if desired. Also enables script to enforce max_retries if so desired (but provides less safety than handling that aspect in the provider).
TF_SHELL_RETRY_FIRST_ATTEMPT_TS : unix timestamp of first attempt. Enables script to give up before timeout if so desired. Convenient to reconcile multiple attempts in logs (barring some kind of TF_SHELL_INVOCATION_ID that isn't there either)

SebTardif commented 3 years ago

What's the actual behavior with timeout, is the go code hardcoded to timeout after 1 minute?

scottwinkler / terraform-provider-shell

Feature request: Timeouts / Retry #55