golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
122.74k stars 17.5k forks source link

x/build/maintner: isTempError improvements #28994

Open orthros opened 5 years ago

orthros commented 5 years ago

Currently isTempError logs the error and vacuously returns true

func isTempErr(err error) bool {
    log.Printf("IS TEMP ERROR? %T %v", err, err)
    return true
}

This is used in the sync function's error group members in conjunction with loop to determine if the error returned from the various source sync functions is a temporary error or not.

for {
  err := gr.sync(ctx, token, loop)
  if loop && isTempErr(err) {
    log.Printf("Temporary error from github %v: %v", gr.ID(), err)
    time.Sleep(30 * time.Second)
    continue
  }
  log.Printf("github sync ending for %v: %v", gr.ID(), err)
    return err
}

This guarantees that if SyncLoop is called, the routines will loop forever (loop is true and isTempErr always returns true), but it also means that if Sync is called, and any repository sync operation fails, the entire error group will exit. In a corpus that is tracking hundreds of repositories with tens of thousands of issues, a single temporary error in the Sync call will always abort all other routines in the group.

It would be useful to be able to provide criteria for what constitutes a temporary error or not.

I propose something similar to the following:

type errCheck func(error) bool

// SyncLoop runs forever (until an error or context expiration) and
// updates the corpus as the tracked sources change.
func (c *Corpus) SyncLoop(ctx context.Context, check errCheck) error {
    if check == nil {
        check = isTempErr
    }
    return c.sync(ctx, true, check)
}

// Sync updates the corpus from its tracked sources.
func (c *Corpus) Sync(ctx context.Context, check errCheck) error {
    if check == nil {
        check = isTempErr
    }
    return c.sync(ctx, false, check)
}
func (c *Corpus) sync(ctx context.Context, loop bool, check errCheck) error {
    if _, ok := c.mutationSource.(*netMutSource); ok {
        return errors.New("maintner: can't run Corpus.Sync on a Corpus using NetworkMutationSource (did you mean Update?)")
    }

    group, ctx := errgroup.WithContext(ctx)
    for _, w := range c.watchedGithubRepos {
        gr, token := w.gr, w.token
        group.Go(func() error {
            log.Printf("Polling %v ...", gr.id)
            for {
                err := gr.sync(ctx, token, loop)
                                isTempErr := check(err)
                if loop && isTempErr {
                    log.Printf("Temporary error from github %v: %v", gr.ID(), err)
                    time.Sleep(30 * time.Second)
                    continue
                }
                log.Printf("github sync ending for %v: %v", gr.ID(), err)
                                if isTempErr {
                                    err = nil
                                }
                return err
            }
        })
    }
  // Rest of function similarly calls check
}

This would give the consumers of corpus the ability to determine what constitutes a temporary error, but default back to the old behaviour by passing nil to Sync and SyncLoop

gopherbot commented 5 years ago

Change https://golang.org/cl/151658 mentions this issue: maintner: add ability to define custom checks on sync errors

andybons commented 5 years ago

@bradfitz @dmitshur

dmitshur commented 5 years ago

if Sync is called, and any repository sync operation fails, the entire error group will exit ... a single temporary error in the Sync call will always abort all other routines in the group

I agree that behavior of Sync (without loop) is suboptimal in that regard. I hope we can come up with a solution that is better than asking the user to provide a custom isTempErr function.

How would you like Sync to behave if one of the repos runs into an error? Should it retry temporary errors up to a fixed number of times, before giving up (and aborting all other repo goroutines)? Or should it collect errors it runs into, and report them at the end (without aborting other repo goroutines)? Or a mix of the two?

If we decide to make it possible for users to provide a custom isTempErr, perhaps it'd be cleaner to make it an exported field on Corpus rather than a new parameter.