Grace period timeout occurred in ReportAfterSuite()

colintg commented 8 months ago

Version

ginkgo v2.11.0

Issue

  [TIMEDOUT] A grace period timeout occurred
  In [ReportAfterSuite] at:

  Full Stack Trace

  This is the Progress Report generated when the grace period timeout occurred:
    In [ReportAfterSuite] (Node Runtime: 30.002s)

      At [By Step] Dumping logs (Step Runtime: 25.119s)

Context

Ginkgo runs our test cases on 3 parallel workers. We dump logs in ReportAfterSuite() because we only want logs from clusters whose tests failed. Each test case uses a random cluster name, which is recorded with AddReportEntry(); the names of the failed clusters are then extracted by iterating through SpecReports.

ReportAfterSuite() also contains retry logic: if an attempt fails, the program sleeps for 1 min and retries, with 2 retries in total.

Concerns

Is there a timeout for the ReportAfterSuite() node in the source code? I didn't explicitly set one. It seems the program timed out after 30 sec and failed to exit (maybe because it was sleeping?). I couldn't find any timeout in the source code, so please advise.
Can I set a custom timeout for the ReportAfterSuite() node, say 10 min, because there are many logs to dump? I looked into NodeTimeout(), but it requires a callback that accepts ctx context.Context, and I don't see where that fits in ReportAfterSuite(), which takes only one function with one argument, report types.Report.

Thank you!

yuhanqiutg commented 8 months ago

I found the relevant code here: https://github.com/onsi/ginkgo/blob/master/internal/suite.go#L842. That's why ReportAfterSuite takes the default GracePeriod (https://github.com/onsi/ginkgo/blob/master/types/config.go#L51) as its timeout. Is it expected behavior that every ReportAfter node times out according to the global GracePeriod? @onsi

onsi commented 8 months ago

Hey there,

The issue is that the entire suite is timing out because of the suite timeout (which you can modify using ginkgo -timeout=X). Once the suite times out, Ginkgo needs everything to finish up, and the ReportAfterSuite is given a period of time (called the GracePeriod) to do so.

You can increase the ReportAfterSuite’s GracePeriod like this:

var _ = ReportAfterSuite("foo", func(r Report) {
    …
}, GracePeriod(10*time.Minute))

There are more details here: https://onsi.github.io/ginkgo/#spec-timeouts-and-interruptible-nodes

colintg commented 8 months ago

Hi @onsi, thanks for your timely response!

ReportAfterSuite doesn't currently allow adding NodeTimeout or GracePeriod; doing so produces a runtime error:

  Invalid NodeTimeout SpecTimeout, or GracePeriod

  var _ = ginkgo.ReportAfterSuite("", func(report types.Report) {
    [ReportAfterSuite] was passed NodeTimeout, SpecTimeout, or GracePeriod but
    does not have a callback that accepts a SpecContext or context.Context.  You
    must accept a context to enable timeouts and grace periods

I found this in the official doc:

Currently the Reporting nodes (ReportAfterEach, ReportAfterSuite, and ReportBeforeEach) cannot be made interruptible and do not accept callbacks that receive a SpecContext. This may change in a future release of Ginkgo (in a backward compatible way).

Is there a plan to implement this feature in the near future?

onsi commented 8 months ago

oh duh, sorry. I'm not yet planning on making them interruptible. But you should be able to override GracePeriod on them and the fact that you can't is an oversight on my part.

For now you can simply use ginkgo -grace-period=10m - this will apply to all nodes but I suspect that will probably be OK for you. You can also look into adjusting the total suite timeout so that the interruption doesn't happen anyway!
