goss-org / goss

Quick and Easy server testing/validation
https://goss.rocks
Apache License 2.0
5.5k stars 470 forks source link

Added timeout for mount command in goss #877

Closed melvilgit closed 3 months ago

melvilgit commented 4 months ago

Checklist

Description of change

  1. We have noticed that mount tests hung and became unresponsive in some of our production VM.
  2. The fix is to add a timeout to mount command similar to addr/command.
  3. Default timeout is 1000 ms
melvilgit commented 4 months ago

@aelsabbahy Could you check this ? Some of our jenkins builds that run goss just hung since mount was not timing out.

aelsabbahy commented 4 months ago

OSS build credits ran out with Travis CI. I'll open a support ticket for them.

Once the credits are available, I'll close and reopen this PR to trigger the build.

aelsabbahy commented 4 months ago

CI is back up and running. Once the build is passed, I can review the code changes.

melvilgit commented 4 months ago

CI is back up and running. Once the build is passed, I can review the code changes.

@aelsabbahy The CI passed. Please have a look.

aelsabbahy commented 4 months ago

Eyeball check, this looks good. I assume this PR works locally for you for your needs. It doesn't seem to break any of the CI tests which adds confidence.

A few minor questions:

  1. What was the behavior before, did it hang forever or did it eventually fail. Trying to understand the change to the default behavior for existing users.

  2. One second seems like a good default to me, I'm wondering how many existing users have working tests that take longer than a second, if any.

  3. Related to the top two points: What causes this issue? perhaps understanding that will lead to a good assessment on a default timeout, or if it's a misuse of the mountinfo package, or something else.

Many thanks for submitting this btw. Just trying to understand the problem better before merging this solution.

melvilgit commented 4 months ago

Thanks @aelsabbahy for the review.

A few minor questions:

  1. What was the behavior before, did it hang forever or did it eventually fail. Trying to understand the change to the default behavior for existing users. Yes, It hung forever in some of our VM. We use goss periodically from our jenkins node and the test just hung on the VM since the mount system call wasn't responding. We eventually had to kill the goss process.
  2. One second seems like a good default to me, I'm wondering how many existing users have working tests that take longer than a second, if any. The issue is intermittent in some of the vms. We run the test in more than 20k vms and only a handful of them have the issue, meaning it is not widespread.
  3. Related to the top two points: What causes this issue? perhaps understanding that will lead to a good assessment on a default timeout, or if it's a misuse of the mountinfo package, or something else. The issue happens in vms with faulty hard drives. We did strace and it was just hung with select call strace on the process is looping at: select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)

Many thanks for submitting this btw. Just trying to understand the problem better before merging this solution.

aelsabbahy commented 4 months ago

Once this branch is updated it looks good to me assuming it correctly solved your issue.

I'm not sure how to reproduce it locally to test the failure situation. 😅

melvilgit commented 4 months ago

@aelsabbahy please take a look and merge the pr if good. Thanks :)

aelsabbahy commented 4 months ago

I'm on travel currently, I will review, merge, and release this in ~11 days or so. I'll make sure no other PRs are merged before this to avoid any more merges/rebases.

Thank you for the contribution, it's nice to see niche problems at scale 😄

melvilgit commented 3 months ago

@aelsabbahy Could you please review this once ?

aelsabbahy commented 3 months ago

Yup, I'll review it once I'm back from travel. I'm flying today, so sometime before the end of this weekend should be ample time for me to take a look at this.