voxpupuli / puppet-r10k

Setup and configure r10k for use with git based environments in puppet
https://forge.puppet.com/puppet/r10k
Apache License 2.0

Webhook timeout with Gitlab #282

Open esalberg opened 7 years ago

esalberg commented 7 years ago

We had an issue where the default webhook timeout on Gitlab (10 seconds) was causing the webhook to fire multiple times and interfere with the creation of new directory environments.

For the short term, this has been resolved by increasing the Gitlab webhook timeout, but there's a possibility in the future that we might have other webhooks that we want to run that need a timeout shorter than 180 seconds (we may be able to reduce the timeout a bit, but some of the initial runs can take a while).
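(For anyone else hitting this: on an Omnibus GitLab install, the webhook timeout referred to above is typically raised in /etc/gitlab/gitlab.rb. The value below is only an illustration of the workaround, not a recommendation.)

```ruby
# /etc/gitlab/gitlab.rb -- illustrative sketch of the timeout workaround above.
# Raises how long GitLab waits for a webhook response before treating the hook
# as failed (the default has historically been 10 seconds).
gitlab_rails['webhook_timeout'] = 180
# Apply with: gitlab-ctl reconfigure
```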

I wanted to bring up this Gitlab ticket and see if it might be worth modifying the webhook, at least for Gitlab (or an option to do so, since the webhook is a template):

https://gitlab.com/gitlab-org/gitlab-ce/issues/2229, which says: "It's probably a better practice to change your Webhook to return 200 as soon as the message is successfully received, unless you really want the hook to retry upon some failure in the execution."

Is modifying the webhook so there's at least an option to return 200 right away something that might be possible? (I haven't written a webhook before myself, but if it's something where a PR would be helpful, I could definitely look at it - although I might not write it the best way.)

rnelson0 commented 7 years ago

@esalberg The webhook does not return the 200 immediately because it determines the return code based on the result of the full deployment. It is possible that the webhook will fire and start to operate as you'd expect, but then fail near the end and return a non-200 code.

This isn't necessarily the optimal way to do this, but that is the reasoning behind it. I will try and do some research to determine the correct method of response for webhooks to see if we should continue with this or change the response sequence. If you or anyone else finds some good information on this subject, please let me know. Thanks!
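(As a rough illustration of the behaviour described above - a minimal Sinatra-style sketch, not the module's actual webhook template; the route, branch handling, and command string are placeholders:)

```ruby
# Minimal sketch of a "respond only after the deploy finishes" handler.
require 'sinatra'
require 'open3'

post '/payload' do
  branch = 'production' # the real webhook parses this from the push payload
  _out, err, status = Open3.capture3("mco r10k deploy #{branch}")
  # The HTTP status is derived from the deploy's exit status, so the response
  # cannot be sent until the whole deployment has run.
  status.success? ? [200, "Deployed #{branch}\n"] : [500, "Deploy failed: #{err}\n"]
end
```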

esalberg commented 7 years ago

Thanks - even with the timeout on the webhook increased, we're still seeing evidence of overlapping r10k runs, just not quite as frequently. I'm trying to find time to do further testing to validate that the webhook is the culprit (as suggested by Puppet Support).

Per the GitLab documentation (https://docs.gitlab.com/ce/web_hooks/web_hooks.html), GitLab doesn't care about the return code at all, so is that just being used for logging? The docs say: "Your endpoint should send its HTTP response as fast as possible. If you wait too long, GitLab may decide the hook failed and retry it. GitLab ignores the HTTP status code returned by your endpoint."

I'm not 100% sure how to interpret that last sentence - unless they treat the HTTP response and the status code as two separate things, I take it to mean that any valid HTTP response would be sufficient (successful or not), as long as it gets back to GitLab in time?

If I wanted to test returning an HTTP response immediately, would I just need to put a "status 202" line (or similar) right after this line in the webhook bin template?

    command = "#{$command_prefix} mco r10k deploy #{branch}"

This way, I might at least be able to confirm that it's the webhook taking time to return that's causing the issue with deploying branches consistently across the masters, versus something else in the environment.
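(A hedged sketch of what such an early-return test could look like - not the module's template; it reuses the command string quoted above and accepts that the deploy's exit status can then only be logged, not reported back to GitLab:)

```ruby
# Sketch of an "acknowledge first, deploy in the background" handler.
require 'sinatra'
require 'open3'

post '/payload' do
  branch  = 'production' # parsed from the payload in a real handler
  command = "mco r10k deploy #{branch}" # the template prefixes this with $command_prefix
  Thread.new do
    _out, err, status = Open3.capture3(command)
    warn "deploy of #{branch} failed: #{err}" unless status.success?
  end
  [202, "Accepted: deploying #{branch}\n"] # sent before the deploy finishes
end
```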

Another related thread: https://gitlab.com/gitlab-com/support-forum/issues/350

rnelson0 commented 7 years ago

Re: overlapping runs. Is that because two PRs merged at the same time, or because of GitLab retrying? If the former, try the branch from #268 to see if it addresses it. If the latter, that branch won't fix the underlying issue but may reduce the side effects.
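(For context, one common way to keep overlapping deploys from racing each other is an exclusive file lock around the deploy command. The sketch below is generic and is not necessarily how #268 implements it; the lock file path is hypothetical.)

```ruby
# Generic serialization sketch: later webhook firings queue behind the lock
# instead of running concurrently.
require 'open3'

LOCKFILE = '/var/run/webhook-r10k.lock' # hypothetical path

def deploy_with_lock(branch)
  File.open(LOCKFILE, File::RDWR | File::CREAT, 0644) do |f|
    f.flock(File::LOCK_EX) # blocks until the in-flight deploy releases the lock
    Open3.capture3("mco r10k deploy #{branch}")
  end
end
```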

GitHub does care about the return code as far as logging goes: you can view a webhook and see the data sent and the return code received for every event. Otherwise it doesn't care. I think this is one of those things where we have to decide: are we signaling that the webhook was received, or that the action it triggers completed successfully? We currently do the latter.

esalberg commented 7 years ago

That branch looks great - I'll test it out asap.

Since the webhook is a template, I would think it could be possible to offer an option to forgo the extra logging in exchange for a quicker webhook return? Basically, let us decide whether we want to signal on option #1 or option #2, instead of only option #2. Granted, this may be moot if the updated code helps.

rnelson0 commented 7 years ago

@esalberg If that does not work, or you'd rather just have it return early always, I whipped up this gist. You could save that as a template somewhere in your controlrepo, maybe in the profile class, and then add r10k::webhook::bin_template: 'profile/webhook.bin.early_return.erb' or equivalent to test with it.

I wrote this on the road, away from where I can test it myself, so please keep that in mind - you'll definitely want to try it in a lab environment or be ready to roll back quickly if it breaks.

rnelson0 commented 7 years ago

@esalberg Any chance you have been able to try this? Thanks.

esalberg commented 7 years ago

Sorry about that - I got pulled off onto some other things, one of which was upgrading to PE 2016.4.2.

Could there be an update to r10k in 2016.4.2 (versus 2016.2.1) that might have addressed the issue with multiple r10k runs writing to the same place? Instead of the previous log lines about local changes and needing to check them in, now I see a new type of message:

Overwriting local modifications to /etc/puppetlabs/code/environments/logstash/modules/augeasproviders_grub

Unlike before, it does resolve the conflict (overwriting local changes is the correct action).

I still see errors like the following:

    Cannot write /etc/puppetlabs/code/environments_hiera/release42/.git/objects/info/alternates; parent directory does not exist

I wanted to reproduce the issue before trying the patch, but I have not been able to reproduce so far, even though it looks like multiple webhook runs can still occur. I can try reducing the timeout back down and see if that causes any issues - unfortunately, I'm out on PTO until January, so I won't be able to test until then.

UPDATE: when I ran simultaneous pushes for the puppet_control and puppet_hieradata repos (both using r10k), I had 3 of 6 prod servers and 1 of 3 test servers return the following for my test branch:

    ERROR -> Cannot write /etc/puppetlabs/code/environments_hiera/zzzzbranch/.git/objects/info/alternates; parent directory does not exist

Process output showing multiple webhook deploys:

    peadmin 24171  9871 0 01:05 ? 00:00:00 sh -c umask 0022; mco r10k deploy zzzzbranch
    peadmin 24172 24171 0 01:05 ? 00:00:00 r10k
    root    24194 31809 3 01:05 ? 00:00:01 /opt/puppetlabs/puppet/bin/ruby /usr/bin/r10k deploy environment zzzzbranch -p
    peadmin 25288  9871 0 01:06 ? 00:00:00 sh -c umask 0022; mco r10k deploy zzzzbranch
    peadmin 25292 25288 0 01:06 ? 00:00:00 r10k
    root    25398 31809 5 01:06 ? 00:00:01 /opt/puppetlabs/puppet/bin/ruby /usr/bin/r10k deploy environment zzzzbranch -p
    root    31331 24194 0 01:06 ? 00:00:00 git clone https://github.com/dalen/puppet-puppetdbquery.git /etc/puppetlabs/code/environments/zzzzbranch/modules/puppetdbquery --reference /opt/puppetlabs/server/data/r10k/https---github.com-dalen-puppet-puppetdbquery.git
    root    31483 25398 0 01:06 ? 00:00:00 git --git-dir /opt/puppetlabs/server/data/r10k/https---github.com-puppetlabs-puppetlabs-reboot.git fetch origin --prune

esalberg commented 7 years ago

I implemented the latest r10k module code + fork and was not able to reproduce either the multiple webhook deploys or the "parent directory does not exist" error, which I was able to do pretty consistently before the patch (even with a quick test). So I'm quite optimistic that this will work for us.

However, I had been using "include ::pe_r10k", which is no longer allowed (we were on an older, pre-voxpupuli version). For the test, I switched to "include ::r10k" - but is it possible to use the pe_r10k module bundled with Puppet Enterprise and this module's webhook pieces (r10k::mcollective, r10k::webhook, etc.) at the same time?

I won't have a chance to test dropping the GitLab timeout (which I need to do) or the gist for the webhook return (which I probably don't need to do, as long as the locking continues to work) until January.

rnelson0 commented 7 years ago

I don't know anything about the pe_r10k class, so I can't comment on that aspect. However, the signs look good that the patch fixes things. Given that you're probably focused on any remaining PE upgrade issues and then some vacation time, let's pick this back up in January. Thanks! -- Rob Nelson

esalberg commented 7 years ago

I was referring to this in init.pp:

    # Check if user is declaring both classes
    # Other classes like r10k::webhook is supported but
    # using both classes makes no sense unless given pe_r10k
    # overrides this modules default config
    if defined(Class['pe_r10k']) {
      fail('This module does not support being declared with pe_r10k')
    }

My masters all have:

    include ::pe_r10k
    include ::r10k::mcollective

and the MoM has in addition:

    class {'r10k::webhook::config':
    class {'r10k::webhook':

The issue appears to be in both webhook.pp and mcollective.pp, which both have:

    require ::r10k

Commits where this was added (we are using an older commit): mcollective.pp: https://github.com/voxpupuli/puppet-r10k/commit/d1fe76d7dee828250bb7000cc93500bd5a7248b5#diff-dca4e79d8eeeb74d4d8d77cf12eb6e14

webhook.pp: https://github.com/voxpupuli/puppet-r10k/commit/a2d4ce9b33ebca8186113e6d3204f10b8222401c#diff-99d1e71cb836710a60b0805c80d9b560

I noticed that the locking code was just merged today, so that's great. Any idea what should be done about that new require (I'm not sure why it was added, so I don't want to just say remove it - I'm assuming it was put in for a reason) so that I can move to the latest version of the module?

Do you want me to move this to a new issue?

rnelson0 commented 7 years ago

Yeah, let's make that a new issue. I'll try to test that change before a new release goes out.

Edit: New issue opened. Is it okay to resolve this issue now?

cova-fe commented 6 years ago

I'm facing the same issue: GitLab times out because the webhook only answers after 10 seconds or so. Basically, mco r10k blocks until all deploys have finished, and that is far too slow to avoid the GitLab timeout. I can avoid the problem by forcing a fork instead of using Open3.capture3, but to do that I had to modify the code. Now I'm a bit perplexed about how to avoid this situation. Basically, it would be better (even if not optimal) for the webhook to return immediately, with the follow-on operations also handling the locking. What do you think?
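(A sketch of the "fork instead of capture3" idea described above - illustrative only, not the actual change; the log path is hypothetical:)

```ruby
# Start the deploy as a detached child process and answer GitLab without
# waiting for it. The exit status is no longer available to the HTTP response.
command = "mco r10k deploy #{branch}" # branch assumed to be set by the handler
pid = Process.spawn(command,
                    out: '/var/log/webhook-deploy.log', # hypothetical log path
                    err: [:child, :out])                # merge stderr into the log
Process.detach(pid) # reap the child in the background so it never becomes a zombie
# ...return 200/202 to GitLab here, before the deploy finishes.
```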

0x6d617474 commented 5 years ago

This should be fixed as of #425

kenyon commented 4 years ago

Close per previous comment?