I took a gander at the job history for longdoor and don't see failures; however, the last line of the sample
return new Promise(resolve => setTimeout(resolve(jobName), 500));
just assumes the job is going to finish in a set amount of time, but these things can get delayed if there are a lot of other jobs in the queue. By design this test will be brittle, since it's hard to guarantee that a long-running operation completes within a set amount of time, which is why end users would instead just listen to Pub/Sub or poll for the job to complete without a deadline. Doubling the timeout would make it more stable for the most part, although there will always be a chance of too many tests running and thus timing out.
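(For what it's worth, the quoted line as written passes the result of calling resolve(jobName) to setTimeout, so the promise resolves immediately rather than after 500 ms; a fixed-delay version would need a callback wrapper, and it would still be brittle for the reasons above. A minimal sketch, with waitFixedDelay being just an illustrative name:)

```js
// Still a fixed sleep, but one that actually waits 500 ms before
// resolving with the job name (the original resolves right away,
// because resolve(jobName) is called and its return value is what
// gets passed to setTimeout).
const waitFixedDelay = jobName =>
  new Promise(resolve => setTimeout(() => resolve(jobName), 500));
```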
@realjordanna So we have a kind of conflict here: we should provide a sample small enough to be understood (we do have samples using Pub/Sub, but they are obviously more complicated), but at the same time we must have a test for this sample that always passes, otherwise the continuous integration goes mad.
That said, I consider the sample misleading, as the comment says
// Wait for DLP job to fully complete
return new Promise(resolve => setTimeout(resolve(jobName), 500));
while it should actually say 'wait for 500 ms and hope that the job completes by that time'. Also, since this code might fail in CI, it can just as well fail in real users' code. Sleeping for a constant timeout and hoping it's enough is evil anyway.
@ace-n and @jmdobry, do you think the risk samples can be rewritten to query the actual job status without making them look too complicated at the same time?
These samples do in fact listen to PubSub. The problem is that the API as it stands sends the PubSub message as the job results are written to the DB, not after.
TL;DR this is an eventual consistency problem.
The "correct" client-side solution to this is to use some sort of polling (ideally exponential backoff with random jitter), but not every language will necessarily have a drop-in library for that.
We can get around this by encoding the logic in the sample itself, but that makes the sample look even more cluttered and lengthy than it is now. (And these samples are, IMHO, quite cluttered and lengthy due to the PubSub logic.)
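To make that concrete, here is a minimal sketch of what encoding that logic in a sample could look like; isJobDone is a placeholder callback (not an existing helper) for whatever check the sample would do, e.g. fetching the DLP job and inspecting its state:

```js
// Sketch only: generic exponential backoff with full random jitter.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function waitForJob(isJobDone, {maxAttempts = 10, baseDelayMs = 500} = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    if (await isJobDone()) {
      return;
    }
    // Delay grows exponentially; random jitter spreads out concurrent pollers.
    await sleep(Math.random() * baseDelayMs * 2 ** attempt);
  }
  throw new Error('Job did not complete within the allotted attempts');
}
```

Even this bare-bones version adds a dozen lines to every sample, which is exactly the clutter concern.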
We can also avoid this by putting the appropriate logic in the client library - which gives us a decent UX at the cost of additional maintenance work.
Of course, the band-aid solution is to a) increase the timeout or b) fix only the languages that have convenient exponential backoff libraries.
The server-side solution is to only write to PubSub after the job is readable (as I've suggested in the past). This is also a very user-friendly option. (I'm not sure people want their Cloud Functions triggering on nonexistent results, for example. This prevents that.)
Thanks @ace-n!
I strongly believe increasing a timeout is never a solution, and I'd ask not to apply this kind of band-aid :)
I still feel there are two separate (though related) issues here: the first is the sample sometimes not working due to eventual consistency, and the second is the test failing when there is no real problem. My original intent here was to concentrate on the latter.
What do you think of marking those risk tests as skipped until we get a better idea on how to handle the polling? I can create a separate issue for that if needed.
For me, right now, having a consistently clean nightly run is a higher priority than better test coverage, as flaky tests can mask more serious problems.
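For reference, a minimal sketch of what skipping one of the flaky risk tests could look like, assuming the sample tests use ava (the `✖ risk › …` output format suggests that); the test title is copied from one of the failing tests in this thread:

```js
// Sketch: skip a flaky risk test in ava until the polling question is settled.
const test = require('ava');

// test(...) becomes test.skip(...); the test body itself stays unchanged.
test.skip('should perform k-map analysis on a single field', async t => {
  // ...existing test body...
});
```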
I'm happy to mark them as flaky (or skipped) - it's still a band-aid, but a more reliable one nonetheless. 😄
If customers will run into this, then we should actually provide a solution (ideally in either the client library or on the server side), but if it needs a workaround in client code, we should provide that. (Every customer shouldn't have to figure this out themselves.)
Taking a look server-side ... we did find a bug where we were publishing to Pub/Sub too soon. That fix should roll out tomorrow, and we'll see how it impacts the flakiness. I hope it fixes it entirely.
Thanks @realjordanna! I'll wait for several days and will update the thread then.
I'll close this issue for now since I don't see this problem anymore, will reopen if it fails again. Thanks!
Several risk sample tests often fail (causing about 1 out of 2 nightly runs to fail). For example, these two tests failed in the 4 recent nightly runs:
✖ risk › should perform k-map analysis on a single field https://circleci.com/gh/googleapis/nodejs-dlp/1770
✖ risk › should perform k-anonymity analysis on a single field https://circleci.com/gh/googleapis/nodejs-dlp/1753
Having flaky tests is probably worse than having no tests, because, as a result, we never pay any attention to nightly failures. I suggest that we either investigate and fix the tests, or disable the failing tests to have a green CI all the time.
Cc: @crwilcox, @jmdobry