googleapis / repo-automation-bots

A collection of bots, based on probot, for performing common maintenance tasks across the open-source repos managed by Google on GitHub.
Apache License 2.0
626 stars 126 forks source link

Buildcop: Missing reports #727

Closed tmatsuo closed 4 years ago

tmatsuo commented 4 years ago

This morning we had some failures, but we didn't get reports from build cop bot. This time, the logs should be small, so I think it's another failure, so I'm filing a new bug.

Python 3.6 build

Excerpts from the log:


=================================== FAILURES ===================================
_________________________ test_model_list_get_evaluate _________________________
Traceback (most recent call last):
  File "/workspace/translate/automl/model_test.py", line 80, in test_model_list_get_evaluate
    assert "evaluation_metric" in out
AssertionError: assert 'evaluation_metric' in 'name: "projects/1012616486416/locations/us-central1/models/VAR6465236123462402048/modelEvaluations/452876416420741254"\ncreate_time {\n  seconds: 1595365133\n  nanos: 496593000\n}\nevaluated_example_count: 30\n\n'
=============================== warnings summary ===============================
dataset_test.py:28
  /workspace/translate/automl/dataset_test.py:28: PytestUnknownMarkWarning: Unknown pytest.mark.slow - is this a typo?  You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/latest/mark.html
    @pytest.mark.slow

-- Docs: https://docs.pytest.org/en/latest/warnings.html
-------- generated xml file: /workspace/translate/automl/sponge_log.xml --------
=========================== short test summary info ============================
FAILED model_test.py::test_model_list_get_evaluate - assert 'evaluation_metri...
======== 1 failed, 4 passed, 1 skipped, 1 warning in 199.61s (0:03:19) =========
Command pytest --junitxml=sponge_log.xml failed with exit code 1
Session py-3.6 failed.
[Buildcop] Sending logs to Build Cop Bot...
[Buildcop] See https://github.com/googleapis/repo-automation-bots/tree/master/packages/buildcop.
[Buildcop] Published sponge_log.xml (1365873899161698)!
[Buildcop] Done!

You can download the test.xml file from the artifact page

These are also builds with the same failure: Python 3.7 build Python 3.8 build

Python 3.8 build also has another failure which we also didn't get the report:

Session py-3.7 failed.
[Buildcop] Sending logs to Build Cop Bot...
[Buildcop] See https://github.com/googleapis/repo-automation-bots/tree/master/packages/buildcop.
[Buildcop] Published sponge_log.xml (1365946386256562)!
[Buildcop] Done!
tbpg commented 4 years ago

@bcoe did we get the no-Cloud-Tasks-for-Build-Cop change rolled out?

bcoe commented 4 years ago

@tbpg yes, build cop should not be using cloud tasks right now.

On one hand, this means we should be able to handle large payloads, on the other hand it makes it slightly harder to see failures.

tbpg commented 4 years ago

Looking at the logs, the payload seems to be missing when it gets to the buildcop Cloud Function. There are a lot of errors like this:

[undefined/undefined] No xunitXML and no testsFailed! Skipping. 

Perhaps this should be an error so we catch this in the future?

The last successful logs were the evening of July 21. EqdqTgzcOXC.

2448b1c62ad4d8f3ab80776e039a92af811bb6c7 and e381a7dbc4e2bdb3b3a3dae46d312cf2ed150385 were the two commits on July 21. But, they don't seem related? The gcf-utils update didn't update buildcop. The logging change only changed gcf-utils (and not the version of gcf-utils that other packages depend on).

azizsonawalla commented 4 years ago

Could it be that the scheduler/pubsub payload never had the repo/owner in it and was being injected by handleScheduled() (before we switched off background tasks): https://github.com/googleapis/repo-automation-bots/blob/cc3b7bccda259cf10bbba44f245aa9534892cffe/packages/gcf-utils/src/gcf-utils.ts#L455-L469

we could test this by either checking the messages in the pub/sub queue in GCP (I don't see any messages in there right now) or we could log the payload just before this line: https://github.com/googleapis/repo-automation-bots/blob/cc3b7bccda259cf10bbba44f245aa9534892cffe/packages/gcf-utils/src/gcf-utils.ts#L357

azizsonawalla commented 4 years ago

Ok I think I know what's wrong here. We originally had buildcop using Task queues to handle tasks in the background and so when we received a payload like this from Pub/Sub:

{
  "name": "buildcop",
  "type": "function",
  "location": "us-central1",
  "installation": {
    "id": "5943459"
  },
  "repo": "GoogleCloudPlatform/ruby-docs-samples",
  "commit": "bc55d50b4acf0f4144e94b2a77c34cba038be49f",
  "buildURL": "[Build Status](https://source.cloud.google.com/results/invocations/cbf6df00-2459-40e6-86d0-3574d4d87e6c), [Sponge](http://sponge2/cbf6df00-2459-40e6-86d0-3574d4d87e6c)",
  "xunitXML": "PD94bW...[edited]...Cg=="
}

We would then pass this payload to handleScheduled() in gcf-utils.ts: https://github.com/googleapis/repo-automation-bots/blob/4f928b100586a3c2d69acb846e6081c0b9f03038/packages/gcf-utils/src/gcf-utils.ts#L358-L360

handledScheduled() would then call scheduledToTask() with this payload: https://github.com/googleapis/repo-automation-bots/blob/4f928b100586a3c2d69acb846e6081c0b9f03038/packages/gcf-utils/src/gcf-utils.ts#L455-L458

scheduledToTask() would then transform this payload to add the repository object: https://github.com/googleapis/repo-automation-bots/blob/4f928b100586a3c2d69acb846e6081c0b9f03038/packages/gcf-utils/src/gcf-utils.ts#L501-L509

So when we stopped using background tasks by adding this flag we completely skipped this transformation process: https://github.com/googleapis/repo-automation-bots/blob/4f928b100586a3c2d69acb846e6081c0b9f03038/packages/buildcop/src/app.ts#L20

Buildcop is now looking for body.repository.name and body.repository.login but that doesn't exist because the repository doesn't exist: https://github.com/googleapis/repo-automation-bots/blob/cc3b7bccda259cf10bbba44f245aa9534892cffe/packages/buildcop/src/buildcop.ts#L113-L117

Solution:

Option 1 - we change buildcop to look at body.repo instead Option 2 - we modify gcf-utils to do the transformation and add the repository key even when background: false

Happy to make a PR with the fix if we decide on which option to go with...I would think option 2 would be the better choice.

tbpg commented 4 years ago

Thank you for the fix! Reopening because the bot is still not working. Let's wait until we get the gcf-utils upgrade through and we see successful logs.

To be clear, I'm good to change the buildcop code to get the owner/repo from wherever.

azizsonawalla commented 4 years ago

No worries! I'm waiting on @bcoe to merge this PR https://github.com/googleapis/repo-automation-bots/pull/762 (I believe he is testing a new release framework) before updating buildcop's gcf-utils version.

I think making the change within gcf-utils is the better way to go since buildcop won't be the only bot that will want to parse the owner/repo name from the GitHub payload. There was also some confusing flows for scheduler payloads vs. pubsub payloads that needed fixing, so this was a good opportunity!

azizsonawalla commented 4 years ago

Confirmed via logs that issue is now fixed

tbpg commented 4 years ago

Thank you!