google / clusterfuzz

Scalable fuzzing infrastructure.
https://google.github.io/clusterfuzz
Apache License 2.0

Error reporting: deduplicate by bot name #4359

Open oliverchang opened 2 weeks ago

oliverchang commented 2 weeks ago

Sometimes, a single machine can cause an error to bubble up to the top of our error reporting dashboard.

For example, https://pantheon.corp.google.com/errors/detail/CNTsq_Sb7qfXSw;locations=global?e=-13802955&inv=1&invt=Abf8Rw&mods=logs_tg_prod&project=clusterfuzz-external happens 100k+ times a day, but every occurrence comes from a single bot with clock skew issues.

We should investigate whether there is a way to reduce this noise by deduplicating error reporting entries by machine / origin.
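To make the idea concrete, here is a minimal sketch of what ranking by origin could look like. It assumes error entries are available as `(error_group, bot_name)` pairs; that input shape and the `summarize_by_origin` name are hypothetical, not part of ClusterFuzz or any GCP API. Each group is ranked by how many distinct bots reported it, so one noisy bot cannot push an error to the top on volume alone.

```python
from collections import defaultdict


def summarize_by_origin(entries):
    """Rank error groups by distinct reporting bots, then by raw count.

    `entries` is an iterable of (error_group, bot_name) pairs. This shape
    is an assumption made for illustration only.
    """
    counts = defaultdict(int)
    bots = defaultdict(set)
    for group, bot in entries:
        counts[group] += 1
        bots[group].add(bot)

    # Each row is (error_group, distinct_bots, total_occurrences).
    rows = [(group, len(bots[group]), counts[group]) for group in counts]
    # Sort by breadth first so a single-bot flood sinks below wider errors.
    return sorted(rows, key=lambda row: (row[1], row[2]), reverse=True)
```

With this ordering, an error seen twice on two different bots ranks above an error seen 1000 times on a single bot.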

oliverchang commented 2 weeks ago

@vitorguidi @alhijazi any thoughts here?

vitorguidi commented 1 week ago

It does not seem like we have that flexibility: the knobs for deduplication are not exposed to the end user in GCP (ref). Error reporting takes exception and stack trace info into account when grouping.

I opened a YAQS for the GCP logging folks to see whether there is anything we can explore that is not evident in the documentation.

jonathanmetzman commented 1 week ago

How about we put something on the bots to keep a single bot from polluting error reporting? Some options:

  1. Exit after hitting a certain number of errors. On Linux the container will be restarted. On Windows we can reboot.
  2. Rate-limit error reporting.
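Both options could be combined in one small bot-side helper. The following is only a sketch under stated assumptions: the `ErrorThrottle` class, its parameters, and the default thresholds are hypothetical and do not exist in ClusterFuzz. It uses a fixed-window counter for rate limiting and a lifetime counter for the exit decision, and it leaves the actual exit or reboot to the caller.

```python
import time


class ErrorThrottle:
    """Decides whether a bot should report, suppress, or exit on an error.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """

    def __init__(self, max_per_window=10, window_seconds=60.0,
                 max_total=1000, clock=time.monotonic):
        self._max_per_window = max_per_window
        self._window_seconds = window_seconds
        self._max_total = max_total
        # time.monotonic is not affected by system clock adjustments,
        # which matters for a bot suffering from clock skew.
        self._clock = clock
        self._window_start = clock()
        self._window_count = 0
        self._total = 0

    def record(self):
        """Record one error and return 'report', 'suppress', or 'exit'."""
        self._total += 1
        if self._total >= self._max_total:
            # Option 1: too many errors overall, the caller should exit
            # so the container restarts (or the machine reboots).
            return 'exit'

        now = self._clock()
        if now - self._window_start >= self._window_seconds:
            # Start a fresh rate-limiting window.
            self._window_start = now
            self._window_count = 0

        self._window_count += 1
        if self._window_count > self._max_per_window:
            # Option 2: over the per-window budget, skip reporting.
            return 'suppress'
        return 'report'
```

A caller would check the return value before sending an error to the reporting backend, and trigger whatever exit or reboot path the platform uses when it sees `'exit'`.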