microsoft / onefuzz

A self-hosted Fuzzing-As-A-Service platform
MIT License

Move timer functions to durable functions #2326

Open nharper285 opened 2 years ago

nharper285 commented 2 years ago

In moments of high load, the OneFuzz Service quickly hits its quota for Azure requests and begins outputting the following exceptions:

Exception while executing function: Functions.agent_can_schedule Result: Failure
Exception: HttpResponseError: (OperationNotAllowed) The server rejected the request because too many requests have been received for this subscription.
Code: OperationNotAllowed
Message: The server rejected the request because too many requests have been received for this subscription.

And then:

Timeout value of 00:15:00 was exceeded by function: Functions.agent_can_schedule

As far as we can tell, these exceptions build up, but the service eventually recovers.

Creating this issue to track a better way to control our requests and avoid these quota limits.
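
For illustration only, one common client-side mitigation for this kind of throttling is to retry the rejected call with exponential backoff and jitter. The sketch below is not the durable-functions change this issue proposes and is not taken from the OneFuzz code; `ThrottledError` and `call_with_backoff` are hypothetical names, and in the real service the caught exception would presumably be the `HttpResponseError` shown in the log above.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error such as OperationNotAllowed."""


def call_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call func(), retrying with exponential backoff and jitter when throttled."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            # Full jitter: wait a random time up to the capped exponential delay,
            # so that many throttled callers do not all retry at the same moment.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
```

A caller would wrap each Azure request, for example `call_with_backoff(lambda: do_request())`, where `do_request` is again a placeholder. Backoff only spreads retries out over time; it does not reduce the total number of requests, so it would complement rather than replace a change to how the timer functions are scheduled.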

AB#35878

tevoinea commented 2 years ago

Do you know if these exceptions were being hit when communicating with a storage account? Or was it something else like scalesets/keyvault/etc.? Essentially I'm trying to understand if the data model refactor is going to solve this issue for us "for free"

nharper285 commented 2 years ago

> Do you know if these exceptions were being hit when communicating with a storage account? Or was it something else like scalesets/keyvault/etc.? Essentially I'm trying to understand if the data model refactor is going to solve this issue for us "for free"

I'm not sure. Yesterday, I was seeing several of our functions hit these exceptions: jobs, containers, agent_can_schedule, agent_commands, queue_file_changes, etc.