godfrzero opened this issue 5 years ago
@godfrzero I'd like to confirm that all the monitors have completed their runs today with the temporary fix in place :)
Awesome. I'm leaving things the way they are now, and I'll think about getting this out as mainstream behavior. If you notice any deterioration in performance down the line, please let me know.
@godfrzero Had a few monitors time out again. I checked the responses on those, and one request took longer than usual (46 seconds vs the usual ~30 seconds); however, given the total run time on my local machine (~3m 30s), that still shouldn't have resulted in a timeout. Could you possibly allocate a little more resources to ensure this doesn't happen?
Did the runs in question time out and get automatically re-attempted by the platform?
@godfrzero Not sure how I would check that; however, there were 2 timeout errors at the end, so I'm assuming that means 2 runs were attempted and failed.
If a run was re-attempted, the first log entry would be the error that caused the first attempt to fail, followed by the log of the complete run.
Also, is it necessary that all of these monitors run at the same time? If they can be split into batches (say, 3 of them), with each batch running 5 minutes after the previous one, that would also avoid this bottleneck. If you edit the monitor via the API, you can set a monitor to run at a time specified by any valid CRON pattern. So, for example, you can run at 5 minutes past 9 instead of at 9.
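To make the batching idea concrete, here is a small sketch (my own illustration, not part of the Postman API; the 6-field, seconds-first cron layout matches the pattern used elsewhere in this thread) that generates staggered weekday schedules:

```python
# Generate staggered 6-field cron patterns (sec min hour dom mon dow),
# one per batch, each starting `step_minutes` after the previous batch.
# This only builds the pattern strings; applying them to monitors would
# still be done through the Postman API.

def staggered_crons(start_hour, start_minute, batches, step_minutes=5, days="MON-FRI"):
    patterns = []
    for i in range(batches):
        total = start_minute + i * step_minutes
        hour = start_hour + total // 60
        minute = total % 60
        patterns.append(f"0 {minute} {hour} * * {days}")
    return patterns

# Three batches: 9:00, 9:05 and 9:10 on weekdays.
print(staggered_crons(9, 0, 3))
# ['0 0 9 * * MON-FRI', '0 5 9 * * MON-FRI', '0 10 9 * * MON-FRI']
```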
@godfrzero Looking at the logs, there are only two errors, at the very end:
| Line | Time | Message |
| -- | -- | -- |
| 2165 | 9:05:16 | Error: callback timed out |
| 2166 | 9:05:16 | Error: callback timed out |
I also tried to change the schedule, but I am getting this error:
{
  "error": {
    "name": "cronPatternNotAllowedError",
    "message": "The specified cron pattern is not allowed. Please check https://monitor.getpostman.com for the allowed schedules.",
    "details": {
      "pattern": "0 30 8 * * MON-FRI"
    }
  }
}
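The rejection is consistent with a server-side whitelist that only accepted hour-granularity schedules. A purely hypothetical sketch of such a gate (Postman's real validation code is not public; the function and flag names here are invented for illustration):

```python
# Hypothetical sketch of a schedule check: before the relaxation, only
# hour-granularity patterns were accepted, so a minute field like "30"
# would trigger cronPatternNotAllowedError.

def pattern_allowed(cron_pattern, allow_specific_minute=False):
    # 6-field pattern: seconds minutes hours day-of-month month day-of-week
    _seconds, minutes, _rest = cron_pattern.split(maxsplit=2)
    if allow_specific_minute:
        return True
    return minutes in ("0", "*")

assert not pattern_allowed("0 30 8 * * MON-FRI")    # rejected before the fix
assert pattern_allowed("0 30 8 * * MON-FRI", True)  # accepted afterwards
```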
If I were able to set the runs at 5-minute intervals (8:20 - 9:00) Mon-Fri, I could separate all the calls into 5 monitor batches, which would actually be nicer for us as well: we'd see each environment complete one by one in Slack rather than a dump of all the runs at once.
Ah, we've got some strict checks on the public API. I'll push an update to the production Postman API in a couple of days which will relax this and allow this pattern to be used when creating/updating a monitor.
@VicKetchup Can you try now? The pattern should now accept a specific minute.
@godfrzero Hey, just came back from holidays, tried the pattern that was failing before and it worked, thanks!
@godfrzero @ArjunSingh-PM I was making some performance improvements to our collections to ensure the monitors don't time out (as we started having issues again), and after changes that clearly reduce test execution time, I ran the monitors and still got timeouts. Looking at the timestamps, I noticed that on the runs where timeouts occur, there is a delay of a few seconds between requests (up to 6 seconds from what I've seen).
Shall I raise a separate issue for this or does this cover it?
Here is an example of what I am seeing:
@VicKetchup Thank you for bringing this up. Could you please raise a separate ticket for this?
@godfrzero We are facing issues with Monitors again (not sure exactly when it started). The collection executes in around 3 minutes (on a 2014 MacBook Pro), but the monitors time out, sometimes without even getting halfway through :(
Is your feature request related to a problem? Please describe.
A collection executing during a monitor run can exhibit various performance characteristics: it may be CPU, memory, or network intensive.
This can lead to problems where a CPU intensive run executing on a host with limited CPU performs poorly. In other cases, the run will be allocated to a relatively "free" host and the performance will be much better. In some cases, this can cause some monitor runs to time out and others to succeed.
Describe the solution you'd like
The overall system would benefit if runs were allocated to hosts taking performance characteristics into consideration. For example, a host executing a CPU-intensive run would not accept additional CPU-intensive runs. Alternatively, regional clusters could also be split up based on CPU/Memory/Network availability.
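A minimal sketch of what such load-aware allocation could look like (hypothetical host names and saturation threshold; this illustrates the idea, not Postman's scheduler):

```python
# Pick a host for a new monitor run, avoiding hosts that are already
# saturated with CPU-intensive runs. `hosts` maps host name -> number of
# CPU-intensive runs currently executing there.

def pick_host(hosts, cpu_intensive, max_cpu_heavy=2):
    if not cpu_intensive:
        return min(hosts, key=hosts.get)  # any lightly loaded host will do
    candidates = {h: n for h, n in hosts.items() if n < max_cpu_heavy}
    if not candidates:
        return None  # queue the run rather than overload a host
    return min(candidates, key=candidates.get)

hosts = {"us-east-1a": 2, "us-east-1b": 0, "us-east-1c": 1}
print(pick_host(hosts, cpu_intensive=True))  # us-east-1b
```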
Describe alternatives you've considered