godfrzero opened this issue 5 years ago
@godfrzero I'd like to confirm that all the monitors have completed their runs today with the temporary fix in place :)
Awesome. I'm leaving things the way they are now, and I'll think about getting this out as mainstream behavior. If you notice any deterioration in performance down the line, please let me know.
@godfrzero Had a few monitors time out again. I checked the responses on those, and one request took longer than usual (46 seconds vs the usual ~30 seconds); however, given the total run time on my local machine (~3m 30s), that still shouldn't have resulted in a timeout. Could you possibly allocate a little more resources to ensure this doesn't happen?
Did the runs in question time out and get automatically re-attempted by the platform?
@godfrzero Not sure how I would check that; however, there were 2 timeout errors at the end, so I'm assuming that means 2 runs were attempted and failed.
If a run was re-attempted, the first log entry would be the error that caused the first attempt to fail, followed by the log of the complete run.
Also, is it necessary that all of these monitors run at the same time? If they can be split into batches (say, 3 of them), with each batch running 5 minutes after the previous one, that would also avoid this bottleneck. If you edit the monitor via the API, you can set a monitor to run at a time specified by any valid CRON pattern. So, for example, you can run at 5 minutes past 9 instead of at 9.
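To make the batching idea concrete, here is a small sketch (my own illustration, not part of the Postman API; the 6-field, seconds-first cron layout matches the pattern used elsewhere in this thread) that generates staggered weekday schedules:

```python
# Generate staggered 6-field cron patterns (sec min hour dom mon dow),
# one per batch, each starting `step_minutes` after the previous batch.
# This only builds the pattern strings; applying them to monitors would
# still be done through the Postman API.

def staggered_crons(start_hour, start_minute, batches, step_minutes=5, days="MON-FRI"):
    patterns = []
    for i in range(batches):
        total = start_minute + i * step_minutes
        hour = start_hour + total // 60
        minute = total % 60
        patterns.append(f"0 {minute} {hour} * * {days}")
    return patterns

# Three batches: 9:00, 9:05 and 9:10 on weekdays.
print(staggered_crons(9, 0, 3))
# ['0 0 9 * * MON-FRI', '0 5 9 * * MON-FRI', '0 10 9 * * MON-FRI']
```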
@godfrzero Looking at the logs, there are only two errors, at the very end:
| Line | Time | Message |
| -- | -- | -- |
| 2165 | 9:05:16 | Error: callback timed out |
| 2166 | 9:05:16 | Error: callback timed out |
I also tried to change the schedule, but I am getting this error:
{
  "error": {
    "name": "cronPatternNotAllowedError",
    "message": "The specified cron pattern is not allowed. Please check https://monitor.getpostman.com for the allowed schedules.",
    "details": {
      "pattern": "0 30 8 * * MON-FRI"
    }
  }
}
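The rejection is consistent with a server-side whitelist that only accepted hour-granularity schedules. A purely hypothetical sketch of such a gate (Postman's real validation code is not public; the function and flag names here are invented for illustration):

```python
# Hypothetical sketch of a schedule check: before the relaxation, only
# hour-granularity patterns were accepted, so a minute field like "30"
# would trigger cronPatternNotAllowedError.

def pattern_allowed(cron_pattern, allow_specific_minute=False):
    # 6-field pattern: seconds minutes hours day-of-month month day-of-week
    _seconds, minutes, _rest = cron_pattern.split(maxsplit=2)
    if allow_specific_minute:
        return True
    return minutes in ("0", "*")

assert not pattern_allowed("0 30 8 * * MON-FRI")    # rejected before the fix
assert pattern_allowed("0 30 8 * * MON-FRI", True)  # accepted afterwards
```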
If I were able to set the runs at 5-minute intervals (8:20 - 9:00) Mon-Fri, I could separate all the calls into 5 monitor batches, which would actually be nicer for us as well: we'd see each environment complete one by one in Slack rather than a dump of all the runs at once.
Ah, we've got some strict checks on the public API. I'll push an update to the production Postman API in a couple of days which will relax this and allow this pattern to be used when creating/updating a monitor.
@VicKetchup Can you try now? The pattern should now accept a specific minute.
@godfrzero Hey, just came back from holidays, tried the pattern that was failing before and it worked, thanks!
@godfrzero @ArjunSingh-PM I was making some performance improvements to our collections to ensure the monitors don't time out (as we started having issues again), and after changes that clearly reduce test execution time, I ran the monitors and still got timeouts. Looking at the timestamps, I noticed that on the runs where timeouts occur, there is a delay of a few seconds between requests (up to 6 seconds from what I've seen).
Shall I raise a separate issue for this or does this cover it?
Here is an example of what I am seeing:
@VicKetchup Thank you for bringing this up. Could you please raise a separate ticket for this?
@godfrzero We are facing issues with Monitors again (not sure exactly when it started). The collection executes in around 3 minutes (on a 2014 MacBook Pro), but the monitors time out, sometimes without even getting halfway through :(
Is your feature request related to a problem? Please describe.
A collection executing during a monitor run can exhibit various performance characteristics: it may be CPU, memory, or network intensive.
This can lead to problems where a CPU intensive run executing on a host with limited CPU performs poorly. In other cases, the run will be allocated to a relatively "free" host and the performance will be much better. In some cases, this can cause some monitor runs to time out and others to succeed.
Describe the solution you'd like
The overall system would benefit if runs were allocated to hosts taking performance characteristics into consideration. For example, a host executing a CPU-intensive run would not accept additional CPU-intensive runs. Alternatively, regional clusters could also be split up based on CPU/Memory/Network availability.
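A minimal sketch of what such load-aware allocation could look like (hypothetical host names and saturation threshold; this illustrates the idea, not Postman's scheduler):

```python
# Pick a host for a new monitor run, avoiding hosts that are already
# saturated with CPU-intensive runs. `hosts` maps host name -> number of
# CPU-intensive runs currently executing there.

def pick_host(hosts, cpu_intensive, max_cpu_heavy=2):
    if not cpu_intensive:
        return min(hosts, key=hosts.get)  # any lightly loaded host will do
    candidates = {h: n for h, n in hosts.items() if n < max_cpu_heavy}
    if not candidates:
        return None  # queue the run rather than overload a host
    return min(candidates, key=candidates.get)

hosts = {"us-east-1a": 2, "us-east-1b": 0, "us-east-1c": 1}
print(pick_host(hosts, cpu_intensive=True))  # us-east-1b
```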
Describe alternatives you've considered