tobytwigger / laravel-job-status

The only Laravel job debugging tool you'll ever need
https://tobytwigger.github.io/laravel-job-status
MIT License
19 stars · 1 fork

[API] Reduce/Optimize query for `/jobs` list #63

Closed · shutupflanders closed 1 year ago

shutupflanders commented 1 year ago

The jobs endpoint needs optimizing and its response size reducing: after a day's worth of jobs (300 or so), the endpoint hangs for ~30s, and because of the polling nature of the dashboard, requests stack up.

Suggest looking at cancelling previous calls (I believe Axios lets you do this easily) and creating a new "slim resource" for the API response (we don't need all that data on the job list).

shutupflanders commented 1 year ago

Happy to take a look at the API side as I've got some time free today, forked and branched - will update when done

tobytwigger commented 1 year ago

Morning. That's definitely problematic - a 'slim resource' is an excellent idea! I'll look into cancelling previous calls over the weekend.

My initial idea with a 'slim resource', taking inspiration from GraphQL, was to pass a list of the properties you wanted returned. Since this API is open for developers to use on their own sites, I figured that would give the most flexibility and avoid forcing users to use the full request.

If you'd rather just get it working for now, I'll happily accept a PR for a 'slim resource', and then I can add specifying which properties to return over the next couple of weeks as I finish this package off :)

tobytwigger commented 1 year ago

Just merged https://github.com/tobytwigger/laravel-job-status-js/pull/2, which pauses polling whilst a request is in flight and therefore stops requests from stacking up.

tobytwigger commented 1 year ago

Hi, I've just finished a few optimisations, so I thought I'd write up the progress. I think I've laid out the way to reduce the API response time that you suggested; I'll find time over the next couple of days to take a look if you don't get a chance. I'd love any suggestions if anything could be improved or done differently.

Issues

Reducing the number of API requests

https://github.com/tobytwigger/laravel-job-status-js/pull/2 sorted this by only allowing one request at a time.

Reducing database queries

Using eager loading in #75, I've got the number of queries down to 8 no matter how many jobs you have. The query time is therefore well under a second, despite no other optimisations having been done, so it's much more manageable.
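
For reference, the shape of that change is roughly this (a sketch only - JobStatusModel stands in for the package's Eloquent model, and the relation names are illustrative rather than the real API):

// Sketch only: JobStatusModel and the relation names are illustrative.
$runs = JobStatusModel::query()
    ->with(['messages', 'statuses']) // one extra query per relation, total
    ->orderByDesc('created_at')
    ->get(); // query count stays fixed no matter how many jobs there are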

Reducing properties returned

As you said above, we don't need many properties for the dashboard. However, this API is open for developers to create their own frontend components to reflect the job status, so I'm keen to keep as much information available as possible.

We can have a ?properties= query parameter on every API endpoint, which determines which properties are loaded. This can be used when loading a job from the database to limit the columns selected, and when converting it into an array. Either way, it should save on compute and database calls.
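
Roughly what I have in mind - a sketch only, with an illustrative column whitelist:

// Sketch of a ?properties= filter. $request is the incoming
// Illuminate\Http\Request; the column whitelist is illustrative.
$allowed = ['uuid', 'alias', 'status', 'created_at'];
$requested = array_filter(explode(',', $request->query('properties', '')));

// Fall back to every column if no properties were asked for.
$columns = $requested === [] ? $allowed : array_values(array_intersect($allowed, $requested));

$jobs = JobStatusModel::query()->select($columns)->get();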

Reducing manipulation time

The best way I can see to do this without reducing functionality is to lean into pagination: by processing as little data as possible to fulfil each query, we get an optimised manipulation strategy without having to rework things significantly.

/runs

For loading /runs, I think we may be able to use SQL to get the list of job uuids we want to load. This will be something like: sort the jobs by created_at/id, group by uuid, and get the page using LIMIT.

select MAX(created_at) as created_at_max, uuid
from job_status_job_statuses
group by uuid
order by created_at_max desc;

This gets all the job UUIDs in dashboard order; adding a LIMIT/OFFSET then picks out the page being shown.

Then we can build those jobs up in full by retrieving their retries/messages etc. (using eager loading to keep the query count low). So the first query tells us which jobs to load, and the second loads them.

This would be the minimum amount of data we could retrieve without being restrictive, and as long as pagination is being used it should scale?

In terms of how to use it, we want the user to call this instead of get() when making a query, so we can control the select parameters and what we actually fetch. This could just be a paginateRuns method, ideally on the builder but on the model if that isn't possible. It'll set the query to select the above, group/order it (so it can really only be used with 'where' constraints), and get the right page of unique uuids. This is given to the JobStatusCollection automatically, and when we call runs it'll build up the jobs as above, assuming the given jobs only have uuids. A sketch of the idea is below.
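
A rough sketch of the two queries inside paginateRuns (method name from above; relations, defaults and the model name are illustrative):

use Illuminate\Support\Facades\DB;

// Query 1 picks the page of uuids; query 2 loads those runs in full,
// eager loading relations so the query count stays fixed per page.
public function paginateRuns(int $page, int $perPage = 10)
{
    $uuids = DB::table('job_status_job_statuses')
        ->selectRaw('MAX(created_at) as created_at_max, uuid')
        ->groupBy('uuid')
        ->orderByDesc('created_at_max')
        ->forPage($page, $perPage) // adds the LIMIT/OFFSET
        ->pluck('uuid');

    return JobStatusModel::query()
        ->whereIn('uuid', $uuids)
        ->with(['messages', 'statuses']) // relation names illustrative
        ->get();
}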

/jobs

For loading something like /jobs, we could use the same method.

select MAX(created_at) as created_at_max, alias
from job_status_job_statuses
group by alias
order by created_at_max desc;

We then have a list of jobs to load and return. Currently we load these along with all of their runs, which is too much data and never necessary: the dashboard shows the runs within a job using /runs, filtered to the job alias.

Therefore, we can load each job using custom queries that summarise the data, e.g. one query to get the count of the runs in each job grouped by status, which gives us the number failed, succeeded etc.
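
e.g. something like this (a sketch; $aliases would be the page of aliases from the query above):

use Illuminate\Support\Facades\DB;

// One aggregate query: run counts per status for the page of aliases.
$counts = DB::table('job_status_job_statuses')
    ->select('alias', 'status', DB::raw('COUNT(*) as total'))
    ->whereIn('alias', $aliases) // the page of aliases loaded above
    ->groupBy('alias', 'status')
    ->get();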

I suspect we'll need more queries here eventually (things like predictive timings for the queue), but since we're paginating, the number of queries won't grow n+1-style with the number of rows. Even if we make 10 queries per queue, that's only 100 total for a page of 10. We may also be able to cache these, or run them once for all queues and share the results with each.

The query to get the page of aliases will be created in a paginateJobs method on the builder/model. This will return a load of job statuses with only the alias filled in. On calling jobs() on the job status collection, we build up the aliases by just creating a `JobStatus\Search\Result\TrackedJob` for each and passing it the alias. When converted to an array, it'll return all the requested properties by making the additional queries or taking from a cache.

We could also just return a JobCollection, QueueCollection etc. from the paginateJobs/paginateQueues methods, and skip the middle step. This would make the whole query JobStatus::paginateQueue($page, $properties), getting a page of queues with all the data accessible while making only the minimum queries needed.
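
So usage would end up as simple as this (hypothetical, since none of it is built yet; the property names are illustrative):

// One call per page load: a page of queues with only the requested
// properties, making the minimum number of queries.
$queues = JobStatus::paginateQueue($page, ['name', 'failed', 'succeeded']);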

/queues and /batches

The same logic as /jobs can be applied.

shutupflanders commented 1 year ago

Sorry, been away all weekend.

I did make some progress on Friday with regard to reducing the amount of data returned from the API, but having to manipulate it to fit the current SPA was troublesome - I was still hitting the n+1 issue.

From what I've understood above, you're on the right track to getting things reduced and optimized - but you're still going to have n+1 problems. It might be worth creating a job_parent table to hold statistics for the jobs grouped by uuid: each time a job is run or updated, the parent record is updated, allowing a quick single-row query when calling /jobs/list. This would go against your vision of having a more "open" API, though. A rough sketch of what I mean is below.
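
Rough sketch (table, column and model names are illustrative, not anything that exists in the package):

use Illuminate\Support\Facades\DB;

// Keep a one-row-per-job summary up to date as runs are saved, so
// /jobs/list can read single rows instead of aggregating every run.
JobStatusModel::saved(function ($run) {
    DB::table('job_status_parents')->updateOrInsert(
        ['uuid' => $run->uuid],
        ['alias' => $run->alias, 'updated_at' => now()]
    );

    // Only count a status once, when it actually changes.
    if ($run->wasChanged('status')) {
        DB::table('job_status_parents')
            ->where('uuid', $run->uuid)
            ->increment($run->status . '_count'); // e.g. failed_count
    }
});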

Extensibility is a brilliant thing, don't get me wrong - but how many of your users will actually use the custom API? I don't see many people doing it with Horizon for example.

shutupflanders commented 1 year ago
SELECT
    MAX(created_at) as created_at_max,
    alias,
    MAX(class) as class,
    MAX(uuid) as uuid,
    SUM(IF(status = "succeeded", 1, 0)) as succeeded,
    SUM(IF(status = "failed", 1, 0)) as failed,
    SUM(IF(status = "queued", 1, 0)) as queued,
    SUM(IF(status = "running", 1, 0)) as running,
    SUM(IF(status = "cancelled", 1, 0)) as cancelled
FROM job_status_job_statuses
GROUP BY alias
ORDER BY created_at_max desc;

This query should do for the jobs/list page, right? It gets you everything you need and runs very quickly. The result set would need a transformation to give the SPA what it's requesting; alternatively, a new typing for the jobs dashboard could be made.

tobytwigger commented 1 year ago

In terms of keeping all the data available (on request): I totally get that the majority of people will likely use this as a quick plug-and-play job debugging/tracking dashboard. The initial motivation for this was an app I was working on that synced activities from Strava: I wanted to show how many activities were left, and to show a little 'status' for an activity, including messages, percentages etc. for the most recent run. Using the API this is possible, and I've got a few use cases like that, so as long as it's efficient enough it'd be great to keep lots of data available.

Using the ?properties= parameter should make the dashboard more efficient there - then we'll just load exactly what's needed for the dashboard.

The job_parent table isn't a bad idea if it's needed - it would definitely make it easier to select data and reduce queries. Though it would be some work to set up and maintain, which I'm keen to avoid if possible, and it could lead to more queries through keeping that table up to date.

That query looks fab for the jobs/list page - it should get everything that we need. We can add a LIMIT/OFFSET to paginate directly in the query and reduce the data we get back, so we don't need to load the whole dataset.
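
i.e. something like this (query builder sketch of your SQL above; forPage() just adds the LIMIT/OFFSET):

use Illuminate\Support\Facades\DB;

// The same aggregate query, but returning only one page of rows.
$rows = DB::table('job_status_job_statuses')
    ->selectRaw('MAX(created_at) as created_at_max, alias')
    ->selectRaw('SUM(IF(status = "succeeded", 1, 0)) as succeeded')
    ->selectRaw('SUM(IF(status = "failed", 1, 0)) as failed')
    ->groupBy('alias')
    ->orderByDesc('created_at_max')
    ->forPage($page, 10) // LIMIT 10 OFFSET (page - 1) * 10
    ->get();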

The transformation will be done in the JobStatusCollection, and since the dataset will be limited by page it shouldn't be too computationally expensive to do this.

shutupflanders commented 1 year ago

Just tested the new updates - it's gone from 30s+ to 4s, so that's a definite speed improvement! I think once we reduce the data coming back we can get it to sub-second response times.

tobytwigger commented 1 year ago

That's incredible, nice work! Do you think it'll be possible to limit the data dynamically based on a request input from what you've seen?

shutupflanders commented 1 year ago

Very possible, yes - your SPA needs to support the dynamic fields though

shutupflanders commented 1 year ago

A UI issue has arisen since last week, though: it's not printing the status count in the element any more.

tobytwigger commented 1 year ago

> Very possible, yes - your SPA needs to support the dynamic fields though

Excellent! I'll have some time this evening to try and implement that. If you have time and you've done the code for it, it'd be great if you could submit a PR for the query improvement; then I'll build off that this evening.

shutupflanders commented 1 year ago

> if you have time and you've done the code for it, it'd be great if you could submit a PR for the query improvement, then I'll build off that this evening

I meant that your code improvements from the weekend (the eager loading) shaved the time to ~4s. I'm still working on my attempt to implement the query above while still having it work with your SPA.

shutupflanders commented 1 year ago

I might implement a new endpoint for you to consume to be honest, rather than hack the old one. How's /jobs/overview sound?


This is purely to get things moving on the SPA before we start diving into the dynamic field retrieval.

tobytwigger commented 1 year ago

Without having spent too much time on it yet, I do think the proposal above will work to reduce the number of queries and the data selected. But given this is currently blocking the dashboard from working well, it's probably worth implementing, even if it's then removed once the original API is more efficient.

I've got some time this evening, so I'll look to fix up the original API then. But if you have time to put together the simpler API endpoint, I can resort to just consuming that if I'm not able to fix up the original API - meaning that by tomorrow it'll definitely work one way or another!

/jobs/overview sounds great as an endpoint :)

shutupflanders commented 1 year ago

PR #77 submitted, hopefully it's a decent enough start

tobytwigger commented 1 year ago

Hey, thanks a lot for taking the time to put in #77. I've adapted it slightly to be used as I described earlier, with the aim of optimising all the pages at once.

It's not quite as efficient as your solution, but I think it'll do for now. With 100,000 runs in the database most of the endpoints take about a second to complete, and at 300,000 they take a few seconds to complete.

This isn't ideal yet, but it should make the dashboard usable with relatively high numbers of jobs. Most of this time is database query execution time; since the query is relatively complex, it takes a while.

If you find the time, it'd be great if you could check whether it's efficient enough to be used now (all pages, not just the jobs page).

From here, we have three more optimisations that can be made

I'll try and get these done in the next week, and then hopefully it'll be really fast 🤞 But as long as it's working well enough that you can use the dashboard, I'll merge it and tackle the remaining tasks in new issues, as this one is getting long.