Requests with high data load are failing

aidanm1999 commented 9 months ago

Describe the Bug

When you request a period of a week, month, or year on some websites, the requests fail. See the screenshots below for one of our websites with a query for "All time" and "24th of Jan 2024"

Failing request umami-load-fail

Successful request umami-load-success

This is a fairly persistent issue, meaning we can only request very small data samples, or we have to make API calls directly to the database.

The reason why this happens is that Vercel and Netlify both have small timeouts for cloud functions (around 20 seconds I believe Vercel source and Netlify source), which means you cannot complete the request. The screenshot below shows the slowest queries over the past 24 hours in Planetscale:

As you can see some of the queries take over 6 minutes, which means there is no time for the cloud function to:

Make the call
(await the response)
Parse the response
And send to the client

This is when it usually times out. The first step we took to mitigate this issue was to move the Netlify and Vercel cloud functions to the same AWS datacentre as the Planetscale database, slightly increasing performance.

A suggestion might be to move the web app and database to a cloud server so there are no timeout constraints, however, it would be better if our clients receive the result in under 20 seconds.

So this leads to how the Umami cloud function requests and parses data. It should be investigated how to optimise these queries, with differing strategies. One potential solution could be to create a rollup table where a function periodically fetches data from the previous hour or day, greatly reducing the quantity of data retrieved. Another potential solution could be improving the indexing in the database as @pottiepierre suggests in #2412.

Database

MySQL

Relevant log output

No response

Which browser are you using? (if relevant)

N/A

How are you deploying your application? (if relevant)

Vercel & Netlify with PlanetScale

aidanm1999 commented 9 months ago

Temporary improvement

If you are using Vercel and are on a Pro/Enterprise account, you can have up to 5 mins/300 seconds of serverless function execution. Add below to a vercel.json file and push up to your repo:

{
  "functions": {
    "src/pages/api/**/*": {
      "maxDuration": 300
    }
  }
}

franciscao633 commented 9 months ago

This seems to be a issue with our self-hosted DB options (psql/mysql) and higher traffic websites. Probably need to revisit the indexing to see if there can be improvements there. For rolling data into hours, days, etc we looked into timescaleDB as a psql extension, but I guess there should be a way to implement materialized views in our current options. Also we are looking to add ClickhouseDB support to our self-hosted options. We use the same database for our cloud product and it has no problems querying tables with billions of records.

yuanfanzz commented 9 months ago

the requests fail

I have over 100K visits every day, and it takes tens of seconds to open the weekly report (on a 4 cores Ryzen 7950x server). Shows errors like yours too. Therefore, I had to stop using Umami because it writes millions of records in the database and does not regularly clean up and summarize them.

aidanm1999 commented 9 months ago

@franciscao633 is this triaged or planned for development? Would be good to know a schedule of release for this bug fix, thanks!

aidanm1999 commented 8 months ago

Planetscale released a new feature today for database insights. Check out below.

All recommendations

Screenshot 2024-02-29 at 10 40 36

More info on the first recommendation

Screenshot 2024-02-29 at 10 48 44

Interesting that it believes there are so many redundant indexes, worth investigating if this is also impacting database memory and performance. @franciscao633 please let me know if you want me to share more details of each recommendation.

franciscao633 commented 8 months ago

Interesting. In terms of app performance I'm mostly interested about indexing on session / website_event. Any noticable improvements on your end from the changes? I agree with @yuanfanzz, in that we won't see huge performance leaps until we start aggregating the data and running queries on those instead of the entire table. Also finally getting around to supporting ClickhouseDB for those who want to self-host high traffic websites.

aidanm1999 commented 8 months ago

Not had a chance yet to make a dump of the database, apply and compare the changes but hopefully later in the week my schedule will clear and I can give you a breakdown of before and after.

Data aggregation is a good idea, there are pros and cons for it and lots of open question for consideration around cost of background processes, data increase size, data loss on query, etc...

aidanm1999 commented 8 months ago

Managed to get round to some light tests, TL;DR they are better, but not definitive. I made two branch dumps on PlanetScale and applied the recommendations listed above to one of them, built and served the site from my machine and queried the page view endpoint 6 times for the applied recommendations and for the original (current) database schema.

Endpoint

http://localhost:3000/api/websites/[WEBSITE_ID]/pageviews?startAt=1709856000000&endAt=1710460799999&unit=day&timezone=Europe%2FLondon

Results

Current migration	Recommendations applied
13.89s	12.89s
18.18s	12.79s
12.37s	12.87s
19.22s	12.91s
9.57s	12.84s
13.3s	13.22s

Note: page view result data for size comparison

```json { "pageviews": [ { "x": "2024-03-08", "y": 2621 }, { "x": "2024-03-09", "y": 1581 }, { "x": "2024-03-10", "y": 1623 }, { "x": "2024-03-11", "y": 3153 }, { "x": "2024-03-12", "y": 3037 }, { "x": "2024-03-13", "y": 2652 }, { "x": "2024-03-14", "y": 2809 } ], "sessions": [ { "x": "2024-03-08", "y": 887 }, { "x": "2024-03-09", "y": 629 }, { "x": "2024-03-10", "y": 646 }, { "x": "2024-03-11", "y": 964 }, { "x": "2024-03-12", "y": 910 }, { "x": "2024-03-13", "y": 896 }, { "x": "2024-03-14", "y": 883 } ] } ```

Conclusion

On average the current migration took 14.42s and with the recommendations the average query took 12.92s.

Other notes

I did however test the difference between using a specific AWS endpoint on PlanetScale (eu-west.connect.psdb.cloud) vs the auto-resolver (aws.connect.psdb.cloud) and it increased speeds from ~22s to the figures shown above. Not important for most users but could cause some performance improvements for PlanetScale users.

github-actions[bot] commented 6 months ago

This issue is stale because it has been open for 60 days with no activity.

aidanm1999 commented 5 months ago

Not stale

dennishendricks commented 4 months ago

Aggregated data for quick queries would be excellent!

@franciscao633 In the meantime: Is a timeout set for the queries in the frontend of Umami? I couldn't find anything like that in the source code, but I can execute a query via API for 130 seconds without any problems, while in the frontend the connection is terminated after only 10 seconds with a 500 error (but unfortunately without any further information about why the status 500 is returned).

github-actions[bot] commented 2 months ago

This issue is stale because it has been open for 60 days with no activity.

aidanm1999 commented 2 months ago

Not stale

github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open for 60 days with no activity.

m-primo commented 2 weeks ago

not stale

umami-software / umami