tweaselORG / data.tweasel.org

Scripts for hosting and configuring the Tweasel open data Datasette instance.
https://data.tweasel.org/

Data from web traffic collection makes the Datasette instance a lot slower #3

Open baltpeter opened 6 days ago

baltpeter commented 6 days ago

Now that I have imported the data from our first traffic collection on the web (https://github.com/tweaselORG/experiments/issues/3), the Datasette instance has gotten a lot more sluggish.

That shouldn't come as too much of a surprise. We have jumped from 323,116 rows in the requests table to 2,614,200 and the database has grown to more than 10 GB (from a little over 1 GB iirc).

Probably most critically, I have had the debug-adapter tool in TrackHAR fail due to timeouts quite regularly since I have imported the data, and that is after I had already increased the timeout to 50s (https://github.com/tweaselORG/data.tweasel.org/commit/08bbf1907eb6f0ad1229b3bf67987181baa25791#diff-92a81eca627ad137f98388e8aeae273e0a4069fde8995897ab138b5fee849cdcR3).
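For reference, the time limit in question is Datasette's `sql_time_limit_ms` setting; in configuration directory mode it lives in `settings.json`. A minimal fragment raising it to the 50 s mentioned above would look like this (assuming the config-directory layout):

```json
{
  "sql_time_limit_ms": 50000
}
```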

baltpeter commented 6 days ago

Solving the timeouts in TrackHAR, at least, was easy enough by making use of chunking: https://github.com/tweaselORG/TrackHAR/pull/90
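The idea behind that fix can be sketched like this (hypothetical names, not TrackHAR's actual code): instead of sending one huge query, split the inputs into chunks and run one small query per chunk, so that each individual query stays well under the server's time limit.

```python
import sqlite3

def chunked(items, size):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i : i + size]

def fetch_requests(conn, endpoint_urls, chunk_size=500):
    """Query the requests table in chunks so no single query hits the time limit."""
    rows = []
    for chunk in chunked(endpoint_urls, chunk_size):
        placeholders = ",".join("?" * len(chunk))
        rows.extend(
            conn.execute(
                f"SELECT * FROM requests WHERE endpointUrl IN ({placeholders})",
                chunk,
            ).fetchall()
        )
    return rows

# Tiny demo against an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (endpointUrl TEXT)")
conn.executemany(
    "INSERT INTO requests VALUES (?)",
    [(f"https://e{i}.example",) for i in range(10)],
)
urls = [f"https://e{i}.example" for i in range(10)]
print(len(fetch_requests(conn, urls, chunk_size=3)))  # 10
```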

baltpeter commented 6 days ago

I was able to make one more "optimization": We previously applied the "Only include requests that are made to the same endpointUrl by apps from at least two different vendors" rule as-is to websites, just using the hostname in place of the vendor.

That was perhaps a bit generous: this way, it was enough for an endpointUrl to be accessed by www.example.org and login.example.org to be included. I have instead changed it so that, for websites, an endpointUrl now needs requests from at least ten different hosts (https://github.com/tweaselORG/data.tweasel.org/commit/98321338f89c60a55c3d7471ae46586025cf6497). Apps still only need two different vendors.
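Expressed as SQL, the new rule amounts to grouping by endpointUrl and counting distinct requesting hosts. Here is a sketch against a toy schema (column names are illustrative; the real table layout may differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (endpointUrl TEXT, initiator TEXT)")
conn.executemany(
    "INSERT INTO requests VALUES (?, ?)",
    # Two subdomains of one site hitting the same endpoint: no longer enough.
    [("https://tracker.example/collect", f"sub{i}.example.org") for i in range(2)]
    # An endpoint contacted from ten distinct hosts: kept.
    + [("https://cdn.example/beacon", f"site{i}.example") for i in range(10)],
)

# Keep only endpointUrls contacted by at least ten distinct hosts.
kept = [
    row[0]
    for row in conn.execute(
        """
        SELECT endpointUrl FROM requests
        GROUP BY endpointUrl
        HAVING COUNT(DISTINCT initiator) >= 10
        """
    )
]
print(kept)  # ['https://cdn.example/beacon']
```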

With this, the database has shrunk to 7.9 GB and we now have 1,912,743 requests with 1,592,854 of those being from websites.

baltpeter commented 6 days ago

This has helped a little but the performance is still not great. And quite a few more complex queries, including ones on the homepage, are still timing out.

I fear that there isn't much more that we can do in terms of optimizations. I really see only two options:

  1. Pay for a more powerful server.
  2. Decide that using the top 10k on the web was a bit ambitious. Since the CrUX data is only published in buckets, this would mean that we would have to scale down to the top 1k instead.

What do you think, @zner0L?

baltpeter commented 6 days ago

Another random thought I just had, inspired by accidentally having two different databases in my local instance: Datasette can handle multiple databases.

(Screenshot: a local Datasette instance listing two separate databases.)

We could have separate databases for web and apps.

But upon thinking about this a little more, it probably isn't a good idea:

  1. The web database would still be sluggish.
  2. For many things, we will explicitly want to work with the data from both realms.

zner0L commented 6 days ago

I tried running the server in immutable mode, as suggested by the documentation, and it did improve performance a bit, even though queries still take quite a long time. I think part of this is also that some queries are just inefficient.

We can try to squeeze more performance out of this instance with better queries and optimized data storage. However, SQLite is just not built for this amount of data. Right now, we are using a pretty inefficient software stack to handle too large a dataset. I really don’t like solving these kinds of problems with better hardware…

zner0L commented 6 days ago

To run in immutable mode while using configuration directory mode, Datasette needs the database mentioned in inspect-data.json in the config directory. However, there is an inconsistency in how Datasette compares database paths (https://github.com/simonw/datasette/issues/2433), which is why I couldn’t get this to work at first: the file property needs to contain just the filename, not the path.
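For illustration, an inspect-data.json in the config directory might look roughly like this (the database name, hash, and numbers here are made up, and the structure is only what `datasette inspect` produced for me; note that `file` contains just the bare filename):

```json
{
  "data": {
    "file": "data.db",
    "hash": "…",
    "size": 8482560000,
    "tables": {
      "requests": { "count": 1912743 }
    }
  }
}
```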

This improves performance a bit and allows the database to be downloaded. However, it introduces the question of where to generate and store inspect-data.json. Do we write a script to generate it on each upload? Do we generate it locally and commit it to version control? I am interested in your thoughts, @baltpeter.

baltpeter commented 6 days ago

Oh, interesting. I also ran into the issue of not being able to combine configuration directory mode and immutable mode when I initially set up Datasette, but didn't find the trick of using inspect-data.json. I also didn't investigate further because, from reading Simon's blog, I know that the whole config system will get a pretty major overhaul in the upcoming Datasette 1.0.

That being said, from my understanding after reading the docs, inspect-data.json should provide some performance benefits even on top of those from immutable mode, so I guess it might be worth implementing that already.

Since our deployment method for the database is currently "copy and run this scp command", adding a second command for generating inspect-data.json wouldn't be too bad.