Open baltpeter opened 6 days ago
Solving the timeouts in TrackHAR was easy enough at least by make use of chunking: https://github.com/tweaselORG/TrackHAR/pull/90
I was able to make one more "optimization": We previously kept the "Only include requests that are made to the same endpointUrl by apps from at least two different vendors" rule as-is for websites, but using the hostname here.
That was perhaps a bit generous. This way, it was enough for an endpointUrl
to be accessed by www.example.org
and login.example.org
to be included. I have instead changed it to require requests from ten different hosts for websites now (https://github.com/tweaselORG/data.tweasel.org/commit/98321338f89c60a55c3d7471ae46586025cf6497). Apps still only need two different vendors.
With this, the database has shrunk to 7.9 GB and we now have 1,912,743 requests with 1,592,854 of those being from websites.
This has helped a little but the performance is still not great. And quite a few more complex queries, including ones on the homepage, are still timing out.
I fear that there isn't much more that we can do in terms of optimizations. I really see only two options:
What do you think, @zner0L?
Another random thought I just had, inspired by accidentally having two different databases in my local instance: Datasette can handle multiple databases.
We could have separate databases for web and apps.
But upon thinking about this a little more, it probably isn't a good idea:
I tried running the server in immutable mode as suggested by documentation and it did improve the performance a bit, even though it still take quite long. I think part of this is also that some queries are just inefficient.
We can try to squeeze more performance out of this instance with better queries and optimized data storage. However, SQLite is just not built for this amount of data. Right now we are using a pretty inefficient software stack to handle too big of a dataset. I really don’t like solving these kinds of problems with better hardware…
To run in immutable mode while using directory configuration mode, datasette needs the database mentioned in inspect-data.json
in the config directory. However there is an inconsistency in datasette when comparing database paths (https://github.com/simonw/datasette/issues/2433) which is why I couldn’t get this to work at first: The file
property needs to contain just the filename, not the path.
This improves the performance a bit and allows for the database to be downloaded. However, it introduces the problem where to generate and store the inspect-data.json
. Do we write a script to generate it on each upload? Do we generate it locally and upload it into version control? I am interested in your thoughts, @baltpeter.
Oh, interesting. I also ran into the issue of not being able to combine configuration directory mode and immutable mode when I initially set up Datasette but didn't find the trick of using inspect-data.json
. I also didn't investigate further because from reading Simon's blog, I know that there will be a pretty major overhaul of the whole config system in the upcoming Datasette 1.0.
That being said, from my understanding after reading the docs, inspect-data.json
should provide some performance benefits in addition to the ones from immutable mode even, so I guess it might be worth it to implement that already.
Since our deployment method for the database is currently "copy and run this scp
command", I guess adding a second command for generating inspect-data.json
wouldn't be too bad.
Now that I have imported the data from our first traffic collection on the web (https://github.com/tweaselORG/experiments/issues/3), the Datasette instance has gotten a lot more sluggish.
That shouldn't come as too much of a surprise. We have jumped from 323,116 rows in the
requests
table to 2,614,200 and the database has grown to more than 10 GB (from a little over 1 GB iirc).Probably most critically, I have had the
debug-adapter
tool in TrackHAR fail due to timeouts quite regularly since I have imported the data, and that is after I had already increased the timeout to 50s (https://github.com/tweaselORG/data.tweasel.org/commit/08bbf1907eb6f0ad1229b3bf67987181baa25791#diff-92a81eca627ad137f98388e8aeae273e0a4069fde8995897ab138b5fee849cdcR3).