google / timesketch

Collaborative forensic timeline analysis
Apache License 2.0

Slow page loads for sketches with high datasource count #3075

Open mbartle-sf opened 2 months ago

mbartle-sf commented 2 months ago

Describe the bug If a sketch contains more than a few dozen datasources, requests to /api/v1/sketches/ start to slow down, because the server issues dozens of database queries to compile information about all of the datasources related to the sketch. This is exacerbated by #3052 when dozens of timelines must also be loaded and added to the response. Consider removing the datasources from the sketch response and loading them on demand instead.

To Reproduce Use the following script to produce 1000 datasources in a sketch.

from timesketch_api_client import client as timesketch_client
from timesketch_import_client import importer


def upload_n_events(sketch, n):
    """Upload n single-event batches to the 'uploads' timeline of the sketch."""
    for i in range(n):
        entry = {"message": i, "datetime": "1970-01-01T00:00:00.000Z", "timestamp_desc": "test"}
        # Use a fresh streamer for every event so that each upload is a separate import.
        with importer.ImportStreamer() as streamer:
            streamer.set_sketch(sketch)
            streamer.set_timeline_name('uploads')
            streamer.add_dict(entry)


def main():
    # Credentials and host of a local development instance.
    client = timesketch_client.TimesketchApi(host_uri='http://127.0.0.1:5000', username='dev', password='dev')
    sketch = client.get_sketch(1)
    upload_n_events(sketch, 1000)


if __name__ == "__main__":
    main()

Then attempt to load the sketch. If Postgres is on the same machine, the request to /api/v1/sketches/<id> takes a couple of seconds. If the database is on a remote server, the load time is much higher, on the order of minutes.

If you enable Postgres query logging, you can see that Timesketch issues one SELECT query per object related to the sketch, i.e., 1000 queries for 1000 datasources (plus the timeline and sketch queries).
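The query pattern described above can be illustrated in isolation. The following is a minimal sketch using the standard library's sqlite3 module and a made-up two-table schema; the table names, column names, and helper functions are all assumptions for illustration and are not Timesketch's actual models or queries. It compares one query per datasource against a single joined query that returns the same rows.

```python
import sqlite3


class CountingConnection:
    """Thin wrapper that counts how many queries are sent to the database."""

    def __init__(self, conn):
        self.conn = conn
        self.query_count = 0

    def query(self, sql, params=()):
        self.query_count += 1
        return self.conn.execute(sql, params).fetchall()


def build_demo_db(n_datasources):
    """Create an in-memory database with one timeline and n datasources.

    The schema is hypothetical and only loosely mirrors the issue's description.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE timeline (id INTEGER PRIMARY KEY, sketch_id INTEGER, name TEXT);
        CREATE TABLE datasource (id INTEGER PRIMARY KEY, timeline_id INTEGER, label TEXT);
        """
    )
    conn.execute("INSERT INTO timeline (id, sketch_id, name) VALUES (1, 1, 'uploads')")
    conn.executemany(
        "INSERT INTO datasource (timeline_id, label) VALUES (1, ?)",
        [(f"batch-{i}",) for i in range(n_datasources)],
    )
    conn.commit()
    return CountingConnection(conn)


def load_per_object(db, sketch_id):
    """One query for the ids, then one query per datasource (the slow pattern)."""
    ids = db.query(
        "SELECT d.id FROM datasource d JOIN timeline t ON d.timeline_id = t.id "
        "WHERE t.sketch_id = ? ORDER BY d.id",
        (sketch_id,),
    )
    return [
        db.query("SELECT id, label FROM datasource WHERE id = ?", (ds_id,))[0]
        for (ds_id,) in ids
    ]


def load_joined(db, sketch_id):
    """A single joined query that returns the same rows."""
    return db.query(
        "SELECT d.id, d.label FROM datasource d JOIN timeline t ON d.timeline_id = t.id "
        "WHERE t.sketch_id = ? ORDER BY d.id",
        (sketch_id,),
    )


slow_db = build_demo_db(1000)
fast_db = build_demo_db(1000)
slow_rows = load_per_object(slow_db, 1)
fast_rows = load_joined(fast_db, 1)
print("per-object queries:", slow_db.query_count)
print("joined queries:", fast_db.query_count)
```

With a local database each extra query is cheap, but with a remote database every query adds a network round trip, which would be consistent with the much longer load times reported above.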

Expected behavior The sketch loads instantaneously with a database on the same machine, or in a couple of seconds with the database on a remote server.


Additional context We prefer to load large timelines to our Timesketch server in batches, to make request sizes more reasonable, which is how we can end up with hundreds or thousands of datasources.
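To show how batching leads to a high datasource count, here is a minimal sketch of splitting an event stream into fixed-size batches. The `batched` helper and the batch size of 1000 are illustrative assumptions, not part of the Timesketch clients; it also assumes, in line with the reproduction script, that each uploaded batch ends up as its own datasource.

```python
from itertools import islice


def batched(events, batch_size):
    """Yield successive lists of at most batch_size events (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 10,500 synthetic events in the same shape as the reproduction script's entries.
events = (
    {"message": i, "datetime": "1970-01-01T00:00:00.000Z", "timestamp_desc": "test"}
    for i in range(10_500)
)

# Each batch would be sent as a separate upload; here we only record its size.
batch_sizes = [len(batch) for batch in batched(events, 1000)]
print(len(batch_sizes), "batches; last batch has", batch_sizes[-1], "events")
```

Under that assumption, a timeline of a few hundred thousand events uploaded in batches of this size already yields hundreds of datasources in one sketch.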