google / timesketch

Collaborative forensic timeline analysis
Apache License 2.0
2.62k stars 589 forks source link

Infinite Plaso import (spinning wheel) - Duplicate timeline statuses in database #3180

Open jbaptperez opened 2 months ago

jbaptperez commented 2 months ago

Describe the bug

When importing a set of Plaso files at the same time (concurrent HTTP importations), one can reach to the following undesired state: In the Timesketch database, the table timeline_status contains (at least) 2 lines with the same parent_id:

The ready line shows that the import is already done, but the processing line is still there: This is the bug.

At the frontend side, the related timeline permanently shows a spinning wheel, expecting the end of the (already terminated) process.

Actually, I can identify several cases with the same parent_id in the timeline_status table, which is already problematic However, the bug becomes visible in the frontend when one of the less recent line's status is set to processing.

I suspect this is linked to concurrent access to the database from workers when importing Plaso data. I mean, I think no transaction protects the access to the timeline_status table.

I think the bug is precisely here (see the TODO comment) (timesketch/models/annotations.py):

    def set_status(self, status):
        """
        Set status on object. Although this is a many-to-many relationship
        this makes sure that the parent object only has one status set.

        Args:
            status: Name of the status
        """
        # TODO Fix refresh self.status now.
        for _status in self.status:
            self.status.remove(_status)
        self.status.append(self.Status(user=None, status=status))
        db_session.add(self)
        db_session.commit()

To Reproduce

Steps to reproduce the behavior:

  1. Create a campaign,
  2. Create a timeline,
  3. At the same time (concurrent HTTP requests), import a set of Plaso files to this timeline.

The problem cannot be reproduced systematically and Plaso files sizes can be small.

Expected behavior

The processing line should have been removed from the timeline_status table when the import is done, and the ready line should be the only one remaining.

In such a case, the spinning wheel should disappear from the GUI, making data available for this timeline.

More generally in the database, a timeline should never have more than a single timeline_status line.

Screenshots

image

Desktop (please complete the following information):

Note: The desktop set-up is not related to the problem as it is a backend bug.

Additional context

GET /api/v1/version:

{
  "meta":{
    "plaso_version":"20240308",
    "version":"20240508.1"
  },
  "objects":[]
}

Below is a set of SQL queries to easily identify the bug:

Getting timelines with the bug, that is:

select
    sketch.id,
    sketch.updated_at,
    sketch.name,
    timeline.id,
    timeline.updated_at,
    timeline.name,
    timeline.searchindex_id,
    timeline_status.id,
    timeline_status.updated_at,
    timeline_status.status,
    timeline_status.rank
from (
    select * from (
        select *, row_number() over (partition by parent_id order by created_at desc) rank
        from timeline_status
        where parent_id is not null
    ) duplicates
    where
        duplicates.rank > 1
        and duplicates.status = 'processing'
    order by created_at
) timeline_status
    inner join timeline on timeline_status.parent_id = timeline.id
    inner join sketch on timeline.sketch_id = sketch.id;

Fixing the bug, that is, deleting the timeline_status lines with the status set to processing when they are not the most recent one.

delete from timeline_status where id in (
    select id
    from (
        select *, row_number() over (partition by parent_id order by created_at desc) rank
        from timeline_status
        where parent_id is not null
    ) duplicates
    where
        duplicates.rank > 1
        and duplicates.status = 'processing'
);

timeline_status lines for timelines bound to more than one with the processing bug:

select
    sketch.id,
    sketch.updated_at,
    sketch.name,
    timeline.id,
    timeline.updated_at,
    timeline.name,
    timeline.searchindex_id,
    timeline_status.id,
    timeline_status.updated_at,
    timeline_status.status
from (
         select
             parent_id as id
         from (
                  select
                      parent_id,
                      status,
                      row_number() over (partition by parent_id order by created_at desc) as rank
                  from timeline_status
                  where parent_id is not null
              ) grouped_timeline_status
         where
             grouped_timeline_status.rank > 1
             and grouped_timeline_status.status = 'processing'
         group by parent_id
         order by parent_id
     ) problematic_timeline
         inner join timeline_status on problematic_timeline.id = timeline_status.parent_id
         inner join timeline on problematic_timeline.id = timeline.id
         inner join sketch on timeline.sketch_id = sketch.id
order by
    sketch.id,
    timeline.id,
    timeline_status.created_at;

All timeline_status lines for timelines bound to more than one (a more general problem):

select
    sketch.id,
    sketch.updated_at,
    sketch.name,
    timeline.id,
    timeline.updated_at,
    timeline.name,
    timeline.searchindex_id,
    timeline_status.id,
    timeline_status.updated_at,
    timeline_status.status
from (
    select
        parent_id as id
    from (
        select
            parent_id,
            row_number() over (partition by parent_id order by created_at desc) as rank
        from timeline_status
        where parent_id is not null
    ) grouped_timeline_status
    where
        grouped_timeline_status.rank > 1
    group by parent_id
    order by parent_id
) problematic_timeline
    inner join timeline_status on problematic_timeline.id = timeline_status.parent_id
    inner join timeline on problematic_timeline.id = timeline.id
    inner join sketch on timeline.sketch_id = sketch.id
order by
    sketch.id,
    timeline.id,
    timeline_status.created_at;

Distribution of timelines with more than a single timeline_status:

select
    sketch.name,
    timeline.name,
    timeline_status_stat.count
from
    sketch
    inner join timeline on sketch.id = timeline.sketch_id
    inner join (
        select parent_id,
            count(1) as count
        from timeline_status
        where parent_id is not null
        group by parent_id
        having count(1) > 1
    ) timeline_status_stat on timeline.id = timeline_status_stat.parent_id
order by
    timeline_status_stat.count desc,
    sketch.name,
    timeline.name;