google / grr

GRR Rapid Response: remote live forensics for incident response
https://grr-doc.readthedocs.io/
Apache License 2.0

Frontend Performance #178

Closed pidydx closed 9 years ago

pidydx commented 9 years ago

It appears that the frontends are constantly updating metadata:last on aff4:/. Is this necessary? Having so many threads hammering the same subject/attribute is causing delays. I haven't been able to find the code to see exactly why it is doing that.

destijl commented 9 years ago

Talked with @scudette about this. This code is intended to maintain indexes: https://github.com/google/grr/blob/master/lib/aff4.py#L196

It shouldn't be updating the root very often because the root should be really hot and basically always in the cache. You can see this in the code above. It's possible that the cache is too small, or not functioning correctly for some reason. This is likely more of a problem during interrogate than other hunts because interrogate creates lots more aff4 path components (due to all the flows) than most hunts.

You could try increasing the cache size here: https://github.com/google/grr/blob/master/config/aff4.py#L19

We should probably also add some monitoring of cache hits and misses so we can see whether this is an issue for us too.
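
For reference, here is a minimal sketch of what hit/miss monitoring could look like: a small LRU wrapper that counts hits and misses. This is purely illustrative and does not reproduce the API of GRR's own cache classes; the Get/Put names are only borrowed from the snippets later in this thread.

    import collections

    class CountingLRUCache(object):
      """A toy LRU cache that counts hits and misses (sketch only)."""

      def __init__(self, max_size=10000):
        self.max_size = max_size
        self.hits = 0
        self.misses = 0
        self._data = collections.OrderedDict()

      def Get(self, key):
        try:
          value = self._data.pop(key)
        except KeyError:
          self.misses += 1
          raise
        # Re-insert so this key becomes the most recently used entry.
        self._data[key] = value
        self.hits += 1
        return value

      def Put(self, key, value):
        self._data.pop(key, None)
        if len(self._data) >= self.max_size:
          self._data.popitem(last=False)  # Evict the least recently used.
        self._data[key] = value

Exporting the hits/misses counters (or their ratio) would make it obvious whether the intermediate cache is simply too small for the number of path components a big Interrogate hunt creates.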

pidydx commented 9 years ago

I bumped the cache size to 20k from 10k and the intermediate to 10k from 2k. It seemed like that might have helped, but when I tried to relax the restriction on the SQL connection pool it quickly went back to hammering the database on subject aff4:/ attribute metadata:last.

Originally we had this problem because the code was telling it to delete aff4:/ metadata:last and then insert a new aff4:/ metadata:last. This was why I had to modify the MySQLAdvanced connector to try to do UPDATE instead of DELETE/INSERT wherever possible.

pidydx commented 9 years ago

I bumped again from 20k to 200k and 10k to 100k and expanded the connection pool for the frontends, and this behavior continued. I don't think the cache is the problem. This issue only impacts the frontends; the workers don't cause this to happen.

scudette commented 9 years ago

I think I found the problem. The offending code is:

        try:
          self.intermediate_cache.Get(urn)
          return
        except KeyError:
          data_store.DB.MultiSet(dirname, {
              AFF4Object.SchemaCls.LAST: [
                  rdfvalue.RDFDatetime().Now().SerializeToDataStore()],

              # This updates the directory index.
              "index:dir/%s" % utils.SmartStr(basename): [EMPTY_DATA],
          },
                                 token=token, replace=True, sync=False)

          self.intermediate_cache.Put(urn, 1)

          urn = dirname

Should be:

        try:
          self.intermediate_cache.Get(dirname)
          return
        except KeyError:
          data_store.DB.MultiSet(dirname, {
              AFF4Object.SchemaCls.LAST: [
                  rdfvalue.RDFDatetime().Now().SerializeToDataStore()],

              # This updates the directory index.
              "index:dir/%s" % utils.SmartStr(basename): [EMPTY_DATA],
          },
                                 token=token, replace=True, sync=False)

          self.intermediate_cache.Put(dirname, 1)

          urn = dirname

This code is supposed to limit the writes of LAST to each path component (to guarantee it is written at least once), so the cache is supposed to control this. But the previous code was using the urn as the cache key while writing to the dirname: for each client the urn is something like aff4:/C.12343 and the dirname is aff4:/, which means the root was written once per new client.

Can you please verify if this helps?

scudette commented 9 years ago

Hmm, no scratch that! I ran through some tests and this is not the issue. I am taking a closer look.

scudette commented 9 years ago

OK, try the following patch. The above code is actually correct: it updates the index for an AFF4 object by writing the index (as well as the LAST modification) on its parent. It walks up the directory tree writing each parent until it hits a component that is already in the cache, at which point we can stop because we are sure the index for that component has already been written.
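
To make the walk concrete, here is a condensed, GRR-free sketch of the algorithm just described (plain string splitting rather than the RDFURN API, helper names invented):

    def AddChildIndexes(urn, intermediate_cache, write_index):
      """Write a directory index entry on each ancestor until a cache hit."""
      while urn != "aff4:/":
        dirname, _, basename = urn.rpartition("/")
        if dirname == "aff4:":
          dirname = "aff4:/"
        if dirname in intermediate_cache:
          return  # An ancestor is cached, so its index was already written.
        write_index(dirname, basename)  # Also bumps metadata:last on dirname.
        intermediate_cache.add(dirname)
        urn = dirname

    cache = set()
    writes = []
    AddChildIndexes("aff4:/C.12343/fs/os", cache,
                    lambda d, b: writes.append((d, b)))
    # writes == [("aff4:/C.12343/fs", "os"), ("aff4:/C.12343", "fs"),
    #            ("aff4:/", "C.12343")]

With a warm cache the walk stops almost immediately; with a cold or undersized cache it runs all the way to the root, which is exactly the metadata:last traffic on aff4:/ seen here.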

This is required because we often write new objects blindly and we don't want to wait for reads (async writes). But there are some cases where we know for sure the object already exists in the data store. For example, if we previously opened the object we know it already exists and we assume its indexes are valid. Similarly, when we write the client object we can be sure it was previously created (because we know about it), so we can skip writing indexes in some cases.

The code below adds an option to tell the AFF4Object that it already exists for sure, in which case we can skip the indexes. You should not see indexes updated on the frontends any more.

--- a/grr/lib/aff4.py
+++ b/grr/lib/aff4.py
@@ -492,10 +497,11 @@ class Factory(object):
                              age=age, ignore_cache=ignore_cache,
                              token=token))

-    # Read the row from the table.
+    # Read the row from the table. We know the object already exists if there is
+    # some data in the local_cache already for this object.
     result = AFF4Object(urn, mode=mode, token=token, local_cache=local_cache,
                         age=age, follow_symlinks=follow_symlinks,
-                        aff4_type=aff4_type)
+                        aff4_type=aff4_type, object_exists=bool(local_cache.get(urn)))

     # Now we have a AFF4Object, turn it into the type it is currently supposed
     # to be as specified by Schema.TYPE.
@@ -1288,7 +1294,7 @@ class AFF4Object(object):

   def __init__(self, urn, mode="r", parent=None, clone=None, token=None,
                local_cache=None, age=NEWEST_TIME, follow_symlinks=True,
-               aff4_type=None):
+               aff4_type=None, object_exists=False):
     if urn is not None:
       urn = rdfvalue.RDFURN(urn)
     self.urn = urn
@@ -1299,6 +1305,10 @@ class AFF4Object(object):
     self.follow_symlinks = follow_symlinks
     self.lock = utils.PickleableLock()

+    # The object already exists in the data store - we do not need to update
+    # indexes.
+    self.object_exists = object_exists
+
     # This flag will be set whenever an attribute is changed that has the
     # creates_new_object_version flag set.
     self._new_version = False
@@ -1506,9 +1516,15 @@ class AFF4Object(object):
             (rdfvalue.RDFString(self.__class__.__name__).SerializeToDataStore(),
              rdfvalue.RDFDatetime().Now())]

+      # We only update indexes if the schema does not forbid it and we are not
+      # sure that the object already exists.
+      add_child_index = self.Schema.ADD_CHILD_INDEX
+      if self.object_exists:
+        add_child_index = False
+
       # Write the attributes to the Factory cache.
       FACTORY.SetAttributes(self.urn, to_set, self._to_delete,
-                            add_child_index=self.Schema.ADD_CHILD_INDEX,
+                            add_child_index=add_child_index,
                             sync=sync, token=self.token)

       # Notify the factory that this object got updated.
@@ -1780,6 +1796,7 @@ class AFF4Object(object):
     # Instantiate the class
     result = cls(self.urn, mode=self.mode, clone=self, parent=self.parent,
                  token=self.token, age=self.age_policy,
+                 object_exists=self.object_exists,
                  follow_symlinks=self.follow_symlinks, aff4_type=self.aff4_type)
     result.Initialize()

pidydx commented 9 years ago

This definitely seems better. The aff4:/ + metadata:last issue is gone and it no longer cascades into failure. I have seen similar behavior with aff4:/hunts/H:XXXXX/Logs + metadata:last, but that seems to recover after a few minutes of really slow access (20-120s queries on that pair) and doesn't turn into the kind of cascading failure where everything gets stuck waiting on the datastore.

I have also noticed that the frontends don't appear to respect the threadpool setting and are capable of spinning up thousands of threads under high demand. This may not be directly related, but I wasn't sure if they were supposed to spin up that many threads.

scudette commented 9 years ago

It is possible that the extra threads are spun up by the WSGI server itself: under load the thread pool is full and pushes back on the writes, so the threads that service clients are blocked waiting for a slot in the pool. If the WSGI server has no reasonable thread management strategy it will just keep spawning new threads to service requests, and they will all end up blocked on the pool.

The main symptom is an increase in memory footprint, since more threads mean more stack space.

pidydx commented 9 years ago

That makes sense. Our monitoring isn't showing excessive memory usage so we should be good. I pushed the patch to our frontends and workers and started pressing on it pretty hard and it looks good so far.

pidydx commented 9 years ago

This has definitely solved the aff4:/ + metadata:last problem, but the aff4:/hunts/H:XXXXX/Logs + metadata:last UPDATEs appear to be the new bottleneck.

scudette commented 9 years ago

Hmm, yeah - the hunt's Logs collection is an AFF4 object of type PackedVersionedCollection:

https://github.com/google/grr/blob/35da6a71179a4b3e8631620e9bfa1a2e3fc68e0f/lib/aff4_objects/collections.py#L369

It probably does not need directory indexing at all since we never need to navigate to it (i.e. with ListChildren() of the parent). You can either add:

ADD_CHILD_INDEX = False

to the PackedVersionedCollection Schema: https://github.com/google/grr/blob/35da6a71179a4b3e8631620e9bfa1a2e3fc68e0f/lib/aff4_objects/collections.py#L478

This would apply to all packed version collections.

The better option is to assert that this collection already exists in OpenLogsCollection(): https://github.com/google/grr/blob/35da6a71179a4b3e8631620e9bfa1a2e3fc68e0f/lib/flow_runner.py#L230

    return aff4.FACTORY.Create(self._GetLogsCollectionURN(logs_collection_urn),
                               "PackedVersionedCollection", mode=mode,
                               object_exists=True, token=self.token)

This will turn indexing off when writing hunt results to the packed collection.

You might need to ensure the object_exists parameter is added to Create() and propagated in the case it is opened for blind write (mode == "w"): https://github.com/google/grr/blob/35da6a71179a4b3e8631620e9bfa1a2e3fc68e0f/lib/aff4.py#L689

    result = cls(urn, mode=mode, token=token, age=age, aff4_type=aff4_type,
                 object_exists=object_exists)

pidydx commented 9 years ago

Good news is that this did indeed fix this problem. Bad news is that it kinda kicked the can down the road a bit again.

It occasionally hits lock/wait snags trying to DELETE aff4:/cron/versioned_collection_compactor + index:changed/aff4:/hunts/H:A5452ACA/Logs. However, this happens with much lower frequency, much lower lock/wait times, and shorter bursts of slow queries before clearing up again, and it does not appear to significantly impact hunt performance at this time.

scudette commented 9 years ago

Are you saying it issues repeated DELETE statements for the same item (aff4:/cron/versioned_collection_compactor + index:changed/aff4:/hunts/H:A5452ACA/Logs)? You can see the delete happening here:

https://github.com/google/grr/blob/96f0e5bb863318867502992dc6c7542731c1a9dd/lib/flows/cron/compactors.py#L34

    for urn in collections.PackedVersionedCollection.QueryNotifications(
        timestamp=freeze_timestamp, token=self.token):
      collections.PackedVersionedCollection.DeleteNotifications(
          [urn], end=freeze_timestamp, token=self.token)

So I think QueryNotifications() will retrieve all the notifications (one per result) in the timespan, but they are all for the same attribute (the logs collection). Maybe this is the problem, since we can see that it tries to delete each notification separately.

This is the relevant code:

  def QueryNotifications(cls, timestamp=None, token=None):
    """Query all the notifications for the given type of collections."""
    index_predicate = cls.index_format % ".+"
    for _, urn, urn_timestamp in data_store.DB.ResolveRegex(
        cls.notification_queue, index_predicate,
        timestamp=(0, timestamp), token=token):
      yield rdfvalue.RDFURN(urn, age=urn_timestamp)

So maybe deduplicate the urns first, keeping the newest timestamp seen for each:

  def QueryNotifications(cls, timestamp=None, token=None):
    """Query all the notifications for the given type of collections."""
    index_predicate = cls.index_format % ".+"
    result = {}
    for _, urn, urn_timestamp in data_store.DB.ResolveRegex(
        cls.notification_queue, index_predicate,
        timestamp=(0, timestamp), token=token):
      # Collapse duplicate urns, remembering the newest timestamp for each.
      result[urn] = max(urn_timestamp, result.get(urn, 0))

    for urn, urn_timestamp in result.iteritems():
      yield rdfvalue.RDFURN(urn, age=urn_timestamp)
pidydx commented 9 years ago

This is the query that shows up in the slow query logs with wait/lock while multiple threads are trying to do the delete at the same time. This query also deletes multiple rows each time it runs. For the Interrogate hunts we are still noticing that a large gap forms between started/completed and it never seems to close. Could this be another problem where it is deleting more than it should?

DELETE aff4 FROM aff4 WHERE subject_hash=unhex(md5('aff4:/cron/versioned_collection_compactor')) AND attribute_hash=unhex(md5('index:changed/aff4:/hunts/H:28C1498E/Logs'))

scudette commented 9 years ago

Are you sure multiple threads are trying to delete? Maybe the threads are trying to write and the delete query needs to wait on them? It is also kind of strange because I would have expected the deletion to be delimited by timestamps. A potential problem here (this is the code which deletes notifications from queues) is that notifications may be lost: if a new notification is written after the delete was issued for a previous processing run, it can be swept up by that delete. The code sets end=freeze_timestamp, but that does not seem to be honored?

pidydx commented 9 years ago

When it shows up in the slow query log there is a long run of DELETE queries that all show large wait/lock times. There could also be locks due to writes, but without lining up exact start/end times for each query it definitely looks like multiple deletes are happening at or around the same time when things get very busy.

What we are seeing definitely seems consistent with notifications getting dropped. There was a race condition previously that caused similar behavior, which I'm pretty sure is fixed now. If those delete queries should have timestamps then this particular collision looks like a different bug.

These seem likely candidates, but I'm not seeing anything obvious at the moment. https://github.com/google/grr/blob/master/lib/data_stores/mysql_advanced_data_store.py#L489 https://github.com/google/grr/blob/master/lib/data_stores/mysql_advanced_data_store.py#L537

pidydx commented 9 years ago

I definitely see where that query should have a timestamp associated with it, but looking through the code it seems like the timestamp should be getting applied even though it apparently isn't.

pidydx commented 9 years ago

I made some changes to _MakeTimestamp to make sure I wasn't clobbering anything by reusing variable names, but that doesn't appear to have fixed this. I'm not seeing the DELETE queries in the slow query log, but that doesn't mean they aren't still happening, just faster than that log's threshold. I still see online hosts with flows that just aren't completing, with no errors.

I don't see any reason for those deletes to not be getting a timestamp after the _MakeTimestamp change here: https://github.com/pidydx/grr/commit/98eb01d47eaf65a360db5aa9f7517d31fb6dd36d#diff-6508ba5e3a03d374c263eb6cd19f5908R581

pidydx commented 9 years ago

Unrelated to the timestamp problem, could datastore latency cause a race condition here? Thread 1 goes to insert notifications at timestamp X but has to wait for a connection from the pool (or some other delay). Thread 2 selects everything earlier than timestamp Y, where X < Y. Thread 1 completes the insert. Thread 2 then deletes everything earlier than Y and winds up dropping the notifications.
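
For concreteness, a toy walkthrough of that sequence, with plain Python lists standing in for the datastore (no GRR APIs, made-up numbers):

    # Rows of (urn, timestamp) standing in for notification rows.
    notifications = []

    # Thread 1 computes its timestamp X = 100 in Python...
    pending_insert = ("index:changed/aff4:/hunts/H:XXXXX/Logs", 100)
    # ...but stalls waiting for a connection from the pool.

    # Thread 2 (the compactor) picks freeze_timestamp Y = 105 and queries.
    freeze_timestamp = 105
    selected = [n for n in notifications if n[1] <= freeze_timestamp]
    assert selected == []  # Nothing to process yet.

    # Thread 1's delayed write now lands, still carrying its stale timestamp.
    notifications.append(pending_insert)

    # Thread 2 deletes everything up to Y and removes a row it never saw.
    notifications = [n for n in notifications if n[1] > freeze_timestamp]
    assert notifications == []  # The notification written at t=100 is lost.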

scudette commented 9 years ago

We are very careful with this to avoid races. The queue management algorithm queries the notifications up to a specific timestamp and then makes sure to only delete notifications up to the timestamp it handled. I guess, though, we are assuming that the timestamp on the notification is exact - the kind of race you describe could occur if the timestamp we write the notification at is calculated on the GRR server and there is then latency until it is actually written to the database. It would be best if the timestamp was actually set in the database itself - say with a stored procedure? I think MySQL has timestamp columns where it can timestamp a row by itself?

pidydx commented 9 years ago

At present all of the open source datastores are using Python to set the timestamp before inserting. There are also rows where the timestamp is set to 0. If preserving these 0s is unnecessary I can see about modifying the MySQLAdvanced connector to handle that and reduce the risk of these races.

scudette commented 9 years ago

For those rows with timestamp 0 the intention is that the time is unset - we expect to overwrite previous times. We mostly use these for things we do not need to keep multiple versions of - for example, as we write hunt collection chunks we don't keep older versions (since the collection simply grows). With the sqlite datastore write latency is insignificant, but maybe this is an issue for MySQL. I seem to recall MySQL has a timestamp column type, but it acts a little differently from the way we use timestamps: if you write the same row it won't keep two versions with different timestamps, it will just update the timestamp column to the latest write time. If that is the case we still need to keep both versions, but maybe we can write the timestamp as it is done currently and, when we read each row, return the timestamp automatically set by the database instead? That way we can have multiple versions which, as far as MySQL is concerned, are different rows, but their timestamps are synchronized to the time they hit the db.

pidydx commented 9 years ago

As far as I can tell the SQL connectors only use timestamp as an index for SELECTs and are not using it as a primary key in any way to force uniqueness. A MultiSet using a timestamp of 0 will simply get INSERTed as usual unless replace=True or it also exists in to_delete.

pidydx commented 9 years ago

Can I safely ignore the timestamps being sent to MultiSet and let the datastore itself apply the timestamp on insert? It appears that GRR is setting timestamps all over in the Python and then relying on 0 latency writes to the datastores for these types of deletes.

scudette commented 9 years ago

I think when we use timestamp = 0 we deliberately want to delete the old value (because we only keep a single copy), so we also add it to the to_delete list. This applies to non-versioned AFF4 attributes and also to indexes.
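
A sketch of that pattern at the data store API level (the replace/sync/token keywords match the MultiSet call quoted earlier in this thread; passing timestamp and to_delete directly like this, and the import path, are assumptions):

    from grr.lib import data_store  # import path assumed

    def WriteSingleCopyAttribute(subject, attribute, value, token=None):
      """Sketch: keep exactly one copy of a non-versioned attribute."""
      data_store.DB.MultiSet(subject, {attribute: [value]},
                             timestamp=0,            # "unset": single version
                             to_delete=[attribute],  # drop the previous copy
                             replace=True, sync=False, token=token)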

scudette commented 9 years ago

I think letting the data store set the timestamp should work. The data store layer assumes that if a timestamp was not given to a MultiSet() call, it should fill one in itself. According to this document

https://dev.mysql.com/doc/refman/5.0/en/timestamp-initialization.html

The timestamp column is just a column with a default of the current time - so if you don't specify it, it will be set to the current time, but if the caller specifies an explicit timestamp, it should be honored. It looks like MySQL will do this. Maybe just let the NULL propagate to the MySQL layer when timestamp is None on MultiSet calls?

pidydx commented 9 years ago

I talked to our DBA and I have code that can INSERT/UPDATE and properly set/update timestamps, but I'm not sure yet whether I can cleanly propagate the None into the queries. However, given the flusher thread and the delay between a timestamp being set and the actual insert (even ignoring write latency), I think the best thing to do is likely to let MySQL handle all timestamps.
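
As a sketch of that direction, the INSERT can ask the MySQL server to compute the microsecond timestamp itself instead of interpolating a Python-generated value. The subject_hash/attribute_hash columns and the unhex(md5(...)) scheme are taken from the slow query quoted above; the timestamp and value column names are assumptions, the cursor is e.g. a MySQLdb cursor, and NOW(6) needs MySQL 5.6 or later:

    def InsertWithServerTimestamp(cursor, subject, attribute, value):
      """Sketch: let MySQL assign the row timestamp at INSERT time."""
      cursor.execute(
          "INSERT INTO aff4 (subject_hash, attribute_hash, timestamp, value) "
          "VALUES (unhex(md5(%s)), unhex(md5(%s)), "
          # Microseconds since the epoch, computed on the database server, so
          # flusher-thread or connection-pool delays cannot skew the value.
          "CAST(UNIX_TIMESTAMP(NOW(6)) * 1000000 AS UNSIGNED), "
          "%s)",
          (subject, attribute, value))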

pidydx commented 9 years ago

I have a change to MySQLAdvanced that pushes all of the timestamp generation into the actual MySQL server at INSERT/UPDATE, which removes any write latency or connection pool latency from the picture. However, there is still a problem here with clock skew causing the race condition.

The Python should not be generating a timestamp on the GRR server for the select; instead it should select the latest available timestamp in the datastore and then use that timestamp for deletion. This should eliminate these race conditions.

scudette commented 9 years ago

Why do you think this is a problem? When the compaction cycle or queue processing operation takes place it uses the timestamp to delimit a range of times it is willing to work with. If those times are a little behind the latest timestamp it should not matter.

Also, Misha pointed out the case where we write timestamps in the future so that hunt states will be processed in the future (this is how the hunt client rate is implemented). In that case you must allow the Python code to dictate when the notification is written instead of letting the database do it.

pidydx commented 9 years ago

The problems can happen when it goes to delete based on a timestamp it generated itself, which may not agree with all of the other servers involved.

Say GRR Server 1 is at time 1 and GRR Server 2 (or the datastore) is at time 0:

1. GRR Server 1 selects all rows with time < 1.
2. GRR Server 2 (or the datastore) creates a row with time 0.
3. GRR Server 1 deletes all rows with time < 1, including the row it never selected.

Perhaps the better way to fix this is to use:

      collections.PackedVersionedCollection.DeleteNotifications(
          [urn], end=urn.age, token=self.token)

Are there other places where this happens? As long as the datastore prevents inserts with past timestamps from getting in, this should be safe: it only deletes things older than the most recent notification it actually found in the datastore, rather than everything older than what it thinks the current time is, since the timestamps on the items in the datastore may disagree with the server's clock.
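
A sketch of what the compactor loop could look like with both changes, deduplicating the urns and bounding each delete by the age that came back from the datastore rather than by a locally generated timestamp (the function name and dict bookkeeping are mine, and I am assuming the age values compare cleanly; QueryNotifications/DeleteNotifications are the methods quoted above):

    def DeleteProcessedNotifications(collection_cls, freeze_timestamp,
                                     token=None):
      """Sketch: one delete per distinct urn, bounded by its observed age."""
      newest = {}
      for urn in collection_cls.QueryNotifications(timestamp=freeze_timestamp,
                                                   token=token):
        key = str(urn)
        # Keep the newest notification actually read back for each urn.
        if key not in newest or urn.age > newest[key].age:
          newest[key] = urn

      for urn in newest.values():
        # end=urn.age: never delete past what the datastore itself returned.
        collection_cls.DeleteNotifications([urn], end=urn.age, token=token)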

pidydx commented 9 years ago

"The queue management algorithm queries the notifications up to the specific timestamp and then makes sure to only delete notifications up the timestamp it handles."

If this is true for queue management, then is the PackedVersionedCollection compactor the only place where the delete is based on the timestamp used in the select rather than the timestamps returned in the results?

I updated MySQLAdvanced to handle the timestamps in MySQL and that appears to have cleared up the majority of the issues.

Back to the original issue: I updated to the latest version and the aff4:/ + metadata:last collisions appear to be happening again. Everything looks correct, so I'm not sure what brought them back.

pidydx commented 9 years ago

Closing in favor of https://github.com/google/grr/issues/200 since multiple issues in this thread were identified and some have been solved.