log2timeline / plaso

Super timeline all the things
https://plaso.readthedocs.io
Apache License 2.0

Foreman process exceeds memory limit #2243

joachimmetz closed this issue 5 years ago

joachimmetz commented 5 years ago

While testing on an APFS image with 1.1M+ files, log2timeline.py hit the 4G foreman process memory limit. Determine what is consuming most of that memory (SQLite?) and see if we can reduce it.

Presumably https://github.com/log2timeline/plaso/issues/2190 is a duplicate.

Onager commented 5 years ago

Some notes:

Onager commented 5 years ago

The extra data is just because the storage file hasn't been closed out yet.

joachimmetz commented 5 years ago

guppy indicates ~1G of Task objects (excluding the memory used by related objects) and continued growth over time. Looking into where these references are kept.
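guppy is a third-party heap profiler; on Python 3 the stdlib tracemalloc module can give similar per-allocation accounting. A minimal sketch of that approach (not what was run here; the bytearray allocations are a stand-in for the real workload):

```python
import tracemalloc

tracemalloc.start(25)  # record up to 25 stack frames per allocation

# Stand-in for the workload under investigation.
data = [bytearray(1024) for _ in range(1000)]

snapshot = tracemalloc.take_snapshot()
# Group allocations by source line, largest total size first.
for stat in snapshot.statistics("lineno")[:5]:
    print(stat)

tracemalloc.stop()
```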

joachimmetz commented 5 years ago

This is not necessarily a blocker for the release IMHO, but it limits the ability to process images with 1M+ files under the process foreman limit. One can always remove the limit, but IMHO that should not be necessary and indicates an underlying issue with memory consumption.

joachimmetz commented 5 years ago

Catch MemoryError in plaso/tools/log2timeline.py(67)Main() and execute import pdb; pdb.set_trace():

(Pdb) import objgraph
(Pdb) objgraph.show_most_common_types(limit=20)
dict                         59442
list                         17980
function                     17145
tuple                        10314
AttributeContainerIdentifier 10125
TSKPartitionPathSpec         10025
APFSContainerPathSpec        10022
APFSPathSpec                 10021
OSPathSpec                   10018
RawPathSpec                  10018
Task                         10017
weakref                      4958
type                         3386
builtin_function_or_method   2449
getset_descriptor            2292
wrapper_descriptor           2033
method_descriptor            1588
module                       1545
cell                         1512
property                     1094
import random
import objgraph
objgraph.show_chain(objgraph.find_backref_chain(random.choice(objgraph.by_type('Task')), objgraph.is_proper_module), filename='chain.png')

According to the graph this is TaskManager._task_queued, which is not surprising.
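objgraph is a third-party module; its type histogram can be approximated with the stdlib gc module alone, which also works in a pdb session where objgraph is not installed. A sketch, not what was run here:

```python
import gc
from collections import Counter

def most_common_types(limit=20):
    """Rough stdlib equivalent of objgraph.show_most_common_types()."""
    counter = Counter(type(obj).__name__ for obj in gc.get_objects())
    return counter.most_common(limit)

for type_name, count in most_common_types(limit=5):
    print("{0:30s} {1:d}".format(type_name, count))
```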

Also, memory seems to have dropped to 2G once MemoryError was raised; have to try catching the exception elsewhere.

joachimmetz commented 5 years ago

diff --git a/plaso/multi_processing/task_engine.py b/plaso/multi_processing/task_engine.py
index 72c12486..83439d4e 100644
--- a/plaso/multi_processing/task_engine.py
+++ b/plaso/multi_processing/task_engine.py
@@ -826,6 +826,11 @@ class TaskMultiProcessEngine(engine.MultiProcessEngine):
             source_path_specs, storage_writer,
             filter_find_specs=filter_find_specs)

+      except MemoryError:
+        self._StopStatusUpdateThread()
+
+        import pdb; pdb.set_trace()
+
       finally:
         storage_writer.WriteSessionCompletion(aborted=self._abort)

So memory consumption is 4G at this point.

(Pdb) import resource
(Pdb) resource.setrlimit(resource.RLIMIT_DATA, (resource.RLIM_INFINITY, resource.RLIM_INFINITY))
*** ValueError: not allowed to raise maximum limit
prlimit -d --pid 29528
RESOURCE DESCRIPTION         SOFT       HARD UNITS
DATA     max data size 4294967296 4294967296 bytes

sudo prlimit --pid 29528 --data=unlimited

prlimit -d --pid 29528
RESOURCE DESCRIPTION        SOFT      HARD UNITS
DATA     max data size unlimited unlimited bytes
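The ValueError above is expected: an unprivileged process may move its soft limit anywhere up to the hard limit (and may irreversibly lower the hard limit), but raising the hard limit requires privileges (CAP_SYS_RESOURCE on Linux), hence the external sudo prlimit. A sketch of the direction that is always allowed:

```python
import resource

# Current data segment limit, i.e. what "prlimit -d" reports.
soft, hard = resource.getrlimit(resource.RLIMIT_DATA)
print("soft={0} hard={1}".format(soft, hard))

# Re-applying (soft, hard) is always permitted; raising "hard" beyond its
# current value from an unprivileged process raises ValueError, as seen above.
resource.setrlimit(resource.RLIMIT_DATA, (soft, hard))
```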

joachimmetz commented 5 years ago

objgraph.show_most_common_types(limit=20)
dict                         5128957
TSKPartitionPathSpec         1023966
APFSContainerPathSpec        1023963
APFSPathSpec                 1023962
OSPathSpec                   1023959
RawPathSpec                  1023959
tuple                        1023431
SQLTableIdentifier           1013945
EventSource                  1013941
list                         17891
function                     17146
AttributeContainerIdentifier 10125
Task                         10017
weakref                      4865
type                         3386
builtin_function_or_method   2424
getset_descriptor            2296
wrapper_descriptor           2045
method_descriptor            1589
module                       1544

joachimmetz commented 5 years ago

Looking at an individual APFSPathSpec object, it does not appear to have back references. Could this be related to copy.deepcopy()?
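copy.deepcopy() clones the whole parent chain, so every deepcopy of a path spec yields a fresh object per layer with no back references to the originals, which would match what objgraph is showing. A sketch with a hypothetical PathSpec class standing in for the dfVFS types:

```python
import copy

class PathSpec:
    """Hypothetical stand-in for dfVFS path specs, which chain via .parent."""
    def __init__(self, location, parent=None):
        self.location = location
        self.parent = parent

os_spec = PathSpec("/images/apfs.raw")
apfs_spec = PathSpec("/private/var/log", parent=os_spec)

clone = copy.deepcopy(apfs_spec)

# deepcopy duplicates every layer of the chain: new objects throughout,
# no references back to the originals.
assert clone is not apfs_spec
assert clone.parent is not os_spec
assert clone.parent.location == os_spec.location
```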

joachimmetz commented 5 years ago

See if the following changes make any difference:

diff --git a/plaso/engine/worker.py b/plaso/engine/worker.py
index bc6dace0..a4d60739 100644
--- a/plaso/engine/worker.py
+++ b/plaso/engine/worker.py
@@ -732,6 +732,8 @@ class EventExtractionWorker(object):
       self._ExtractContentFromDataStream(
           mediator, file_entry, data_stream.name)

+    del path_spec
+
   def _ProcessMetadataFile(self, mediator, file_entry):
     """Processes a metadata file.

Then again, this is called in the worker, not the foreman.

What about the call to self._path_spec_extractor.ExtractPathSpecs in plaso/multi_processing/task_engine.py?

diff --git a/plaso/multi_processing/task_engine.py b/plaso/multi_processing/task_engine.py
index e5444a1b..03016495 100644
--- a/plaso/multi_processing/task_engine.py
+++ b/plaso/multi_processing/task_engine.py
@@ -343,6 +343,8 @@ class TaskMultiProcessEngine(engine.MultiProcessEngine):
       if self._status_update_callback:
         self._status_update_callback(self._processing_status)

+      import gc; gc.collect()
+
     self._ScheduleTasks(storage_writer)

     if self._abort:

Unfortunately this did not help.

joachimmetz commented 5 years ago

Remove need for deepcopy?

joachimmetz commented 5 years ago

ExtractPathSpecs is not called often in the test run, so add gc.collect() to the loop in _ScheduleTasks instead. The counter is there to ensure the performance impact is not too heavy:

diff --git a/plaso/multi_processing/task_engine.py b/plaso/multi_processing/task_engine.py
index e5444a1b..6f0165b7 100644
--- a/plaso/multi_processing/task_engine.py
+++ b/plaso/multi_processing/task_engine.py
@@ -416,6 +416,7 @@ class TaskMultiProcessEngine(engine.MultiProcessEngine):

     event_source = event_source_heap.PopEventSource()

+    gc_counter = 0
     task = None
     while event_source or self._task_manager.HasPendingTasks():
       if self._abort:
@@ -456,6 +457,13 @@ class TaskMultiProcessEngine(engine.MultiProcessEngine):
         if not task and not event_source:
           event_source = event_source_heap.PopEventSource()

+        if gc_counter > 10000:
+          gc_counter = 0
+
+          import gc; gc.collect()
+        else:
+          gc_counter += 1
+
       except KeyboardInterrupt:
         self._abort = True

joachimmetz commented 5 years ago

No luck with the additional gc enforcement; back to the data.

In the test run _serialized_event_heap contains 150k+ events, but looking at the objgraph data there are ~1M EventSource objects, and EventSource contains the path specs.

The event_source serialized attribute list contains 48k objects.

Other hypothesis: could SQLite be keeping a reference on the event source objects?

https://docs.python.org/2/library/sqlite3.html#sqlite3.connect

It looks like the previously written EventSources which are then queried from the SQLite storage are kept around, slowly filling memory.

(Debugging idea: use inspect to determine where EventSource objects are allocated.)
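One way to act on that idea: record allocation sites by patching __init__. The EventSource class below is a hypothetical stand-in, not the actual plaso container:

```python
import traceback
from collections import Counter

class EventSource:
    """Hypothetical stand-in for plaso's EventSource container."""

allocation_sites = Counter()
_original_init = EventSource.__init__

def _tracking_init(self, *args, **kwargs):
    # The second-to-last stack entry is the caller, i.e. the allocation site.
    frame = traceback.extract_stack()[-2]
    allocation_sites[(frame.filename, frame.lineno)] += 1
    _original_init(self, *args, **kwargs)

EventSource.__init__ = _tracking_init

for _ in range(5):
    EventSource()

# Most common allocation sites, analogous to what inspect/objgraph would show.
print(allocation_sites.most_common(3))
```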

joachimmetz commented 5 years ago

using:

e = random.choice(objgraph.by_type('EventSource'))
gc.get_referrers(e)

It looks like the object is referenced by:

len(gc.get_referrers(gc.get_referrers(e)[1]))
4

sys.getrefcount(gc.get_referrers(e)[1])
6

Is something keeping a hold of the tuples?

gc.get_referrers(gc.get_referrers(e)[1])

returns 3 frames and local "stack"

Is the frame in https://github.com/log2timeline/plaso/blob/master/plaso/multi_processing/task_engine.py#L51 keeping a reference on the tuple?
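Frames can indeed keep objects alive: while an exception (and thus its traceback) is referenced, every frame in the chain pins its locals. A small sketch of the mechanism, with a hypothetical Payload class standing in for the heap tuple:

```python
import gc
import weakref

class Payload:
    """Hypothetical large object standing in for the heap tuple."""

def failing():
    local_obj = Payload()
    raise MemoryError("simulated")

ref = None
exc = None
try:
    failing()
except MemoryError as error:
    exc = error
    # The traceback chains frames; failing()'s frame still holds local_obj.
    frame = exc.__traceback__.tb_next.tb_frame
    ref = weakref.ref(frame.f_locals["local_obj"])

# While the exception is referenced, the frame's locals stay alive.
assert ref() is not None

# Dropping the exception releases the frames and their locals.
del exc, frame
gc.collect()
assert ref() is None
```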

joachimmetz commented 5 years ago

https://rushter.com/blog/python-garbage-collector/

joachimmetz commented 5 years ago

diff --git a/plaso/multi_processing/task_engine.py b/plaso/multi_processing/task_engine.py
index 72c12486..b1c202ba 100644
--- a/plaso/multi_processing/task_engine.py
+++ b/plaso/multi_processing/task_engine.py
@@ -48,11 +48,13 @@ class _EventSourceHeap(object):
       EventSource: event source or None on error.
     """
     try:
-      _, event_source = heapq.heappop(self._heap)
+      heap_tuple = heapq.heappop(self._heap)

     except IndexError:
       return None

+    event_source = heap_tuple[1]
+    del heap_tuple
     return event_source

   def PushEventSource(self, event_source):

joachimmetz commented 5 years ago

e = random.choice(objgraph.by_type('EventSource'))
len(gc.get_referrers(gc.get_referrers(e)[1])[0])
1025943
gc.get_referrers(gc.get_referrers(gc.get_referrers(e)[1])[0])
... , (100, <plaso.containers.event_sources.EventSource object at 0x7f3fe3b67b90>)], '_maximum_number_of_items': 50000}]
len(myz[1]['_heap'])
1025943

Something wrong with how HeapFull is handled?

joachimmetz commented 5 years ago

diff --git a/plaso/multi_processing/task_engine.py b/plaso/multi_processing/task_engine.py
index 72c12486..f3d8a085 100644
--- a/plaso/multi_processing/task_engine.py
+++ b/plaso/multi_processing/task_engine.py
@@ -41,6 +41,14 @@ class _EventSourceHeap(object):
     self._heap = []
     self._maximum_number_of_items = maximum_number_of_items

+  def IsFull(self):
+    """Determines if the heap is full.
+
+    Returns:
+      bool: True if the heap is full.
+    """
+    return len(self._heap) >= self._maximum_number_of_items
+
   def PopEventSource(self):
     """Pops an event source from the heap.

@@ -74,7 +82,7 @@ class _EventSourceHeap(object):
     heap_values = (weight, event_source)
     heapq.heappush(self._heap, heap_values)

-    if len(self._heap) >= self._maximum_number_of_items:
+    if self.IsFull():
       raise errors.HeapFull()

@@ -451,7 +459,8 @@ class TaskMultiProcessEngine(engine.MultiProcessEngine):

         self._MergeTaskStorage(storage_writer)

-        self._FillEventSourceHeap(storage_writer, event_source_heap)
+        if not event_source_heap.IsFull():
+          self._FillEventSourceHeap(storage_writer, event_source_heap)

         if not task and not event_source:
           event_source = event_source_heap.PopEventSource()
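The pattern of the fix, as a minimal self-contained sketch (BoundedHeap, HeapFull, and fill are hypothetical simplifications, not the actual plaso classes): the push method raises HeapFull only after pushing, so a scheduling loop that refills unconditionally still adds one item per iteration; gating the refill on IsFull() caps the heap.

```python
import heapq

class HeapFull(Exception):
    """Raised when the bounded heap reaches its maximum size."""

class BoundedHeap:
    """Hypothetical, simplified version of plaso's _EventSourceHeap."""

    def __init__(self, maximum_number_of_items=50000):
        self._heap = []
        self._maximum_number_of_items = maximum_number_of_items

    def IsFull(self):
        return len(self._heap) >= self._maximum_number_of_items

    def Push(self, weight, item):
        # Mirrors PushEventSource: push first, then signal fullness.
        heapq.heappush(self._heap, (weight, item))
        if self.IsFull():
            raise HeapFull()

    def Pop(self):
        try:
            return heapq.heappop(self._heap)[1]
        except IndexError:
            return None

def fill(heap, source_iter):
    """Fills the heap until the producer signals HeapFull."""
    for weight, item in source_iter:
        try:
            heap.Push(weight, item)
        except HeapFull:
            return
```

With an unconditional refill the heap grows by one item per loop iteration past its maximum; with the IsFull() gate it stays capped at the configured maximum.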

joachimmetz commented 5 years ago

Test run with the patch now has ~2M event sources and ~1G memory usage.

Test completed

joachimmetz commented 5 years ago

https://github.com/log2timeline/plaso/pull/2316