Closed — joachimmetz closed this issue 5 years ago
Some notes:
The extra data is just because the storage file hasn't been closed out yet.
guppy indicates ~1G of Task objects (excluding the memory used by related objects) and continued growth over time. Looking into where these references are kept.
This is not necessarily a release blocker IMHO, but it limits the ability to process images with 1M+ files under the foreman process memory limit. One can always remove the limit, but IMHO that should not be necessary, and it indicates an underlying issue with memory consumption.
Catch the exception (MemoryError) in plaso/tools/log2timeline.py(67) Main() and execute import pdb; pdb.set_trace():
(Pdb) import objgraph
(Pdb) objgraph.show_most_common_types(limit=20)
dict 59442
list 17980
function 17145
tuple 10314
AttributeContainerIdentifier 10125
TSKPartitionPathSpec 10025
APFSContainerPathSpec 10022
APFSPathSpec 10021
OSPathSpec 10018
RawPathSpec 10018
Task 10017
weakref 4958
type 3386
builtin_function_or_method 2449
getset_descriptor 2292
wrapper_descriptor 2033
method_descriptor 1588
module 1545
cell 1512
property 1094
import random
import objgraph
objgraph.show_chain(objgraph.find_backref_chain(random.choice(objgraph.by_type('Task')), objgraph.is_proper_module), filename='chain.png')
According to the graph this is TaskManager._task_queued, which is not surprising.
Also, memory seems to have dropped to 2G once the MemoryError was raised; have to try catching the exception elsewhere.
diff --git a/plaso/multi_processing/task_engine.py b/plaso/multi_processing/task_engine.py
index 72c12486..83439d4e 100644
--- a/plaso/multi_processing/task_engine.py
+++ b/plaso/multi_processing/task_engine.py
@@ -826,6 +826,11 @@ class TaskMultiProcessEngine(engine.MultiProcessEngine):
source_path_specs, storage_writer,
filter_find_specs=filter_find_specs)
+ except MemoryError:
+ self._StopStatusUpdateThread()
+
+ import pdb; pdb.set_trace()
+
finally:
storage_writer.WriteSessionCompletion(aborted=self._abort)
So memory consumption is 4G at this point.
(Pdb) import resource
(Pdb) resource.setrlimit(resource.RLIMIT_DATA, (resource.RLIM_INFINITY, resource.RLIM_INFINITY))
*** ValueError: not allowed to raise maximum limit
prlimit -d --pid 29528
RESOURCE DESCRIPTION SOFT HARD UNITS
DATA max data size 4294967296 4294967296 bytes
sudo prlimit --pid 29528 --data=unlimited
prlimit -d --pid 29528
RESOURCE DESCRIPTION SOFT HARD UNITS
DATA max data size unlimited unlimited bytes
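The ValueError from setrlimit above is expected: an unprivileged process may lower its limits, or raise the soft limit up to the hard limit, but raising the hard limit requires privileges (CAP_SYS_RESOURCE), which is why the external sudo prlimit call was needed. A minimal sketch, using only the standard library, of inspecting and adjusting RLIMIT_DATA from inside the process:

```python
import resource

# Read the current data-segment limits (what prlimit -d reports).
soft, hard = resource.getrlimit(resource.RLIMIT_DATA)

# Lowering the soft limit, or raising it up to the hard limit, is
# always allowed for the process itself; this call is a no-op.
resource.setrlimit(resource.RLIMIT_DATA, (soft, hard))

# Raising the hard limit (e.g. to RLIM_INFINITY) from an unprivileged
# process raises ValueError, as seen in the pdb session above.
```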
objgraph.show_most_common_types(limit=20)
dict 5128957
TSKPartitionPathSpec 1023966
APFSContainerPathSpec 1023963
APFSPathSpec 1023962
OSPathSpec 1023959
RawPathSpec 1023959
tuple 1023431
SQLTableIdentifier 1013945
EventSource 1013941
list 17891
function 17146
AttributeContainerIdentifier 10125
Task 10017
weakref 4865
type 3386
builtin_function_or_method 2424
getset_descriptor 2296
wrapper_descriptor 2045
method_descriptor 1589
module 1544
Looking at an individual APFSPathSpec object, it does not appear to have back references. Could this be related to copy.deepcopy()?
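A quick illustration of why a deep-copied object can look reference-free (PathSpec here is a hypothetical stand-in, not the dfVFS class): copy.deepcopy builds an entirely new object graph, so until something stores the copy, only the local frame refers to it:

```python
import copy


class PathSpec(object):
  """Hypothetical stand-in for a dfVFS path specification."""

  def __init__(self, location, parent=None):
    self.location = location
    self.parent = parent


original = PathSpec('/apfs1', parent=PathSpec('/dev/sda1'))
clone = copy.deepcopy(original)

# deepcopy recursively copies the parent chain as well: the clone
# shares no objects with the original graph.
assert clone is not original
assert clone.parent is not original.parent
assert clone.parent.location == '/dev/sda1'
```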
See if the following changes make any difference:
diff --git a/plaso/engine/worker.py b/plaso/engine/worker.py
index bc6dace0..a4d60739 100644
--- a/plaso/engine/worker.py
+++ b/plaso/engine/worker.py
@@ -732,6 +732,8 @@ class EventExtractionWorker(object):
self._ExtractContentFromDataStream(
mediator, file_entry, data_stream.name)
+ del path_spec
+
def _ProcessMetadataFile(self, mediator, file_entry):
"""Processes a metadata file.
Then again, this is called in the worker, not the foreman.
What about the call to self._path_spec_extractor.ExtractPathSpecs in plaso/multi_processing/task_engine.py?
diff --git a/plaso/multi_processing/task_engine.py b/plaso/multi_processing/task_engine.py
index e5444a1b..03016495 100644
--- a/plaso/multi_processing/task_engine.py
+++ b/plaso/multi_processing/task_engine.py
@@ -343,6 +343,8 @@ class TaskMultiProcessEngine(engine.MultiProcessEngine):
if self._status_update_callback:
self._status_update_callback(self._processing_status)
+ import gc; gc.collect()
+
self._ScheduleTasks(storage_writer)
if self._abort:
Unfortunately this did not help.
Remove need for deepcopy?
ExtractPathSpecs is not called often in the test run; add gc.collect() to the loop in _ScheduleTasks instead. The counter is to ensure the performance impact is not too heavy.
diff --git a/plaso/multi_processing/task_engine.py b/plaso/multi_processing/task_engine.py
index e5444a1b..6f0165b7 100644
--- a/plaso/multi_processing/task_engine.py
+++ b/plaso/multi_processing/task_engine.py
@@ -416,6 +416,7 @@ class TaskMultiProcessEngine(engine.MultiProcessEngine):
event_source = event_source_heap.PopEventSource()
+ gc_counter = 0
task = None
while event_source or self._task_manager.HasPendingTasks():
if self._abort:
@@ -456,6 +457,13 @@ class TaskMultiProcessEngine(engine.MultiProcessEngine):
if not task and not event_source:
event_source = event_source_heap.PopEventSource()
+ if gc_counter > 10000:
+ gc_counter = 0
+
+ import gc; gc.collect()
+ else:
+ gc_counter += 1
+
except KeyboardInterrupt:
self._abort = True
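The pattern in the diff above, sketched as a self-contained helper (process_items and handle are illustrative names, not plaso code): a counter gates gc.collect() so a full collection only runs once every gc_interval iterations, amortizing its cost:

```python
import gc


def process_items(items, handle, gc_interval=10000):
  """Applies handle() to each item, forcing a garbage collection
  roughly every gc_interval iterations to amortize collection cost."""
  gc_counter = 0
  results = []
  for item in items:
    results.append(handle(item))
    if gc_counter > gc_interval:
      gc_counter = 0
      gc.collect()
    else:
      gc_counter += 1
  return results
```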
No luck with the additional gc enforcement; back to the data.
In the test run, _serialized_event_heap contains 150k+ events, but per the objgraph data there are ~1M EventSource objects, and EventSource contains the path specs. The event_source serialized attribute list contains 48k objects.
Another hypothesis: could SQLite be keeping a reference to the event source objects?
https://docs.python.org/2/library/sqlite3.html#sqlite3.connect
It looks like previously written EventSources, which are then queried back from the SQLite storage, are kept around, slowly filling memory.
(Debugging idea: use inspect to determine where EventSource objects are allocated.)
using:
e = random.choice(objgraph.by_type('EventSource'))
gc.get_referrers(e)
It looks like the object is referenced by:
len(gc.get_referrers(gc.get_referrers(e)[1]))
4
sys.getrefcount(gc.get_referrers(e)[1])
6
Is something keeping a hold on the tuples?
gc.get_referrers(gc.get_referrers(e)[1])
returns 3 frames and a local "stack".
Is the frame in https://github.com/log2timeline/plaso/blob/master/plaso/multi_processing/task_engine.py#L51 keeping a reference on the tuple?
diff --git a/plaso/multi_processing/task_engine.py b/plaso/multi_processing/task_engine.py
index 72c12486..b1c202ba 100644
--- a/plaso/multi_processing/task_engine.py
+++ b/plaso/multi_processing/task_engine.py
@@ -48,11 +48,13 @@ class _EventSourceHeap(object):
EventSource: event source or None on error.
"""
try:
- _, event_source = heapq.heappop(self._heap)
+ heap_tuple = heapq.heappop(self._heap)
except IndexError:
return None
+ event_source = heap_tuple[1]
+ del heap_tuple
return event_source
def PushEventSource(self, event_source):
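The intent of the del heap_tuple change can be illustrated with weakref (Item and pop_and_release are hypothetical names, not plaso code): in CPython, explicitly deleting the local names drops the last strong references, so the object is freed immediately rather than living on in the frame's locals:

```python
import weakref


class Item(object):
  """Hypothetical payload, standing in for an EventSource."""


def pop_and_release():
  item = Item()
  ref = weakref.ref(item)
  heap_tuple = (100, item)  # like the (weight, event_source) tuple
  event_source = heap_tuple[1]
  # Drop every local name that still references the Item; CPython's
  # reference counting frees it here, so the weak reference goes dead
  # even though the frame is still executing.
  del heap_tuple
  del event_source
  del item
  return ref() is None


assert pop_and_release()
```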
e = random.choice(objgraph.by_type('EventSource'))
len(gc.get_referrers(gc.get_referrers(e)[1])[0])
1025943
gc.get_referrers(gc.get_referrers(gc.get_referrers(e)[1])[0])
... , (100, <plaso.containers.event_sources.EventSource object at 0x7f3fe3b67b90>)], '_maximum_number_of_items': 50000}]
len(myz[1]['_heap'])
1025943
Is something wrong with how HeapFull is handled?
diff --git a/plaso/multi_processing/task_engine.py b/plaso/multi_processing/task_engine.py
index 72c12486..f3d8a085 100644
--- a/plaso/multi_processing/task_engine.py
+++ b/plaso/multi_processing/task_engine.py
@@ -41,6 +41,14 @@ class _EventSourceHeap(object):
self._heap = []
self._maximum_number_of_items = maximum_number_of_items
+ def IsFull(self):
+ """Determines if the heap is full.
+
+ Returns:
+ bool: True if the heap is full.
+ """
+ return len(self._heap) >= self._maximum_number_of_items
+
def PopEventSource(self):
"""Pops an event source from the heap.
@@ -74,7 +82,7 @@ class _EventSourceHeap(object):
heap_values = (weight, event_source)
heapq.heappush(self._heap, heap_values)
- if len(self._heap) >= self._maximum_number_of_items:
+ if self.IsFull():
raise errors.HeapFull()
@@ -451,7 +459,8 @@ class TaskMultiProcessEngine(engine.MultiProcessEngine):
self._MergeTaskStorage(storage_writer)
- self._FillEventSourceHeap(storage_writer, event_source_heap)
+ if not event_source_heap.IsFull():
+ self._FillEventSourceHeap(storage_writer, event_source_heap)
if not task and not event_source:
event_source = event_source_heap.PopEventSource()
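A self-contained sketch of the fix (BoundedHeap and HeapFull are simplified stand-ins for plaso's _EventSourceHeap and errors.HeapFull): the caller now checks IsFull() before refilling, so the heap can no longer grow far past its bound:

```python
import heapq


class HeapFull(Exception):
  """Raised when the heap has reached its maximum number of items."""


class BoundedHeap(object):
  """Simplified stand-in for _EventSourceHeap."""

  def __init__(self, maximum_number_of_items=50000):
    self._heap = []
    self._maximum_number_of_items = maximum_number_of_items

  def IsFull(self):
    """Determines if the heap is full."""
    return len(self._heap) >= self._maximum_number_of_items

  def PopEventSource(self):
    """Pops the lowest-weight item, or None if the heap is empty."""
    try:
      _, event_source = heapq.heappop(self._heap)
    except IndexError:
      return None
    return event_source

  def PushEventSource(self, weight, event_source):
    """Pushes an item; raises HeapFull once the bound is reached."""
    heapq.heappush(self._heap, (weight, event_source))
    if self.IsFull():
      raise HeapFull()


# Refill loop guard: skip filling when the heap is already full.
heap = BoundedHeap(maximum_number_of_items=2)
if not heap.IsFull():
  heap.PushEventSource(1, 'source-a')
```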
Test run with the patch now has the ~2M event sources and ~1G memory usage.
Test completed.
While testing on an APFS image with 1.1M+ files, log2timeline.py hit the 4G foreman process memory limit. Determine what is consuming most of that memory (SQLite?) and see if we can reduce it.
Presumably https://github.com/log2timeline/plaso/issues/2190 is a duplicate.