log2timeline / plaso

Super timeline all the things
https://plaso.readthedocs.io
Apache License 2.0

Analyze memory consumption of text parsers on large files #1812

Closed: joachimmetz closed this issue 6 years ago

joachimmetz commented 6 years ago

It looks like the sccm parser might be consuming a vast amount of memory when (trying to) parse large files.

joachimmetz commented 6 years ago

Have a look at /Recovery/WindowsRE/Winre.wim on the studentpc10 test image.

joachimmetz commented 6 years ago
plaso - log2timeline version 20180423

Source path : DataStore.edb
Source type : single file

Identifier      PID     Status          Memory          Sources         Events          File
Main            4520    completed       1.1 GiB         1 (0)           3 (0)           

Processing completed.
joachimmetz commented 6 years ago
[heap profile graph: memory usage peaks at about 3.066 GB over a run of roughly 70.56 Gi instructions]
joachimmetz commented 6 years ago
Time              Name              Used memory
1524549899.68247  dockerjson        109830144
1524549901.53516  symantec_scanlog  1179471872

Significant memory usage increase (from roughly 105 MiB to 1.1 GiB) after the symantec_scanlog parser was run.
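
For reference, here is a minimal sketch of how per-parser lines like the ones above could be produced using only the standard library. This is illustrative, not the profiling code plaso itself uses; the helper name and output format are made up, and it assumes Linux, where ru_maxrss is reported in kilobytes.

import resource
import time

def log_memory_usage(parser_name):
  """Prints a "timestamp parser-name bytes" line similar to the listing above."""
  # On Linux ru_maxrss is the peak resident set size in kilobytes; convert
  # to bytes to match the listing above.
  max_rss_kilobytes = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
  print('{0:.5f} {1:s} {2:d}'.format(
      time.time(), parser_name, max_rss_kilobytes * 1024))

# Example: call this after each parser run to see which parser bumps the
# resident set size.
log_memory_usage('dockerjson')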

joachimmetz commented 6 years ago
import csv

from dfvfs.lib import definitions
from dfvfs.path import factory
from dfvfs.resolver import resolver

from plaso.lib import line_reader_file

# Open the test file via dfVFS.
os_path_spec = factory.Factory.NewPathSpec(
    definitions.TYPE_INDICATOR_OS, location='DataStore.edb')
file_object = resolver.Resolver.OpenFileObject(os_path_spec)

# Reading the first row is enough to trigger the memory increase.
line_reader = line_reader_file.BinaryLineReader(file_object)
dict_reader = csv.DictReader(line_reader)
row = next(dict_reader)

Calling next() to read the first row from the csv reader appears to add roughly 1 GiB of memory: https://github.com/log2timeline/plaso/blob/master/plaso/parsers/dsv_parser.py#L138

Some context: http://stupidpythonideas.blogspot.ch/2014/09/why-does-my-100mb-file-take-2gb-of.html
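
To make the linked explanation concrete, here is a small self-contained Python 3 sketch, independent of plaso, showing how a delimited stream without line breaks becomes a single enormous row of small string objects, each carrying per-object overhead (the field contents and counts below are arbitrary):

import csv
import io
import sys

# A stream full of delimiters but without a single line break: the csv
# reader has to buffer everything up to the (missing) line terminator and
# then split it into a huge number of small field strings.
data = ','.join(['xy'] * (10 ** 6))  # roughly 3 MB of text, no newline
reader = csv.reader(io.StringIO(data))
row = next(reader)  # one "row" containing a million two-character fields

print(len(row))  # 1000000
print(sum(sys.getsizeof(field) for field in row))  # tens of MB in string objects alone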

This issue seems to be inherent to Python's csv module. Options:

joachimmetz commented 6 years ago

Limit the maximum file size supported by the DSV parser: https://codereview.appspot.com/348730043
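
As a rough illustration of that mitigation (not the actual change in the linked codereview; the threshold and names below are made up, and it assumes the plaso API of that era, where parsers raise errors.UnableToParseFile), a parser can refuse files above a fixed size before the csv module ever gets to buffer them:

from plaso.lib import errors

# Illustrative threshold only; not the value used by plaso.
_MAXIMUM_SUPPORTED_FILE_SIZE = 16 * 1024 * 1024  # 16 MiB

def _CheckFileSize(file_object):
  """Raises UnableToParseFile when the file is too large to parse safely."""
  file_size = file_object.get_size()
  if file_size > _MAXIMUM_SUPPORTED_FILE_SIZE:
    raise errors.UnableToParseFile(
        'File size: {0:d} bytes exceeds the maximum supported size.'.format(
            file_size))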

With this change, memory consumption remains sane for the test file:

Time              Name               Used memory
1524636182.10849  mcafee_protection  109965312
1524636182.1088   gdrive_synclog     109965312