log2timeline / plaso

Super timeline all the things
https://plaso.readthedocs.io
Apache License 2.0

Analyze memory consumption of text parsers on large files #1812

Closed: joachimmetz closed this issue 6 years ago

joachimmetz commented 6 years ago

It looks like the sccm parser might be consuming a vast amount of memory when (trying to) parse large files.

joachimmetz commented 6 years ago

Have a look at /Recovery/WindowsRE/Winre.wim on the studentpc10 test image.

joachimmetz commented 6 years ago
plaso - log2timeline version 20180423

Source path : DataStore.edb
Source type : single file

Identifier      PID     Status          Memory          Sources         Events          File
Main            4520    completed       1.1 GiB         1 (0)           3 (0)           

Processing completed.
joachimmetz commented 6 years ago
[heap profile graph: memory usage peaks at about 3.066 GB over a run of roughly 70.56 Gi instructions]
joachimmetz commented 6 years ago
Time              Name              Used memory
1524549899.68247  dockerjson        109830144
1524549901.53516  symantec_scanlog  1179471872

Significant memory usage increase (from roughly 105 MiB to 1.1 GiB) after the symantec_scanlog parser was run.
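
For reference, here is a minimal sketch of how per-parser lines like the ones above could be produced using only the standard library. This is illustrative, not the profiling code plaso itself uses; the helper name and output format are made up, and it assumes Linux, where ru_maxrss is reported in kilobytes.

import resource
import time

def log_memory_usage(parser_name):
  """Prints a "timestamp parser-name bytes" line similar to the listing above."""
  # On Linux ru_maxrss is the peak resident set size in kilobytes; convert
  # to bytes to match the listing above.
  max_rss_kilobytes = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
  print('{0:.5f} {1:s} {2:d}'.format(
      time.time(), parser_name, max_rss_kilobytes * 1024))

# Example: call this after each parser run to see which parser bumps the
# resident set size.
log_memory_usage('dockerjson')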

joachimmetz commented 6 years ago
import csv

from dfvfs.lib import definitions
from dfvfs.path import factory
from dfvfs.resolver import resolver

from plaso.lib import line_reader_file

# Open the test file via dfVFS.
os_path_spec = factory.Factory.NewPathSpec(
    definitions.TYPE_INDICATOR_OS, location='DataStore.edb')
file_object = resolver.Resolver.OpenFileObject(os_path_spec)

# Reading the first row is enough to trigger the memory increase.
line_reader = line_reader_file.BinaryLineReader(file_object)
dict_reader = csv.DictReader(line_reader)
row = next(dict_reader)

Calling next() to read the first row from the csv reader appears to add roughly 1 GiB of memory: https://github.com/log2timeline/plaso/blob/master/plaso/parsers/dsv_parser.py#L138

Some context: http://stupidpythonideas.blogspot.ch/2014/09/why-does-my-100mb-file-take-2gb-of.html
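
To make the linked explanation concrete, here is a small self-contained Python 3 sketch, independent of plaso, showing how a delimited stream without line breaks becomes a single enormous row of small string objects, each carrying per-object overhead (the field contents and counts below are arbitrary):

import csv
import io
import sys

# A stream full of delimiters but without a single line break: the csv
# reader has to buffer everything up to the (missing) line terminator and
# then split it into a huge number of small field strings.
data = ','.join(['xy'] * (10 ** 6))  # roughly 3 MB of text, no newline
reader = csv.reader(io.StringIO(data))
row = next(reader)  # one "row" containing a million two-character fields

print(len(row))  # 1000000
print(sum(sys.getsizeof(field) for field in row))  # tens of MB in string objects alone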

This issue seems to be inherent to Python's csv module. Options:

joachimmetz commented 6 years ago

Limit the maximum file size supported by the DSV parser: https://codereview.appspot.com/348730043
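
As a rough illustration of that mitigation (not the actual change in the linked codereview; the threshold and names below are made up, and it assumes the plaso API of that era, where parsers raise errors.UnableToParseFile), a parser can refuse files above a fixed size before the csv module ever gets to buffer them:

from plaso.lib import errors

# Illustrative threshold only; not the value used by plaso.
_MAXIMUM_SUPPORTED_FILE_SIZE = 16 * 1024 * 1024  # 16 MiB

def _CheckFileSize(file_object):
  """Raises UnableToParseFile when the file is too large to parse safely."""
  file_size = file_object.get_size()
  if file_size > _MAXIMUM_SUPPORTED_FILE_SIZE:
    raise errors.UnableToParseFile(
        'File size: {0:d} bytes exceeds the maximum supported size.'.format(
            file_size))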

With this change, memory consumption remains sane for the test file:

Time              Name               Used memory
1524636182.10849  mcafee_protection  109965312
1524636182.1088   gdrive_synclog     109965312