Lazy loading of item fields

sepinf-inc / IPED

IPED Digital Forensic Tool. It is an open source software that can be used to process and analyze digital evidence, often seized at crime scenes by law enforcement or in a corporate investigation by private examiners.

Lazy loading of item fields #512

Open lfcnassif opened 3 years ago

lfcnassif commented 3 years ago

This is an old idea. When a new item is created, all indexed and stored fields are loaded at once. But sometimes parsers just want to query one or two item properties, so loading everything possibly wastes time. I did not detect large bottlenecks with the "load everything" approach in past tests, and because this is a sensitive change, I did not push it. @fmpfeifer if you think the current approach is a bottleneck on #486, I can push the experimental local branch for testing.
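A minimal sketch of the lazy-loading idea, not IPED's actual implementation: each field is fetched on first access and cached afterwards. The `LazyItem` class, the `fieldLoader` function, and the field names are hypothetical stand-ins for the real item class and index lookup.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class LazyItemSketch {

    /** Hypothetical item whose stored fields are fetched only when first requested. */
    static final class LazyItem {
        // Assumption: stands in for a stored-field lookup against the index.
        private final Function<String, String> fieldLoader;
        private final Map<String, String> cache = new HashMap<>();
        int loads = 0; // counts real lookups, only to make the effect visible

        LazyItem(Function<String, String> fieldLoader) {
            this.fieldLoader = fieldLoader;
        }

        String get(String field) {
            // containsKey (not a null check) so that null values are cached too
            if (cache.containsKey(field)) {
                return cache.get(field);
            }
            loads++;
            String value = fieldLoader.apply(field);
            cache.put(field, value);
            return value;
        }
    }

    public static void main(String[] args) {
        Map<String, String> stored =
                Map.of("name", "ChatStorage.sqlite", "ext", "sqlite", "hash", "abc123");
        LazyItem item = new LazyItem(stored::get);

        System.out.println(item.get("ext")); // first access triggers a lookup
        System.out.println(item.get("ext")); // second access is served from the cache
        System.out.println(item.loads);      // only one field was ever loaded
    }
}
```

A parser that reads only one property pays for one lookup instead of loading every field. The trade-off is that every getter must go through the lazy path, otherwise it can return an empty value.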

fmpfeifer commented 3 years ago

I'm seeing a performance cost while trying to solve #486, apparently not from the extra hash calculations, but from the extra indexed searches for media files. If you have this implemented, it could be a good idea to test it and see if it improves things.

lfcnassif commented 3 years ago

I will push the experimental branch tomorrow. Another idea is to query multiple items at the same time using an OR query; I don't know if you have tried this. I use the jvisualvm "Sampler->CPU" feature to measure the cost of method calls; it helps a lot to identify bottlenecks.

fmpfeifer commented 3 years ago

I was thinking exactly that a few minutes ago: do one big query in advance, instead of one query per item, and see what happens. Will try that tomorrow.
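The "one big query" idea can be sketched as follows, assuming Lucene's classic `field:value OR field:value` query syntax. The field name `hash` and the helper `buildOrQueries` are hypothetical; the values are split into chunks because Lucene limits the number of clauses in a boolean query (1024 by default).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class OrQuerySketch {

    /**
     * Builds one OR query string per chunk of values, so that many items
     * are looked up with a few searches instead of one search per item.
     */
    static List<String> buildOrQueries(String field, List<String> values, int maxClauses) {
        if (maxClauses <= 0) {
            throw new IllegalArgumentException("maxClauses must be positive");
        }
        List<String> queries = new ArrayList<>();
        for (int i = 0; i < values.size(); i += maxClauses) {
            List<String> chunk = values.subList(i, Math.min(i + maxClauses, values.size()));
            queries.add(chunk.stream()
                    .map(v -> field + ":" + v)
                    .collect(Collectors.joining(" OR ")));
        }
        return queries;
    }

    public static void main(String[] args) {
        // Three hypothetical media hashes, at most two clauses per query
        List<String> queries = buildOrQueries("hash", List.of("a1", "b2", "c3"), 2);
        queries.forEach(System.out::println);
        // prints:
        // hash:a1 OR hash:b2
        // hash:c3
    }
}
```

Values containing spaces or query syntax characters would need escaping before being joined this way.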

lfcnassif commented 3 years ago

@fmpfeifer I just pushed the experimental lazy_load_fields branch after resolving some merge conflicts with master

fmpfeifer commented 3 years ago

Tested it here on top of my whatsapp-parser-bugfix branch. It improved the processing time.

P.S.: this test was done processing only "ChatStorage.sqlite" files on top of a pre-processed case with all other data (using the --append option).

lfcnassif commented 3 years ago

Good, thank you! And what about the specific ParsingTask time, a bit above those lines?

fmpfeifer commented 3 years ago

not so much:

fmpfeifer commented 3 years ago

didn't understand the math here:

lfcnassif commented 3 years ago

Those stats were designed for cases with many items. The total task time is measured per thread, then all values are added and divided by numThreads at the end. As your case has just 1 item, the reported time should be multiplied by numThreads to get the real time. Alternatively, the performance test could be done with just 1 thread. I asked about the ParsingTask time to exclude other things, like graph generation.
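The arithmetic described above can be illustrated with a small sketch. The numbers (8 threads, 40 seconds) are made up for illustration, and `reportedSeconds` is a hypothetical helper, not IPED's statistics code.

```java
public class TaskTimeSketch {

    /** Sums the per-thread task times and divides by the number of threads. */
    static double reportedSeconds(double[] perThreadSeconds) {
        double total = 0;
        for (double s : perThreadSeconds) {
            total += s;
        }
        return total / perThreadSeconds.length;
    }

    public static void main(String[] args) {
        int numThreads = 8;
        double[] perThread = new double[numThreads];
        perThread[0] = 40.0; // a single item, so only one thread did any parsing

        double reported = reportedSeconds(perThread);
        System.out.println(reported);              // 5.0, the averaged figure shown in the stats
        System.out.println(reported * numThreads); // 40.0, the time actually spent on the item
    }
}
```

With many items, the work spreads across all threads and the averaged figure approximates wall-clock time. With a single item, it understates the real time by a factor of numThreads.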

fmpfeifer commented 3 years ago

I see. I disabled graph generation to test this.

lfcnassif commented 3 years ago

So, will not merge this for now, thanks!

fmpfeifer commented 3 years ago

OK, one more thing: I compared the results of both processing runs, and with lazy_load I noticed that item.getTypeExt() is not returning the correct file extension.

The function dpf.sp.gpinf.indexer.parsers.util.Util.getExportPath(IItemBase) is always returning the file without an extension, so item.getTypeExt() is returning either null or "".

lfcnassif commented 3 years ago

Hmm, I will take a look, thanks!

lfcnassif commented 3 years ago

Commit above should fix this.