Closed — PassMark closed this issue 4 years ago
Thanks for the proposed work-around, but as you indicate this is a duplicate of https://github.com/libyal/libesedb/issues/2, hence closing.
Tweaking the cache can improve performance, but at the cost of using more resources. As indicated in https://github.com/libyal/libesedb/issues/2, a structural solution is needed for very large database files.
@PassMark where exactly did you replace the list with an array?
Thanks for the effort in developing the library.
This issue is a duplicate of issues #2 and #40, except that we are offering a possible solution to improve the performance.
Problem: Performance is abysmal on large files, falling off a cliff once the file size passes a certain point. We didn't measure the exact location of the 'cliff', but a 3GB file was acceptable while a 6GB file (400K records) wasn't: dumping one table from the 6GB file took 3+ hours on high-end hardware.
Solution: Increased LIBESEDB_MAXIMUM_CACHE_ENTRIES_TABLE_VALUES from 32K to 128K. The cache is implemented as a hash table, but once the hash table starts to get full, entries are ejected from the cache only to be re-read multiple times later on. Making the hash table larger avoids these collisions, so entries stay cached instead of being repeatedly evicted and re-read.
In cases where a table has a large number of columns, such as the Windows.edb file with 600 columns, performance was further improved by replacing the column catalog linked list with an array. E.g. replace `libcdata_list_initialize( &( ( table_definition )->column_catalog_definition_list ) )` with `libcdata_array_initialize( &( ( table_definition )->column_catalog_definition_list ) )`.
These changes gave around a 40x speed improvement on large files: scan time went from 3 hours to about 3 minutes. Of course the trade-off is slightly increased RAM usage.