Closed — PassMark closed this issue 4 years ago
Thanks for the proposed work-around, but as you indicate this is a duplicate of https://github.com/libyal/libesedb/issues/2, hence closing.
Tweaking the cache can improve performance, but at the cost of using more resources. As indicated in https://github.com/libyal/libesedb/issues/2, a structural solution is needed for very large database files.
@PassMark where exactly did you replace the list with an array?
Thanks for the effort in developing the library.
This issue is a duplicate of issues #2 and #40, except that we are offering a possible solution to improve the performance.
Problem: Performance is abysmal on large files, falling off a cliff once the file size passes a certain point. We didn't measure the exact location of the 'cliff', but a 3GB file was acceptable while a 6GB file (400K records) wasn't: dumping one table from the 6GB file took 3+ hours on high-end hardware.
Solution: Increased LIBESEDB_MAXIMUM_CACHE_ENTRIES_TABLE_VALUES from 32K to 128K. The cache is implemented as a hash table, but once the hash table starts to get full, entries are ejected from the cache only to be re-read multiple times later on. Making the hash table larger avoids these collisions, so entries stay cached instead of being repeatedly evicted and re-read.
In cases where a table has a large number of columns, such as the Windows.edb file with 600 columns, performance was further improved by replacing the column catalog linked list with an array. E.g. replace `libcdata_list_initialize( &( ( table_definition )->column_catalog_definition_list ) )` with `libcdata_array_initialize( &( ( table_definition )->column_catalog_definition_list ) )`.
These changes gave around a 40x speed improvement on large files: scan time went from 3 hours to about 3 minutes. Of course the trade-off is slightly increased RAM usage.