libyal / libewf

Libewf is a library to access the Expert Witness Compression Format (EWF)
GNU Lesser General Public License v3.0

Suggestions for performance optimization on split E01 images #124

Closed · PassMark closed this issue 3 years ago

PassMark commented 5 years ago

We noticed slow performance on split E01 images in release 20171104

For example, listing 50,000 file names in an E01 took 153 seconds, whereas doing the same with the original physical drive took a few seconds.

After a couple of days of tracing the problem, profiling the code and making some test cases, we concluded that:

  1. The cache is too small. The cache size LIBEWF_MAXIMUM_CACHE_ENTRIES_CHUNK_GROUPS is currently just 16 entries. We suggest making it at least 256 entries.
  2. The hash calculation for the cache entries doesn't take into account the split file numbers. We suggest changing the hash function to something like this in libfdata_list.c (a small collision demo follows the snippet):
// Calculate the cache entry index value
// Reserve the 8 least significant bits for element_file_index, OR'd with element_index shifted left by 8
static int libfdata_calculate_cache_entry_index(
    int element_index,
    int element_file_index,
    off64_t element_offset,
    size64_t element_size,
    uint32_t element_flags,
    int number_of_cache_entries)
{
    return ((element_index << 8) | (element_file_index & 0xFF)) % number_of_cache_entries;
}
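
For illustration, here is a standalone sketch (not part of libewf) of the collision the report implies: if chunk groups in different segment files can carry the same element_index, an index computed from element_index alone maps them all to one cache slot, while the suggested variant separates them.

#include <stdio.h>

#define NUMBER_OF_CACHE_ENTRIES 16

/* Index that ignores the segment file: entries from different segment
 * files that share an element_index fight over the same slot. */
static int cache_index_without_file(int element_index)
{
    return element_index % NUMBER_OF_CACHE_ENTRIES;
}

/* Suggested variant: mix the segment file index into the slot choice. */
static int cache_index_with_file(int element_index, int element_file_index)
{
    return ((element_index << 8) | (element_file_index & 0xFF))
        % NUMBER_OF_CACHE_ENTRIES;
}

int main(void)
{
    /* The same element_index accessed through four segment files. */
    for (int file_index = 0; file_index < 4; file_index++)
    {
        printf("file %d: without file index -> slot %d, with -> slot %d\n",
            file_index,
            cache_index_without_file(3),
            cache_index_with_file(3, file_index));
    }
    return 0;
}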

Making these changes gave a 20x performance improvement in some test cases: 153 seconds reduced to 7 seconds.

This might also be related to https://github.com/libyal/libewf/issues/74, as the poor caching will result in libewf_handle_read_buffer_at_offset being called too often.

joachimmetz commented 5 years ago

@PassMark thx for the suggestions. Note that version 20171104 of this project is experimental and has not undergone performance optimization.

> This might also be related to #74, as the poor caching will result in libewf_handle_read_buffer_at_offset being called too often.

Is this speculation or did you observe this behavior in your tests?

PassMark commented 5 years ago

Relation to #74 is speculation.

The investigation went like this:

  1. Noticed slow performance with a particular E01 test image, even on the fastest NVMe SSD available, a high-end CPU and a massive amount of RAM.
  2. Confirmed the issue on a couple of different hardware platforms.
  3. Profiled the code, which initially pointed to common_timespec_get(). But this turned out to be less significant in release code and was only a big issue in debug code, so a red herring.
  4. Release-build profiling blamed RtlReAllocateHeap() & RtlpAllocateHeapInternal(). There was a massive amount of memory allocation and reallocation: 98% of CPU time was spent in memory allocation / reallocation and only ~1% actually reading the disk. But this was something of a red herring as well.
  5. The massive amount of memory allocations was in fact due to a caching problem. Lifting the cache size to 500,000 elements mostly fixed the problem, but was expensive in terms of memory usage.
  6. Then we noticed that the cache was thrashing on split files, as the hash algorithm distributed their entries poorly.
  7. Fixing the hash allowed the cache size to be reduced to a more reasonable level.
  8. Got 20x the speed. Hopefully we didn't break anything else.
PassMark commented 5 years ago

Here is the profiling output (release mode code).

[screenshot: profiler output]
PassMark commented 5 years ago

> Hopefully we didn't break anything else.

It turns out the new hash algorithm was good for split files, but not so good for single monolithic E01 files, so there was a performance regression. (Plausibly because element_file_index is constant at 0 in a monolithic image, and with a cache size that divides 256 evenly, (element_index << 8) % number_of_cache_entries is always 0.)

We revised it again, this time using the 64-bit to 32-bit hash function from https://gist.github.com/badboy/6267743

The new version looks like this:

int libewf_segment_file_calculate_cache_entry_index(
    int element_index,
    int element_file_index,
    off64_t element_offset,
    size64_t element_size,
    uint32_t element_flags,
    int number_of_cache_entries)
{
    uint64_t key = ((uint64_t)element_index << 32) | (uint32_t)element_file_index;
    key = (~key) + (key << 18); // key = (key << 18) - key - 1;
    key = key ^ (key >> 31);
    key = key * 21; // key = (key + (key << 2)) + (key << 4);
    key = key ^ (key >> 11);
    key = key + (key << 6);
    key = key ^ (key >> 22);
    return (int)(key % number_of_cache_entries);
}

This seems to restore the single file performance to what it was and also fix the multi-file performance issue.
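
To see why the first shift-or attempt regressed on monolithic images while this version does not, here is a standalone sketch (not from the thread) counting how many cache buckets each hash uses when element_file_index is always 0. With a 256-entry cache, (element_index << 8) is always a multiple of 256, so the shift-or index degenerates to a single bucket, while the mixing hash spreads entries across the table.

#include <stdio.h>
#include <stdint.h>

#define CACHE_ENTRIES 256
#define CHUNK_GROUPS  4096

/* First attempt: degenerate for monolithic images when the cache size
 * divides 256, because (element_index << 8) % 256 == 0 for every index. */
static int shift_or_index(int element_index, int element_file_index)
{
    return ((element_index << 8) | (element_file_index & 0xFF)) % CACHE_ENTRIES;
}

/* Revised version: Thomas Wang's 64-bit to 32-bit mixing hash. */
static int mixing_index(int element_index, int element_file_index)
{
    uint64_t key = ((uint64_t) element_index << 32) | (uint32_t) element_file_index;

    key = (~key) + (key << 18);
    key = key ^ (key >> 31);
    key = key * 21;
    key = key ^ (key >> 11);
    key = key + (key << 6);
    key = key ^ (key >> 22);

    return (int)(key % CACHE_ENTRIES);
}

int main(void)
{
    int seen_shift[CACHE_ENTRIES] = {0};
    int seen_mix[CACHE_ENTRIES] = {0};
    int used_shift = 0;
    int used_mix = 0;

    /* Monolithic image: element_file_index is always 0. */
    for (int index = 0; index < CHUNK_GROUPS; index++)
    {
        seen_shift[shift_or_index(index, 0)]++;
        seen_mix[mixing_index(index, 0)]++;
    }
    for (int bucket = 0; bucket < CACHE_ENTRIES; bucket++)
    {
        used_shift += (seen_shift[bucket] > 0);
        used_mix += (seen_mix[bucket] > 0);
    }
    printf("buckets used out of %d: shift-or=%d, mixing=%d\n",
        CACHE_ENTRIES, used_shift, used_mix);

    return 0;
}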
joachimmetz commented 3 years ago

Various optimizations have been made in the meantime. I'll do some speed comparison tests with RAW and libewf-legacy.
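
A comparison like that could be driven by a small harness along these lines (a hedged sketch against libewf's public read API; the image path, read count and access pattern are illustrative, not from this thread):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#include <libewf.h>

int main(void)
{
    libewf_error_t *error = NULL;
    libewf_handle_t *handle = NULL;
    char **filenames = NULL;
    int number_of_filenames = 0;
    size64_t media_size = 0;
    uint8_t buffer[4096];

    /* Expand image.E01 into image.E01, image.E02, ... for split images. */
    if (libewf_glob("image.E01", 9, LIBEWF_FORMAT_UNKNOWN,
            &filenames, &number_of_filenames, &error) != 1)
    {
        return EXIT_FAILURE;
    }
    libewf_handle_initialize(&handle, &error);
    libewf_handle_open(handle, filenames, number_of_filenames,
        LIBEWF_OPEN_READ, &error);
    libewf_handle_get_media_size(handle, &media_size, &error);

    if (media_size == 0)
    {
        return EXIT_FAILURE;
    }
    clock_t start = clock();

    /* 50,000 random-access reads, roughly matching the scale in the report. */
    for (int i = 0; i < 50000; i++)
    {
        off64_t offset = (off64_t)(((uint64_t) rand() * 4096) % media_size);

        if (libewf_handle_read_buffer_at_offset(handle, buffer,
                sizeof(buffer), offset, &error) < 0)
        {
            break;
        }
    }
    printf("50000 reads took %.2f seconds\n",
        (double)(clock() - start) / CLOCKS_PER_SEC);

    libewf_handle_close(handle, &error);
    libewf_handle_free(&handle, &error);
    libewf_glob_free(filenames, number_of_filenames, &error);

    return EXIT_SUCCESS;
}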

joachimmetz commented 3 years ago

The cache is now MRU-based, and an initial comparison of random-access reads in https://github.com/libyal/libewf/commit/3b4894e9b6a49d737d3e913ddc91267343f1be8d indicates comparable speeds with libewf-legacy. More performance testing will be done at a later stage. Closing the issue.
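
For readers unfamiliar with the term: an MRU-ordered cache sidesteps the slot-collision problems discussed above, because any entry can occupy any slot. A generic sketch of the idea (an illustration, not libfcache's actual code):

#include <stddef.h>
#include <string.h>

#define CACHE_SIZE 16

typedef struct
{
    int file_index;
    int element_index;
    void *data;
} cache_entry_t;

static cache_entry_t cache[CACHE_SIZE];
static int cache_used = 0;

/* On a hit, move the entry to the front so the array stays ordered from
 * most to least recently used. */
static void *cache_lookup(int file_index, int element_index)
{
    for (int i = 0; i < cache_used; i++)
    {
        if (cache[i].file_index == file_index
         && cache[i].element_index == element_index)
        {
            cache_entry_t hit = cache[i];

            memmove(&cache[1], &cache[0], i * sizeof(cache_entry_t));
            cache[0] = hit;

            return cache[0].data;
        }
    }
    return NULL;
}

/* On a miss, shift everything down one slot, silently dropping the least
 * recently used entry when the cache is full, and store the new entry
 * at the front. */
static void cache_insert(int file_index, int element_index, void *data)
{
    if (cache_used < CACHE_SIZE)
    {
        cache_used++;
    }
    memmove(&cache[1], &cache[0], (cache_used - 1) * sizeof(cache_entry_t));

    cache[0].file_index = file_index;
    cache[0].element_index = element_index;
    cache[0].data = data;
}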