Add benchmark scripts to measure decoding perf

xoob commented 7 years ago

This PR adds Benchmark scripts to measure performance of different mechanisms for decoding binary data to integers. Each method has a comment explaining the implementation and its effect.

public static List<Integer> Benchmark(Blob data);

To run the benchmark, call Benchmark.Run(10); from the execute anonymous window.

	Benchmark	Typical	Effect
1(a)	Lookup from the `M_HEX_DEC` map.	176 ms	worse
1(b)	Like 1(a) but lookup from a list.	110 ms	similar
1(c)	Like 1(a) but instead of iterating over a list from `getChars()`, call `mid()` for each char. Lookup the by string instead of int.	400 ms	much worse
2(a)	Convert with the original `HexUtil` formula which uses the underlying ASCII math.	110 ms	baseline
2(b)	Like 2(a) but try a different formula to add the values of `c[0]` and `c[1]`.	109 ms	similar
2(c)	Like 2(a) but write to the original list and then truncate its second half. This is an attempt to allocate less memory.	132 ms	worse
2(d)	Like 2(a) but instead of iterating over a list from `getChars()`, call charAt() for each position.	309 ms	much worse
3(a)	Use if-conditions to map character to decimal.	147 ms	worse
3(b)	Like 3(a) but use if-else-conditions to map character to decimal.	115 ms	similar
4(a)	Use a regular expression to prefix every two characters with the escape sequence `"\u00XX"` and then call `getChars()` to convert it to a list of byte values.	6 ms	much better
4(b)	Like 4(a) but more compact.	6 ms	much better

Surprisingly, I discovered that this is by far the fastest and most efficient way to convert a Blob into a List of Integers:

public static List<Integer> Benchmark4B(Blob data) {       
    return EncodingUtil.convertToHex(data).replaceAll('(..)', '\\\\u00$1').getChars();
}

It makes use of the fact that two hex characters can be turned into a unicode escape sequence, which is then easily converted into integers by getChars(). The logic works as follows:

'504B0304'  ->  '\u0050\u004B\u0003\u0004'  ->  {80, 75, 3, 4}

This was a ton of fun exploring and I'd love to spend some time talking about it in more depth @pdalcol and @smithpliny. Hope this can be useful for the project.

pdalcol commented 7 years ago

@xoob this is brilliant! Especially benchmark 4. I am very surprised that 1(b) was not the fastest method given how simple it is. Good insight into how Apex is compiled/executed. It would be great to discuss further. I'll email you to setup a call.

devshane commented 7 years ago

Hello and thanks for the lib! Have these improvements been implemented? I'm running into CPU time limit exceeded errors with ~5mb zips and wondered if the changes would have any significant effect. #16 suggests the acceptable zip file sizes are very small.

Zippex.unzipAttachment() is a dream compared to the alternatives...

pdalcol commented 7 years ago

@devshane we haven't yet incorporated these optimizations into the released code but I have partially implemented it for testing purposes and it does perform better in some specific cases like unzipping .jpg files. In other cases it slows things down because by switching to arrays we lose access to high level String operations. As a result, many times we end up having to iterate through every item in a large array instead of being able to call one String method.

Most likely we need to figure out a hybrid approach. @xoob and I are planing to meet to brainstorm but I have been swamped. In the mean time, if possible, you can unzip in asynchronous mode to get a larger CPU limit. In my projects I make a separate future call for each file that I am unzipping. So for instance if there are four files in a Zip, I spin four future methods to process one file each. Queueable would work too.

devshane commented 7 years ago

@pdalcol Thanks, I'll give futures a shot.

pdalcol / Zippex

Add benchmark scripts to measure decoding perf #20