pdalcol / Zippex

Native Apex Zip library for Salesforce.com
MIT License
143 stars 108 forks source link

Add benchmark scripts to measure decoding perf #20

Open xoob opened 7 years ago

xoob commented 7 years ago

This PR adds Benchmark scripts to measure performance of different mechanisms for decoding binary data to integers. Each method has a comment explaining the implementation and its effect.

public static List<Integer> Benchmark(Blob data);

To run the benchmark, call Benchmark.Run(10); from the execute anonymous window.

Benchmark Typical Effect
1(a) Lookup from the M_HEX_DEC map. 176 ms worse
1(b) Like 1(a) but lookup from a list. 110 ms similar
1(c) Like 1(a) but instead of iterating over a list from getChars(), call mid() for each char. Lookup the by string instead of int. 400 ms much worse
2(a) Convert with the original HexUtil formula which uses the underlying ASCII math. 110 ms baseline
2(b) Like 2(a) but try a different formula to add the values of c[0] and c[1]. 109 ms similar
2(c) Like 2(a) but write to the original list and then truncate its second half. This is an attempt to allocate less memory. 132 ms worse
2(d) Like 2(a) but instead of iterating over a list from getChars(), call charAt() for each position. 309 ms much worse
3(a) Use if-conditions to map character to decimal. 147 ms worse
3(b) Like 3(a) but use if-else-conditions to map character to decimal. 115 ms similar
4(a) Use a regular expression to prefix every two characters with the escape sequence "\u00XX" and then call getChars() to convert it to a list of byte values. 6 ms much better
4(b) Like 4(a) but more compact. 6 ms much better

Surprisingly, I discovered that this is by far the fastest and most efficient way to convert a Blob into a List of Integers:

public static List<Integer> Benchmark4B(Blob data) {       
    return EncodingUtil.convertToHex(data).replaceAll('(..)', '\\\\u00$1').getChars();
}

It makes use of the fact that two hex characters can be turned into a unicode escape sequence, which is then easily converted into integers by getChars(). The logic works as follows:

'504B0304'  ->  '\u0050\u004B\u0003\u0004'  ->  {80, 75, 3, 4}

This was a ton of fun exploring and I'd love to spend some time talking about it in more depth @pdalcol and @smithpliny. Hope this can be useful for the project.

pdalcol commented 7 years ago

@xoob this is brilliant! Especially benchmark 4. I am very surprised that 1(b) was not the fastest method given how simple it is. Good insight into how Apex is compiled/executed. It would be great to discuss further. I'll email you to setup a call.

devshane commented 7 years ago

Hello and thanks for the lib! Have these improvements been implemented? I'm running into CPU time limit exceeded errors with ~5mb zips and wondered if the changes would have any significant effect. #16 suggests the acceptable zip file sizes are very small.

Zippex.unzipAttachment() is a dream compared to the alternatives...

pdalcol commented 7 years ago

@devshane we haven't yet incorporated these optimizations into the released code but I have partially implemented it for testing purposes and it does perform better in some specific cases like unzipping .jpg files. In other cases it slows things down because by switching to arrays we lose access to high level String operations. As a result, many times we end up having to iterate through every item in a large array instead of being able to call one String method.

Most likely we need to figure out a hybrid approach. @xoob and I are planing to meet to brainstorm but I have been swamped. In the mean time, if possible, you can unzip in asynchronous mode to get a larger CPU limit. In my projects I make a separate future call for each file that I am unzipping. So for instance if there are four files in a Zip, I spin four future methods to process one file each. Queueable would work too.

devshane commented 7 years ago

@pdalcol Thanks, I'll give futures a shot.