kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io

Benchmarks should run from memory, not disk #387

Open KOLANICH opened 6 years ago

KOLANICH commented 6 years ago

If we run them from disk, we measure disk performance and side effects like disk caches and/or the acoustic noise in the room the test is done in.

I have measured the refactored Python runtime code using the following code:

from timeit import timeit
from io import BytesIO
import numpy as np
import json
from kaitaistruct import KaitaiStream

binsCount = 0x1000  # number of timed repetitions (one histogram bin each)
count = 0x10000     # size of the in-memory test buffer, in bytes

# Build the test data entirely in memory, so the disk never enters the picture.
b = BytesIO(bytes(range(256)) * (count // 256))
s = KaitaiStream(b)

def init():
    # setup for each timed run: rewind the stream
    b.seek(0)

def measure():
    # read the whole buffer as signed 2-byte little-endian integers
    for i in range(count // 2):
        s.read_s2le()

bins = np.zeros(binsCount)
for i in range(binsCount):
    bins[i] = timeit(measure, init, number=1)

print(json.dumps(bins.tolist()))

I guess the measuring code could definitely borrow something from my code.

arekbulski commented 6 years ago

I already raised this issue with @GreyCat when patching the test suite, although I don't remember why it was dismissed. We should implement the tests using timeit (average or min over repeated measurements); that would rely on OS caching and therefore eliminate this problem as well. Meaning the fix for that problem would also solve this one; it supersedes it.
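
A minimal sketch of what that could look like (my wording, not actual test-suite code; the parse body is a placeholder):

from timeit import repeat

def parse_once():
    # placeholder: in the real suite this would parse a fixture that has
    # already been loaded into memory, so repeats are served from RAM
    pass

# Repeat the measurement and take the minimum: the minimum is the run with
# the least interference (cold caches, scheduler noise), so it is the closest
# thing to a "raw" figure; the average could be reported alongside it.
timings = repeat(parse_once, number=100, repeat=5)
print(min(timings) / 100)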

GreyCat commented 6 years ago

Let me get it for you: https://github.com/kaitai-io/kaitai_struct_benchmarks/pull/1#discussion_r162825217

KOLANICH commented 6 years ago

that's usually more hassle to program benchmarks in such a way

I have rewritten the Python benchmark, including reading from disk into memory, but it seems there is a memory leak somewhere: I run out of memory when reading the same stream multiple times and discarding the results.

we actually want it to be fast enough (i.e. use buffered reading), so we want reading to happen from actual file stream, not memory stream; if it is slow, then it should be fixed, not eliminated from the equation by using in-memory stream.

I think we need to measure raw performance and performance with side effects separately.

Also, I guess we need to warm up memory somehow. I mean: ensure that malloc makes no syscalls, that the needed memory is already mapped into the address space and its page table entries are already in the TLB, while the memory is still marked as free in glibc's internal memory manager. The latter part is achieved by reading, but how do we then achieve the former? If we mark the pages as free, the glibc memory manager may make a syscall to unmap them.
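
Roughly, the setup I have in mind looks like this (a sketch under my own assumptions; the path and the parse body are placeholders): read the file into memory once, do one throwaway pass to warm up the allocator and caches, then time only the parsing.

from io import BytesIO
from timeit import timeit
from kaitaistruct import KaitaiStream

path = "fixture.bin"  # placeholder input file

# Read the whole file into memory once, so the timed runs never touch the disk.
with open(path, "rb") as f:
    data = f.read()

stream = KaitaiStream(BytesIO(data))

def parse_once():
    stream.seek(0)
    # placeholder workload: read the buffer as unsigned 2-byte LE integers
    for _ in range(len(data) // 2):
        stream.read_u2le()

parse_once()  # throwaway pass: warm up allocator, page cache and interpreter
print(timeit(parse_once, number=100) / 100)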

GreyCat commented 6 years ago

I think we need to measure raw performance and performance with side effects separately.

The problem is that it's hard to define "raw performance". You're mentioning several "warm up"-like issues even in a C or Python environment. Managed environments like the JVM, .NET VM, or Node.js sport tons of such issues. Some languages have different notions of what holding data "in memory" is, i.e. byte arrays, typed arrays, buffers, StringIOs, etc. Basically, any "preparation" or whatever brings some noise into the measurements.

I agree that ideally, we should check all possible paths, i.e. "from disk using unbuffered file read calls", "from disk using buffered reads", "from disk using mmap", "from in-memory byte array", "from in-memory Uint8Array", "from in-memory string", etc. But that does not mean that we should forfeit the "from disk" benchmarks; on the contrary, we need them to assess the potential benefits of more complex solutions like mmapped files.
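
For the Python runtime, those paths could be sketched roughly like this (a hypothetical helper just for illustration; that KaitaiStream accepts an mmap object directly is my assumption, based on it only needing read/seek/tell):

import mmap
from io import BytesIO
from kaitaistruct import KaitaiStream

def make_streams(path):
    # from disk, unbuffered read() calls
    unbuffered = KaitaiStream(open(path, "rb", buffering=0))
    # from disk, Python's default buffered reads
    buffered = KaitaiStream(open(path, "rb"))
    # from disk via mmap; an mmap object is file-like (read/seek/tell),
    # which I assume is all the runtime needs
    f = open(path, "rb")
    mapped = KaitaiStream(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ))
    # from an in-memory byte buffer
    with open(path, "rb") as g:
        in_memory = KaitaiStream(BytesIO(g.read()))
    return unbuffered, buffered, mapped, in_memory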

KOLANICH commented 6 years ago

The problem is that it's hard to define "raw performance".

It's easy to define, but impossible to achieve. "Raw performance" is performance with all possible side effects eliminated. Side effects are everything except algorithm complexity.

Basically, any "preparation" or whatever brings some noise into the measurements.

Noise limits precision; warming up reduces the noise, possibly at the cost of some accuracy.

But that does not mean that we should forfeit the "from disk" benchmarks; on the contrary, we need them to assess the potential benefits of more complex solutions like mmapped files.

I agree.

GreyCat commented 6 years ago

Side effects are everything except algorithm complexity.

That's radically different from the classic definition of a side effect, and, what's more important, it's not clear at all. "Algorithm complexity" is typically a metric, so I can't really imagine how you can define "everything except a metric".

My point is that even if you start partitioning the stuff that happens into 2 bins, "good" (= raw) and "bad" (= not raw), it's not easy to tell what goes where and why. For example, you say that "disk operations" are "bad". But would the read(...) calls we're making against a memory buffer be considered "bad"? Time wasted in stream wrapper operations? Lazy flags checking / calculation? Etc., etc.?

KOLANICH commented 6 years ago

That's radically different from the classic definition of a side effect,

That it is. Here we worry not about the effects of our code on something outside, but about the effects of things outside the code on its running time.

"Algorithm complexity" is typically a metric, so I can't really imagine how you can define "everything expect a metric".

It's a mapping from input parameters to some number proportional to time. In our case the input parameter is the file, and the mapping is defined by the code generated by KSC from the ksy. "Side effects" would be additional expressions, involving additional variables not among the input parameters, inserted into this function. The good thing about this metric without any side effects is that it should be the same for all classical computers at any time. The bad thing is that it is impossible to achieve, of course.
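
In notation (my paraphrase, with symbols of my own choosing): if T(x) is the time implied purely by the generated parser for an input x, then what a benchmark actually observes is

T_\text{measured}(x) = T(x) + \varepsilon(\text{environment})

and "raw performance" means driving \varepsilon (disk, caches, allocator, VM warm-up) towards zero; that is why it is well-defined but unreachable.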

For example, you say that "disk operations" are "bad". But would the read(...) calls we're making against a memory buffer be considered "bad"? Time wasted in stream wrapper operations? Lazy flags checking / calculation? Etc., etc.?

It depends on what we want to measure. If we want to measure the efficiency of the generated code alone, it is bad; if we want to measure the efficiency of the code with respect to the interpreter and runtime, it is good. If we want to measure how well the generated code utilises the disk cache, reading from disk is good. But if I want to measure the overhead of my refactored version of the runtime (my current task: I have remeasured on a synthetic test, reading u2 a lot of times, and found that my version is a bit faster, which is strange, since the last time I measured it, a year ago, it was a bit slower; the benchmark code is https://gist.github.com/KOLANICH/17df3ff8a966f463a5b5790c79a62c76 and the results are strange, but reading u2 says nothing about the overall performance impact on reading complex file formats), it is bad. So we want to be able to remove the "side effects" that are not needed in our experiment.