m-j-w / CpuId.jl

Ask the CPU for cache sizes, SIMD feature support, a running hypervisor, and more.

Performance number in README is completely wrong. #15

Closed yuyichao closed 7 years ago

yuyichao commented 7 years ago

The author appears to be aware of the Agner Fog tables, yet these numbers directly contradict them.

Partially copying my comment from https://github.com/JuliaLang/julia/issues/13901#issuecomment-298190768

> For comparison, 100..200 CPU cycles is roughly loading one integer from main memory

This is roughly the latency of a cache miss, which will be hidden to a large extent on an out-of-order (OOO) core. You should also be able to get a high cache hit rate (unless you're writing a GC, for example...). A typical cache hit takes a few to tens of cycles of latency, depending on which cache you hit. Also, this is the timing of loading one 64-byte cacheline, which usually contains several of the integers being used in the code.

> or one or two integer divisions.

That's the combined latency of ~5-10 integer divisions, or the combined reciprocal throughput of 10-20 of them.

> Calling any external library function is at least one order more cycles.

If it's the first time you're calling a function and the symbol needs to be resolved, it can take that long. A normal function call is at least an order of magnitude faster. Comparing against symbol resolution isn't fair anyway, since that's effectively including JIT time....

Even though the approach used in this package can be used to implement a properly cached version for serious use, the misleading README section essentially recommends the opposite.
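To illustrate the "properly cached" approach being suggested: run the cpuid query once at module load time and store the result in a `const`, so the serializing instruction never sits on a hot path. This is only a sketch; `probe_cachesize` is a hypothetical stand-in for a real query such as CpuId.jl's `cachesize()` (check the package docs for the exact call).

```julia
# Hypothetical probe; in practice replace with e.g. CpuId.cachesize().
probe_cachesize() = (32 * 1024, 256 * 1024, 8 * 1024 * 1024)

# Evaluated once, at load time -- hot code only reads a plain const.
const CACHESIZE = probe_cachesize()

# Hot code now pays a constant lookup, not a serializing cpuid.
fits_in_l1(nbytes::Integer) = nbytes <= CACHESIZE[1]
```

The point is that the cost of the query is paid exactly once, regardless of how often `fits_in_l1` runs afterwards.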

m-j-w commented 7 years ago

I guess we have different use cases in mind regarding what we base our assumptions on. But I agree that the section is too easily misinterpreted, in particular if a user is not familiar with details like what 'serialization' means. I'll update the text in the next release.

However, since we're on the topic of suggestions: clearly a global const defined at module initialization is the best choice anyway. But what would you suggest (as an example in the README) for algorithmic dispatch based e.g. on SIMD properties? Say, pick algorithm 'A' for 'AVX2', algorithm 'B' for 'SSE', otherwise 'C'? I'd probably go for a `fn(::Val{:AVX}, xs...)` dispatch etc.

yuyichao commented 7 years ago

> I guess we have different use cases in mind regarding what we base our assumptions on. But I agree that the section is too easily misinterpreted, in particular if a user is not familiar with details like what 'serialization' means. I'll update the text in the next release.

Well, the first big issue is that the numbers are completely wrong. I can't think of a way most of them can be interpreted as correct. The user doesn't need to know what serialization means; they just need to know that anything taking more than 100 cycles is extremely slow for most things that care about CPU dispatch.

> I'd probably go for a `fn(::Val{:AVX}, xs...)` dispatch etc.

For threading the dispatch through generic code and reducing the number of dispatches, sure. Definitely not for the dispatch itself.
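A minimal sketch of what this distinction means in practice: the feature check runs exactly once, at module initialization, and the resulting `Val` is then threaded through the generic code so inner methods can specialize without re-dispatching. `has_avx2`/`has_sse` here are hypothetical placeholders for real queries such as CpuId.jl's `cpufeature` (the kernels are trivial stand-ins too).

```julia
# Placeholder probes; in practice use e.g. CpuId's cpufeature query.
has_avx2() = false
has_sse()  = true

# One specialized method per instruction set, dispatched on a Val.
sum_kernel(::Val{:AVX2},    xs) = sum(xs)   # would contain AVX2-tuned code
sum_kernel(::Val{:SSE},     xs) = sum(xs)   # would contain SSE-tuned code
sum_kernel(::Val{:Generic}, xs) = sum(xs)   # portable fallback

# Decided once at load time; this branch never runs on the hot path.
const SIMD = has_avx2() ? Val(:AVX2) :
             has_sse()  ? Val(:SSE)  : Val(:Generic)

# The public entry point threads the cached Val through generic code.
sum_kernel(xs) = sum_kernel(SIMD, xs)
```

The `Val` is used only to carry the already-made decision through the call graph; it is not itself used to query the CPU per call.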