noloader / POWER8-crypto

The unofficial guide to POWER8 in-core crypto

Const pointer performance #2

Closed wschmidt-ibm closed 6 years ago

wschmidt-ibm commented 6 years ago

I find this performance loss for const pointers (end of chapter 2) to be quite mysterious. If I understand correctly, if you replace the second vec_ld argument with (const uint8_t*)mem_addr, there is performance loss? Is this true using the /opt/latest gcc compiler? Assuming so, could you please open a GCC bugzilla for this? The Power-specific code doesn't care about this, so I would assume the parser is somehow creating code that the optimizers don't handle well when the const token is present, and I can't see a valid reason for that.

noloader commented 6 years ago

@wschmidt-ibm ,

If I understand correctly, if you replace the second vec_ld argument with (const uint8_t*)mem_addr, there is performance loss?

Yes.

I used to write the function like:

uint8x16_p8 VectorLoad(const uint8_t* mem_addr, int offset)
{
    return vec_ld(offset, mem_addr);
}

And:

uint8x16_p8 VectorLoad(const uint8_t mem_addr[16], int offset)
{
    return vec_ld(offset, mem_addr);
}

The second one caused the compiler trouble and resulted in a panic (it was fixed last year). While trying to work around the panic I noticed the version below ran faster. This is the version that casts away constness:

uint8x16_p8 VectorLoad(const uint8_t* mem_addr, int offset)
{
    return vec_ld(offset, (uint8_t*)mem_addr);
}

Later, when I saw Jack Lloyd's implementation in Botan, I noticed he used the const version. I benchmarked the const and non-const versions and saw the speed-up when using the non-const version.

So I've seen the behavior in both Crypto++ and Botan.
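The two variants discussed above differ only in the cast, so they should load identical bytes; any difference would be in generated code or timing. A minimal sketch that places them side by side follows. The names VectorLoadConst, VectorLoadNonConst and LoadsAgree are hypothetical, and the non-AltiVec fallback for vec_ld is an assumption added only so the sketch builds on other machines. It is not the code from Crypto++ or Botan.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__ALTIVEC__)
# include <altivec.h>
typedef __vector unsigned char uint8x16_p8;
#else
/* ASSUMPTION: portable stand-in so the sketch builds off POWER. */
typedef struct { uint8_t b[16]; } uint8x16_p8;
static uint8x16_p8 vec_ld(int offset, const uint8_t* p)
{
    /* Like the real vec_ld, ignore the low 4 bits of the effective address. */
    uintptr_t ea = (uintptr_t)(p + offset) & ~(uintptr_t)15;
    uint8x16_p8 v;
    memcpy(v.b, (const void*)ea, sizeof v.b);
    return v;
}
#endif

/* Variant that keeps constness on the pointer passed to vec_ld. */
uint8x16_p8 VectorLoadConst(const uint8_t* mem_addr, int offset)
{
    return vec_ld(offset, mem_addr);
}

/* Variant that casts constness away before calling vec_ld. */
uint8x16_p8 VectorLoadNonConst(const uint8_t* mem_addr, int offset)
{
    return vec_ld(offset, (uint8_t*)mem_addr);
}

/* Returns 1 when both variants load the same 16 bytes, else 0. */
int LoadsAgree(const uint8_t* mem_addr, int offset)
{
    assert(mem_addr != NULL);
    uint8x16_p8 a = VectorLoadConst(mem_addr, offset);
    uint8x16_p8 b = VectorLoadNonConst(mem_addr, offset);
    return memcmp(&a, &b, sizeof a) == 0;
}
```

Calling LoadsAgree on a 16-byte aligned buffer checks only that the results match. To look for a code generation difference, one option is to compile both variants with the same flags and compare the assembly from each compiler version.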


Is this true using the /opt/latest gcc compiler?

I don't know if I observed it using gcc-latest. When I perform benchmarking I'll be sure to look at it next time and report back.

wschmidt-ibm commented 6 years ago

Thanks, I would appreciate that very much!

noloader commented 6 years ago

@wschmidt-ibm,

Is this true using the /opt/latest gcc compiler?

I don't know if I observed it using gcc-latest. When I perform benchmarking I'll be sure to look at it next time and report back.

OK, so I got to do some benchmarking last night. I did not observe the behavior with GCC 7.2.0.

I had trouble duplicating it with GCC 4.8.5 later in the evening. Earlier in the evening I could duplicate it like clockwork, an estimated 75% to 90% of the time. After around 12:00 AM or 1:00 AM EST I could not duplicate it regularly. In the early morning hours the rate dropped to about 50%.

I'm not sure what to make of the measurements. On one hand there is a measurable difference some of the time. On the other hand we don't know the events that influence the measurement. I suspect system load has something to do with it, but it is just a guess. Another open question is how powersave mode influences the measurement.

I commented out the section since the behavior is not present in GCC 7.2.0 and is not as regular as I thought in GCC 4.8.5.
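Since the measurements above vary with conditions on the machine, one common way to make such comparisons less noisy is to time many trials and report the minimum and median rather than a single run. The following is a minimal sketch of that idea, not the harness used for the numbers in this thread. All names (collect_trials, min_u64, median_u64, example_workload) are hypothetical, and clock_gettime with CLOCK_MONOTONIC is assumed to be available, as it is on Linux.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

typedef void (*workload_fn)(void);

/* Monotonic wall-clock time in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

static int cmp_u64(const void* a, const void* b)
{
    uint64_t x = *(const uint64_t*)a, y = *(const uint64_t*)b;
    return (x > y) - (x < y);
}

/* Smallest sample: the trial least disturbed by other activity. */
uint64_t min_u64(const uint64_t* samples, size_t n)
{
    assert(n > 0);
    uint64_t m = samples[0];
    for (size_t i = 1; i < n; i++)
        if (samples[i] < m) m = samples[i];
    return m;
}

/* Sorts samples in place and returns the lower median. */
uint64_t median_u64(uint64_t* samples, size_t n)
{
    assert(n > 0);
    qsort(samples, n, sizeof samples[0], cmp_u64);
    return samples[(n - 1) / 2];
}

/* Times n back-to-back runs of fn, one elapsed time per slot. */
void collect_trials(workload_fn fn, uint64_t* samples, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t t0 = now_ns();
        fn();
        samples[i] = now_ns() - t0;
    }
}

/* Placeholder workload; replace with the loop under test. */
void example_workload(void)
{
    static volatile uint32_t sink;
    uint32_t sum = 0;
    for (uint32_t i = 0; i < 100000; i++) sum += i;
    sink = sum;  /* volatile store keeps the loop from being removed */
}
```

Running collect_trials once per variant and comparing the two minimums and the two medians gives a rough picture. If the minimums agree and only the medians differ, that points toward outside interference rather than the generated code, though it does not prove it.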

wschmidt-ibm commented 6 years ago

I suspect system load is the issue. There are a lot of folks in Europe that bang on that machine, and what you report seems to coincide with when the European folks would start toddling off to bed, and even the real night owls like Segher finally gave it up for the day. Under reasonably heavy load, your data is probably getting cast out of the dcache fairly often.

Thanks for commenting that out. I don't have any great theory why 7.2 vs. 4.8.5 is so different here, other than that 4.8.5 is quite a bad compiler compared to 7.2. It was the absolute first P8 support, and while the ISA was fully enabled, there was a LOT of work left to do on making it perform.

noloader commented 6 years ago

Also see Crypto++ | Commit 9a52edcfdb0f.

munroesj52 commented 4 years ago

This is likely instruction fusion hitting or missing due to changes in the surrounding code. See the POWER8 Processor User’s Manual for the Single-Chip Module, Section 10.1.12, Instruction Fusion.

It would also change with SMT level as the kernel folds/unfolds logical processors under load.