sifive / freedom-e-sdk

Open Source Software for Developing on the Freedom E Platform - Deprecated
Other
581 stars 209 forks source link

Where can I find the optimized version of strcmp()? #419

Open hyf6661669 opened 4 years ago

hyf6661669 commented 4 years ago

I read your blog here: https://www.sifive.com/blog/dhrystone-performance-tuning-on-the-freedom-platform. It is said that with the help of the optimized version of strcmp(), the dhrystone score could be improved a liittle bit. Is it opensource?

fuchingy commented 4 years ago

The one you're asking for is with the aid of a customized instruction, based on SiFive Custom Instruction Extension (SCIE). This version of strcmp is not available at the moment but will be available soon, as a demonstration of SCIE. I'll let you know when it is available.

brucehoult commented 4 years ago

I don't know what is intended for the SCIE demonstration. I do know that there are several different ways to make an instruction that will help strcmp() and similar C string functions and at least two such have been proposed for the draft BitManipulation standard extension.

A Working Group member from another company suggested a very specialized instruction "FFZMISM rd,r1,r2" which compares two registers and returns zero if all bytes of the registers match AND none of the bytes are zero. Otherwise it returns the bit offset of the first non-matching or zero byte.

You can use this in an unrolled loop like this:

LOOP: lw r1,(a0) lw r2,(a1) FFZMISM rd,r1,r2 bnez rd, MISMATCH lw r1,4(a0) lw r2,4(a1) FFZMISM rd,r1,r2 bnez rd, MISMATCH lw r1,8(a0) lw r2,8(a1) FFZMISM rd,r1,r2 bnez rd, MISMATCH lw r1,12(a0) lw r2,12(a1) add a1,a1,16 add a2,a2,16 FFZMISM rd,r1,r2 bz rd, LOOP MISMATCH: slli rd,rd,3 // convert bytes to bits srl r1,r1,rd srl r2,r2,rd andi r1,r1,0xff andi r2,r2,0xff sub a0,r1,r2 // strcmp result

It is also possible to use the much more general purpose GORC[I] which can specify subsets of bits within a register in a number of different ways -- each pair of adjacent bits, all the bits of each nybble, all the bits of each byte, each halfword, each word, or the Nth bit of each pair, nybble, byte, halfword, word (and many more) -- and OR together the bits of each subset and replace all of them with the result. GORC[I] is a very very cheap addition to the implementation of another highly desired instruction, GREV[I] (Generalized Reverse) which reverses the same subsets of bits that GORC[I] ORs together -- so it can reverse all the bits in a register, reverse the bits in each byte or halfword etc, or reverse the order of bytes (big endian/little endian) etc.

So .. GORCI rd,rs,7 sets each byte of the destination register to all 1s if any bit of that byte is set i.e. if it is non-zero.

With GORCI each iteration of the loop body would become:

lw r1,4(a0) lw r2,4(a1) GORCI rd,r1,7 bne rd, MINUSONE, ZERO_FOUND bne r1, r2, MISMATCH

The termination becomes something like:

MISMATCH: xor rd,r1,r2 GORCI rd,rd,7 ZERO_FOUND: xori rd,rd,-1 ctz rd,rd # Count Trailing Zeros, another BiManip extension instruction ... and then the same code as the FFZMISM version

So the GORCI version needs a few more instructions in the termination. It also needs one more instruction in each iteration of the loop body but the FFZMISM version has a pipeline stall waiting for r2 to load on many micro-architectures while the GORCI version is stall-free, so they actually execute in the same number of clock cycles. Overall the improvement to Dhrystone (and anything using C string manipulation extensively) is about the same with either new instruction. GORCI however is much more general purpose, so I think it is probably what will end up in the final BitManip standard extension sometime next year.