I want to add SSE4 opcodes but need some help. Thank you!

Mr-HappyLI commented 3 years ago

Since many programs need to use SSE4, I want to add some operations. Like this:

                case 0x15://Tony SSE4 2020-11-12
                    INST_NAME("BLENDVPD xmm1, xmm2/m128, XMM_ZERO"); //Variable Blend Packed Double-FP Values
                    /**********************************************************************
                    MASK ← XMM0
                    IF (MASK[63] = 0) THEN DEST[63:0] ← DEST[63:0]
                            ELSE DEST [63:0] ← SRC[63:0] FI
                    IF (MASK[127] = 0) THEN DEST[127:64] ← DEST[127:64]
                            ELSE DEST [127:64] ← SRC[127:64] FI
                    DEST[MAXVL-1:128] (Unmodified)
                    **************************************************************************/

                    v0 = sse_get_reg(dyn, ninst, x1, 0);//XMM0(MASK)

                    //ModRM:reg (r, w)
                    nextop = F8;
                    v1 = sse_get_reg(dyn, ninst, x2, nextop&7);

                    // op2 ModRM:r/m (r)
                    nextop = F8;
                    if((nextop&0xC0)==0xC0) {
                        v2 = sse_get_reg(dyn, ninst, x3, nextop&7);
                        VMOVQ(v1, v2);
                    } else {
                        addr = geted(dyn, addr, ninst, nextop, &ed, x3, &fixedaddress, 4095, 0); //???
                        //LDRD also have alignment requirements
                        LDR_IMM9(x2, ed, fixedaddress+0);
                        LDR_IMM9(x3, ed, fixedaddress+4);
                        VMOVtoV_D(v1, x2, x3);
                        LDR_IMM9(x2, ed, fixedaddress+8);
                        LDR_IMM9(x3, ed, fixedaddress+12);
                        VMOVtoV_D(v1+1, x2, x3);
                    }
                    break;

How do I test the value of v0? And how do I build an immediate value, or otherwise assign a value to v1? Thank you!

ptitSeb commented 3 years ago

Before trying the dynarec part, I suggest you first implement the interpreter part. It's easier, and it gives you a reference build (so inside x86run660f.c).

Now, in your code: q0 = sse_get_reg(dyn, ninst, x1, 0);//XMM0(MASK) will indeed represent XMM0/MASK (note that I renamed v0 to q0). But your handling of ModRM is wrong. Use the predefined macros: it's easier, and that is what they are there for!

            nextop = F8;
            GETGX(v0); // ModRM:reg
            GETEX(v1); // ModRM:r/m

Now you are ready to unroll the opcode. q0 is XMM0, v0 is the reg operand "xmm1" and v1 is the r/m operand "xmm2". Now, that opcode is not trivial. Ideally, you should not transfer values from MASK to general ARM registers (as this is a long process), and should just use Vxxx opcodes. But again, write the interpreter version first; that really helps in understanding what the opcode is doing.
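
As a rough reference for what the opcode does, here is a minimal standalone sketch of the BLENDVPD semantics quoted in the question. The xmm_sketch_t type and the blendvpd_ref name are made up for illustration; box86 has its own register union and interpreter macros, so this is a model of the behaviour, not the project's code.

```c
#include <stdint.h>

/* Hypothetical 128-bit register type, for illustration only;
   box86 uses its own SSE register union with different members. */
typedef union {
    uint64_t q[2];
    uint32_t ud[4];
    uint8_t  ub[16];
} xmm_sketch_t;

/* BLENDVPD dest, src, <XMM0>: for each 64-bit lane, take src when the
   top bit of the matching mask lane is set, otherwise keep dest. */
static void blendvpd_ref(xmm_sketch_t *dest, const xmm_sketch_t *src,
                         const xmm_sketch_t *mask)
{
    for (int i = 0; i < 2; ++i) {
        if (mask->q[i] >> 63)
            dest->q[i] = src->q[i];
    }
}
```

A dynarec version can then be compared lane by lane against a reference like this one.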

Mr-HappyLI commented 3 years ago

Thank you! A few days ago, I added several opcodes, and the program is ready to use, but I'm afraid there are still errors. How can I check or submit the code?

ptitSeb commented 3 years ago

That's the tricky part. Unless you specifically write a unit test (like the one for the mmx test), there isn't much available. You need to run programs that use the opcodes and check that the behaviour is correct. To submit: use a GitHub Pull Request (you'll need to fork box86 on your account for that).

Mr-HappyLI commented 3 years ago

@ptitSeb Hi, I have some questions. Can you help me? In dynarec_arm_660f.c:

                    INST_NAME("PMADDUBSW Gx,Ex");
                    nextop = F8;
                    GETGX(q0);
                    GETEX(q1);
                    v0 = fpu_get_scratch_quad(dyn);
                    v1 = fpu_get_scratch_quad(dyn);
                    VMOVL_U8(v0, q0+0);   // this is unsigned, so 0 extended
                    VMOVL_S8(v1, q1+0);   // this is signed
                    VMULQ_16(v0, v0, v1);
                    VPADDLQ_S16(v0, v0);
                    VQMOVN_S32(q0+0, v0);
                    VMOVL_U8(v0, q0+1);   // this is unsigned
                    VMOVL_S8(v1, q1+1);   // this is signed
                    VMULQ_16(v0, v0, v1);
                    VPADDLQ_S16(v0, v0);
                    VQMOVN_S32(q0+1, v0);

Does q0 refer to an XMM register? Is it 128 bits? What is the difference between q0 and v0? If q0 refers to a 128-bit register, what does "q0 + 1" mean? Thank you!

ptitSeb commented 3 years ago

So, q0 and v0 are just ints. The point is: ARM NEON has 32 double-precision registers (d0..d31), which can also be viewed as 16 quad-precision registers (q0..q15). SSE has 16 quad-precision registers, xmm0..xmm15. The GETGX / GETEX macros give a mapping from the XMM register used to the actual NEON register. The number given is the dXX NEON register, even if it's a quad. The reason is that all the VxxxxQ NEON ARM emitters use the first dXX register number anyway.

In the case of VMOVL_U8(v0, q0+1); what happens is that the opcode takes a double and expands it to a quad. I'm using an "Intel" kind of notation, so here you have Quad(v0) <- Double(q0+1). The opcode takes each uint8_t of the 64 bits of Double(q0+1), extends it to uint16_t, then builds Quad(v0) from the results.
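
The widening described here can be modelled in plain C. This is only a scalar sketch of what vmovl.u8 and vmovl.s8 compute; the function names are invented for the example and the arrays stand in for one 64-bit D register (source) and one 128-bit Q register (result).

```c
#include <stdint.h>

/* Scalar model of vmovl.u8 Qd, Dm: each of the 8 unsigned bytes in the
   64-bit source becomes one zero-extended 16-bit lane of the result. */
static void vmovl_u8_model(uint16_t dst[8], const uint8_t src[8])
{
    for (int i = 0; i < 8; ++i)
        dst[i] = (uint16_t)src[i];
}

/* Scalar model of vmovl.s8: same widening, but sign-extended. */
static void vmovl_s8_model(int16_t dst[8], const int8_t src[8])
{
    for (int i = 0; i < 8; ++i)
        dst[i] = (int16_t)src[i];
}
```

The only difference between the two is how the top byte of each 16-bit lane is filled: with zeros for the unsigned form, with copies of the sign bit for the signed form.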

Is that more clear?

Mr-HappyLI commented 3 years ago

@ptitSeb I still have some questions. Why does q0 refer to a Dxx NEON register while v0 refers to a Qxx NEON register? Is it because of the difference between "fpu_get_scratch_quad" and "fpu_get_reg_quad"? Or because of "VMOVL_U8"?

Thank you ptitSeb!

ptitSeb commented 3 years ago

It's only because of VMOVL_U8

fpu_get_scratch_quad and fpu_get_reg_quad both reserve a quad register. But depending on what the opcode expects, it can be referred to as Dxx or Q(xx/2). It's a bit tricky, but you get used to it after a while.

ptitSeb commented 3 years ago

For example, let's say fpu_get_scratch_quad(...) returns 2. That means D2 and D3 are reserved for this opcode as a scratch register. D2/D3 can also be accessed as Q1. Let's imagine fpu_get_reg_quad(...) returns 8. That means the xmm register used is mapped to D8..D9 (or Q4 in NEON notation). The regular ARM asm notation of the vmovl would be vmovl.u8 q1, d8, but in box86 notation that will be VMOVL_U8(2, 8). Clear now?
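
The D/Q numbering used in this example follows the NEON aliasing rule that Q(n) overlays D(2n) and D(2n+1). A tiny sketch of that arithmetic, with helper names that are illustrative and not box86 functions:

```c
/* NEON register aliasing: Q(n) overlays D(2n) and D(2n+1).
   These helpers only model the index arithmetic. */
static int q_index_from_d(int d) { return d / 2; }
static int d_low_of_q(int q)     { return 2 * q; }
static int d_high_of_q(int q)    { return 2 * q + 1; }
```

With these, the index 2 corresponds to Q1 (D2/D3) and the index 8 to Q4 (D8/D9), matching the numbers above.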

Mr-HappyLI commented 3 years ago

@ptitSeb
I got it! Thank you! So, based on what you said, I'm doing the following. Is that right?

                case 0x14://Tony SSE4 2020-11-17
                    INST_NAME("BLENDVPD xmm1, xmm2/m128, XMM_ZERO"); //Variable Blend Packed Single-FP Values
                    /**********************************************************************
                    MASK ← XMM0
                    IF (MASK[31] = 0) THEN DEST[31:0] ← DEST[31:0]
                            ELSE DEST [31:0] ← SRC[31:0] FI
                    IF (MASK[63] = 0) THEN DEST[63:32] ← DEST[63:32]
                            ELSE DEST [63:32] ← SRC[63:32] FI
                    IF (MASK[95] = 0) THEN DEST[95:64] ← DEST[95:64]
                            ELSE DEST [95:64] ← SRC[95:64] FI
                    IF (MASK[127] = 0) THEN DEST[127:96] ← DEST[127:96]
                            ELSE DEST [127:96] ← SRC[127:96] FI
                    DEST[MAXVL-1:128] (Unmodified)
                    **************************************************************************/

                    INST_NAME("PBLENDVB xmm1, xmm2/m128, <XMM0>"); //Variable Blend Packed Bytes
                    nextop = F8;
                    GETGX(q1); // ModRM:reg
                    GETEX(q2); // ModRM:r/w
                    q0 = sse_get_reg(dyn, ninst, x1, 0);//XMM0(MASK)

                    v0 = fpu_get_scratch_quad(dyn);

                    VSHR_U32(v0,q0+0,31);//>>31
                    VMUL_32(q1+0, v0, q2+0);

                    VSHR_U32(v0,q0+1,31);
                    VMUL_32(q1+1, v0, q2+1);
                    break;

ptitSeb commented 3 years ago

I don't think that will do what you expect. The VMUL will do:

 IF (MASK[31] = 0) THEN DEST[31:0] ← 0
                            ELSE DEST [31:0] ← SRC[31:0] FI

I think this one needs some VTBLX instead
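
To make the difference concrete, here is a scalar sketch of one 32-bit lane. It uses a plain bitwise select (spread the sign bit, then pick bits), which is a different technique from the VTBLX idea and is not NEON code; the function names are invented for the example.

```c
#include <stdint.h>

/* One 32-bit lane of the blend: spread the mask's sign bit across the
   lane, then pick src bits where it is 1 and dest bits elsewhere. */
static uint32_t blend_lane32(uint32_t dest, uint32_t src, uint32_t mask)
{
    uint32_t all = 0u - (mask >> 31);   /* 0x00000000 or 0xFFFFFFFF */
    return (src & all) | (dest & ~all);
}

/* What the shift-and-multiply version computes instead: the original
   dest value is lost whenever the sign bit is clear. */
static uint32_t mul_lane32(uint32_t src, uint32_t mask)
{
    return (mask >> 31) * src;
}
```

With the sign bit clear, blend_lane32 returns dest unchanged while mul_lane32 returns 0, which is the mismatch described above.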

ptitSeb commented 3 years ago

Again, this is a tricky opcode. Did you write the interpreter version first?

ptitSeb commented 3 years ago

(Because you don't have to write the dynarec version: the interpreter version is enough to get stuff running. It will be slower, but that's a start.)

Mr-HappyLI commented 3 years ago

I'm sorry, I'm not familiar with the interpreter version. Are there any examples in the project that I can learn from? I will write the interpreter version first. Thank you!

ptitSeb commented 3 years ago

The interpreter version is in src/emu/x86run660f.c and it's easier to handle. Look at line 291 to see where to insert the code.

Mr-HappyLI commented 3 years ago

@ptitSeb Hi, I've added some operations, like this:

                case 0x21://Tony SSE4 OP:66 0f 38 21 /r PMOVSXBD xmm1, xmm2/m32
                    INST_NAME("PMOVSXBD Vdq,Mq"); //Packed Move with Sign Extend
                    //INST_NAME("PMOVSXBD Vdq,Udq");
                    /****************************************************
                        DEST[31:0] ←SignExtend(SRC[7:0]);
                        DEST[63:32] ←SignExtend(SRC[15:8]);
                        DEST[95:64] ←SignExtend(SRC[23:16]);
                        DEST[127:96] ←SignExtend(SRC[31:24]);
                     ****************************************************/
                    nextop = F8;
                    GETGX(v0); // ModRM:reg
                    GETEX(v1); // ModRM:r/w
                    q0 = fpu_get_scratch_quad(dyn);
                    VMOVL_S8(q0, v1);
                    VMOVL_S8(v0, q0);
                    break;

And here are some interpreter versions for which I don't know how to write the dynarec version, like this:

            case 0x10://Tony SSE4  PBLENDVB xmm1, xmm2/m128, <XMM0>
                nextop = F8;
                GET_EX;
                eax1 = emu->xmm[0];//xmm0
                for (int i=0; i<16; ++i)
                {
//                    if(eax1.sb[i]&0x80)
//                    {
//                        GX.sb[i] = (EX->sb[i]);
//                    }
                    GX.sb[i] = (EX->sb[i])*((eax1.sb[i]>>7)&0x1);
                }
                break;

Are they right? Thank you!

ptitSeb commented 3 years ago

For the first block: the second VMOVL_S8 should be VMOVL_S16. Also, in the memory case, GETEX(v1) will read the full 128 bits, where only 32 bits are needed. I'm not sure whether that can be a problem.

                case 0x21://Tony SSE4 OP:66 0f 38 21 /r PMOVSXBD xmm1, xmm2/m32
                    INST_NAME("PMOVSXBD Vdq,Mq"); //Packed Move with Sign Extend
                    //INST_NAME("PMOVSXBD Vdq,Udq");
                    /****************************************************
                        DEST[31:0] ←SignExtend(SRC[7:0]);
                        DEST[63:32] ←SignExtend(SRC[15:8]);
                        DEST[95:64] ←SignExtend(SRC[23:16]);
                        DEST[127:96] ←SignExtend(SRC[31:24]);
                     ****************************************************/
                    nextop = F8;
                    GETGX(v0); // ModRM:reg
                    GETEX(v1); // ModRM:r/w
                    q0 = fpu_get_scratch_quad(dyn);
                    VMOVL_S8(q0, v1);
                    VMOVL_S16(v0, q0);
                    break;
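
The reason for the S8 then S16 pair can be seen in a scalar model of PMOVSXBD: the bytes are widened in two steps, int8 to int16 and then int16 to int32. The function name below is invented for the example, and the arrays stand in for the registers.

```c
#include <stdint.h>

/* PMOVSXBD modelled as two widening steps, mirroring the
   VMOVL_S8 then VMOVL_S16 pair: int8 -> int16 -> int32.
   Only the low 4 source bytes contribute to the result. */
static void pmovsxbd_model(int32_t dst[4], const int8_t src[4])
{
    int16_t mid[4];
    for (int i = 0; i < 4; ++i)
        mid[i] = (int16_t)src[i];   /* first widening: 8 -> 16 bits */
    for (int i = 0; i < 4; ++i)
        dst[i] = (int32_t)mid[i];   /* second widening: 16 -> 32 bits */
}
```

Using an 8-bit widening for the second step would treat each 16-bit intermediate as two separate bytes, which is why the second opcode has to be the 16-bit form.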

For the second block, my understanding is that the opcode should be indeed

                    if(eax1.sb[i]&0x80)
                    {
                        GX.sb[i] = (EX->sb[i]);
                    }

But again, that cannot be expressed with the simple multiply you wrote. If you absolutely want a multiply, it would be more like GX.sb[i] = (EX->sb[i])*((eax1.sb[i]>>7)&0x1) + (GX.sb[i])*(1-((eax1.sb[i]>>7)&0x1));
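
Both forms can be checked against each other in a small standalone sketch. The arrays stand in for GX, EX and XMM0, and the function names are made up for the example; unsigned bytes are used so that the shift is well defined.

```c
#include <stdint.h>

/* PBLENDVB with an explicit branch per byte: copy the source byte
   only when the top bit of the matching mask byte is set. */
static void pblendvb_branch(uint8_t gx[16], const uint8_t ex[16],
                            const uint8_t mask[16])
{
    for (int i = 0; i < 16; ++i) {
        if (mask[i] & 0x80)
            gx[i] = ex[i];
    }
}

/* Branch-free form: sel is 0 or 1, so exactly one of the two
   products survives and the other term contributes 0. */
static void pblendvb_mul(uint8_t gx[16], const uint8_t ex[16],
                         const uint8_t mask[16])
{
    for (int i = 0; i < 16; ++i) {
        unsigned sel = (mask[i] >> 7) & 1u;
        gx[i] = (uint8_t)(ex[i] * sel + gx[i] * (1u - sel));
    }
}
```

The branch version is the easier one to read and is probably the better choice for an interpreter; the multiply form is shown only to illustrate the term that was missing.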

Mr-HappyLI commented 3 years ago

@ptitSeb Thank you! I've fixed my mistake. Now, I've been confused about CRC32 for several days. For example: F2 0F 38 F1 C1 8B 8E A8 00 ..... Can you help me?

ptitSeb commented 3 years ago

Help you with what? This opcode uses general-purpose registers, so in the interpreter you need to use the GET_ED macro. So you start with

            nextop = F8;
            GET_ED;

After that, GD.dword[0] is the "destination" and ED->dword[0] is the source. The crc32 opcode itself is pretty complex, with some functions to write for the BIT_REFLECT32 part (according to https://www.felixcloutier.com/x86/crc32 ), so writing a utility function to compute the result is probably needed there.
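
Such a utility function could look like the sketch below. It is a bit-by-bit software CRC-32C update that works directly on the bit-reflected polynomial, so no separate reflect step is needed. The function names are invented here, and it is assumed that the instruction applies no initial or final inversion (the caller handles that); the result should still be checked against real hardware or a trusted reference.

```c
#include <stdint.h>

/* Bit-reflected form of the CRC-32C polynomial 0x1EDC6F41. */
#define CRC32C_POLY_REFLECTED 0x82F63B78u

/* Accumulate one byte into crc, bit by bit (slow but easy to check). */
static uint32_t crc32c_u8(uint32_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int bit = 0; bit < 8; ++bit)
        crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED & (0u - (crc & 1u)));
    return crc;
}

/* Accumulate a 32-bit value, least significant byte first. */
static uint32_t crc32c_u32(uint32_t crc, uint32_t value)
{
    for (int i = 0; i < 4; ++i)
        crc = crc32c_u8(crc, (uint8_t)(value >> (8 * i)));
    return crc;
}
```

In the interpreter, the r/m32 form would then be something along the lines of destination = crc32c_u32(destination, source), with a 16-bit variant built the same way from two byte steps.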

Mr-HappyLI commented 3 years ago

Hi @ptitSeb, sorry, I may not have described it clearly. I am confused about how to distinguish m16 from m32 when the opcode is the same, like this:

F2 0F 38 F1 /r CRC32 r32, r/m16	RM	Valid	Valid	Accumulate CRC32 on r/m16.
F2 0F 38 F1 /r CRC32 r32, r/m32	RM	Valid	Valid	Accumulate CRC32 on r/m32.

Another problem is that ARM has crc32c instructions. There is a difference between the ARM crc32c instruction and the x86 instruction: the polynomials are written as 0x11EDC6F41 and 0x1EDC6F41. What is the difference between them? Thank you!

ptitSeb commented 3 years ago

Whether it is m16 or m32 depends on the type of segment the code is actually running from, and on the prefix. We are running from a 32-bit segment here, so it's m32, unless a 66 or 67 prefix is used.

The ARM crc32c instruction is not available on every processor, so a test must be done before using it. Also, if it's different, it's different, so don't use it.
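
On the two constants themselves: 0x11EDC6F41 is a 33-bit value that includes the leading x^32 term, and 0x1EDC6F41 is the same value with that term left implicit. The sketch below (helper names invented for the example) shows the relation, along with the bit reversal that gives the constant used by reflected software implementations. Whether a given ARM instruction matches the x86 result bit for bit is a separate question that should be verified against a reference.

```c
#include <stdint.h>

/* Reverse the bit order of a 32-bit word (the BIT_REFLECT32 idea). */
static uint32_t bit_reflect32(uint32_t v)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; ++i) {
        r = (r << 1) | (v & 1u);
        v >>= 1;
    }
    return r;
}

/* Drop the x^32 term of a 33-bit polynomial constant. */
static uint32_t poly_low32(uint64_t poly33)
{
    return (uint32_t)(poly33 & 0xFFFFFFFFu);
}
```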

Mr-HappyLI commented 3 years ago

F2 0F 38 F0 /r CRC32 r32, r/m8	RM	Valid	Valid	Accumulate CRC32 on r/m8.

So, is this instruction m8 or m32?

ptitSeb commented 3 years ago

m8 stays m8, no matter the size of the segment or the 66/67 prefix.
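
Put together, the width selection for these two CRC32 forms in 32-bit code can be sketched as follows. This is a simplified model with an invented function name; it only considers the 66 operand-size prefix and is not how box86 structures its decoder.

```c
#include <stdint.h>

/* Operand width in bits for the F2 0F 38 F0/F1 CRC32 forms in 32-bit
   code. last_opcode is the final opcode byte (0xF0 or 0xF1); has_66 is
   nonzero when an operand-size prefix was seen. Returns 0 otherwise. */
static int crc32_operand_bits(uint8_t last_opcode, int has_66)
{
    if (last_opcode == 0xF0)
        return 8;                    /* r/m8 form: prefix has no effect */
    if (last_opcode == 0xF1)
        return has_66 ? 16 : 32;     /* r/m16 with 66, r/m32 without */
    return 0;
}
```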

Mr-HappyLI commented 3 years ago

Thank you!

ptitSeb commented 1 year ago

Converting this to a discussion.

ptitSeb / box86

I want to add sse4 opcode but need some help.Thank you! #248