Closed Mr-HappyLI closed 1 year ago
Before trying the dynarec part, I suggest you first implement the interpreter part. It's easier, and it gives you a reference build to compare against (so inside x86run660f.c).
Now, in your code:
q0 = sse_get_reg(dyn, ninst, x1, 0);//XMM0(MASK)
will indeed represent XMM0/MASK (note I renamed v0 as q0)
But your handling of ModRM is wrong. Use the predefined macros: it's easier, and they are there for that!
nextop = F8;
GETGX(v0); // ModRM:reg
GETEX(v1); // ModRM:r/m
Now you are ready to unroll the opcode. q0 is XMM0, v0 is reg "xmm1" and v1 is r/m "xmm2". Now, that opcode is not trivial. Ideally, you should not transfer values from MASK to some general ARM register (as this is a long process), and just use Vxxx opcodes. But again, write the interpreter version first; that really helps to understand what the opcode is doing.
Thank you! A few days ago, I added several opcodes, and the program is ready to use, but I'm afraid there are still errors. How can I check or submit the code?
That's the tricky part. Unless you specifically write a unit test (like the one for the mmx test), there isn't much available. You need to run a program that uses the opcodes and check that the behaviour is correct. To submit: use a GitHub Pull Request (you'll need to fork box86 on your account for that).
@ptitSeb Hi, I have some doubts. Can you help me? In dynarec_arm_660f.c
INST_NAME("PMADDUBSW Gx,Ex");
nextop = F8;
GETGX(q0);
GETEX(q1);
v0 = fpu_get_scratch_quad(dyn);
v1 = fpu_get_scratch_quad(dyn);
VMOVL_U8(v0, q0+0); // this is unsigned, so 0 extended
VMOVL_S8(v1, q1+0); // this is signed
VMULQ_16(v0, v0, v1);
VPADDLQ_S16(v0, v0);
VQMOVN_S32(q0+0, v0);
VMOVL_U8(v0, q0+1); // this is unsigned
VMOVL_S8(v1, q1+1); // this is signed
VMULQ_16(v0, v0, v1);
VPADDLQ_S16(v0, v0);
VQMOVN_S32(q0+1, v0);
Does q0 refer to an XMM register? Is it 128 bits? What is the difference between q0 and v0? If q0 refers to a 128-bit register, what does "q0 + 1" mean? Thank you!
So, q0 and v0 are just ints. The point is: ARM NEON has 32 double-word (64-bit) registers (d0..d31) that can also be viewed as 16 quad-word (128-bit) registers (q0..q15). SSE has 16 quad-sized registers, xmm0..xmm15. The GETGX / GETEX macros give a mapping from the XMM register used to the actual NEON register. The number given is the dXX NEON register, even if it's a quad. The reason is that all VxxxxQ NEON ARM emitters use the first dXX register number anyway.
In the case of VMOVL_U8(v0, q0+1); what is happening is that the opcode takes a double and expands it to a quad. I'm using an "Intel" kind of notation, so here you have Quad(v0) <- Double(q0+1).
That opcode takes each uint8_t of the 64 bits of Double(q0+1), extends it to uint16_t, then creates a Quad(v0).
Is that clearer?
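To make the widening step concrete, here is a minimal scalar sketch in plain C of what VMOVL_U8 and VMOVL_S8 do to their lanes. The function names are hypothetical and are not part of box86; they only model the lane arithmetic described above.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of VMOVL.U8: widen the 8 unsigned bytes of a 64-bit
   double register into the 8 unsigned 16-bit lanes of a quad register. */
void model_vmovl_u8(uint16_t quad[8], const uint8_t dbl[8])
{
    for (int i = 0; i < 8; ++i)
        quad[i] = dbl[i];   /* zero extension: 0xFF becomes 0x00FF */
}

/* Scalar model of VMOVL.S8: same widening, but sign extended. */
void model_vmovl_s8(int16_t quad[8], const int8_t dbl[8])
{
    for (int i = 0; i < 8; ++i)
        quad[i] = dbl[i];   /* sign extension: -1 stays -1 */
}

/* Small usage example for the unsigned variant. */
void vmovl_example(void)
{
    const uint8_t src[8] = {0, 1, 2, 127, 128, 200, 254, 255};
    uint16_t wide[8];
    model_vmovl_u8(wide, src);
    assert(wide[4] == 128 && wide[7] == 255);
}
```

The only difference between the two variants is how the top bit of each byte is treated, which is why PMADDUBSW uses the unsigned form for one operand and the signed form for the other.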
@ptitSeb I still have some questions. Why does q0 refer to a Dxx NEON register while v0 refers to a Qxx NEON register? Is it because of the difference between "fpu_get_scratch_quad" and "fpu_get_reg_quad"? Or because of "VMOVL_U8"?
Thank you ptitSeb!
It's only because of VMOVL_U8. fpu_get_scratch_quad and fpu_get_reg_quad both reserve a quad register. But depending on what the opcode expects, it can be referred to as Dxx or Q(xx/2). It's a bit tricky, but you get used to it after a while.
For example, let's say fpu_get_scratch_quad(...) returns 2. That means D2 and D3 are reserved for this opcode as a scratch register. D2/D3 can also be accessed as Q1.
Let's imagine fpu_get_reg_quad(...) returns 8. That means the xmm register used is mapped to D8..D9 (or Q4 in NEON notation).
The regular ARM asm notation of the vmovl would be vmovl.u8 q1, d8, but in box86 notation, that will be VMOVL_U8(2, 8).
Clear now?
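The numbering rule can be written down as a few lines of C. This is only an illustrative sketch: the helper names are hypothetical and do not exist in box86; they just encode the "Dxx or Q(xx/2)" relation from the explanation above.

```c
#include <assert.h>

/* Hypothetical helpers (not box86 API): convert between D and Q numbering. */
int q_index_from_d(int d) { return d / 2; }      /* D8 -> Q4 */
int first_d_of_q(int q)   { return q * 2; }      /* Q1 -> D2 */
int second_d_of_q(int q)  { return q * 2 + 1; }  /* Q1 -> D3 */

/* The two cases used in the explanation above. */
void numbering_example(void)
{
    assert(q_index_from_d(2) == 1);   /* scratch D2/D3 is Q1 */
    assert(q_index_from_d(8) == 4);   /* xmm mapped to D8/D9 is Q4 */
}
```

So VMOVL_U8(2, 8) in box86 notation names both operands by their first D register, even though the destination is used as a quad.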
@ptitSeb
I got it! Thank you!
So, as you mean, I'm doing the following, right?
case 0x14://Tony SSE4 2020-11-17
INST_NAME("BLENDVPS xmm1, xmm2/m128, <XMM0>"); //Variable Blend Packed Single-FP Values
/**********************************************************************
MASK ← XMM0
IF (MASK[31] = 0) THEN DEST[31:0] ← DEST[31:0]
ELSE DEST [31:0] ← SRC[31:0] FI
IF (MASK[63] = 0) THEN DEST[63:32] ← DEST[63:32]
ELSE DEST [63:32] ← SRC[63:32] FI
IF (MASK[95] = 0) THEN DEST[95:64] ← DEST[95:64]
ELSE DEST [95:64] ← SRC[95:64] FI
IF (MASK[127] = 0) THEN DEST[127:96] ← DEST[127:96]
ELSE DEST [127:96] ← SRC[127:96] FI
DEST[MAXVL-1:128] (Unmodified)
**************************************************************************/
INST_NAME("PBLENDVB xmm1, xmm2/m128, <XMM0>"); //Variable Blend Packed Bytes
nextop = F8;
GETGX(q1); // ModRM:reg
GETEX(q2); // ModRM:r/m
q0 = sse_get_reg(dyn, ninst, x1, 0);//XMM0(MASK)
v0 = fpu_get_scratch_quad(dyn);
VSHR_U32(v0,q0+0,31);//>>31
VMUL_32(q1+0, v0, q2+0);
VSHR_U32(v0,q0+1,31);
VMUL_32(q1+1, v0, q2+1);
break;
I don't think that will do what you expect. The VMUL will do:
IF (MASK[31] = 0) THEN DEST[31:0] ← 0
ELSE DEST [31:0] ← SRC[31:0] FI
I think this one needs some VTBLX instead.
Again, this is a tricky opcode. Did you write the interpreter version first?
(You don't have to write the dynarec version; the interpreter version is enough to get stuff running. It will be slower, but that's a start.)
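The difference between the multiply and a real blend shows up clearly on a single 32-bit lane. The sketch below is plain C with hypothetical function names. The bitwise select is just one way to express the specified semantics and is an assumption on my part, not the approach suggested in this thread or used in box86.

```c
#include <assert.h>
#include <stdint.h>

/* What the multiply-based attempt computes for one 32-bit lane. */
uint32_t lane_multiply(uint32_t dest, uint32_t src, uint32_t mask)
{
    (void)dest;                    /* the old destination value is lost */
    return (mask >> 31) * src;     /* gives 0 when mask bit 31 is clear */
}

/* What the blend pseudocode specifies for one lane, written as a
   bitwise select (an illustrative choice, see the note above). */
uint32_t lane_blend(uint32_t dest, uint32_t src, uint32_t mask)
{
    uint32_t sel = 0u - (mask >> 31);   /* 0 or 0xFFFFFFFF */
    return (src & sel) | (dest & ~sel);
}

/* With the mask bit clear, only the blend keeps the destination. */
void blend_example(void)
{
    assert(lane_multiply(0x11111111u, 0x22222222u, 0u) == 0u);
    assert(lane_blend(0x11111111u, 0x22222222u, 0u) == 0x11111111u);
}
```

Both functions agree when the mask bit is set; they only diverge in the "keep DEST" case, which is exactly the case the multiply gets wrong.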
I'm sorry, I don't know the interpreter version. Are there any examples in the project that I can learn from? I will write the interpreter version first. Thank you!
The interpreter version is in src/emu/x86run660f.c and it's easier to handle. Look at line 291 to see where to insert the code.
@ptitSeb Hi, I've added some operations, like this:
case 0x21://Tony SSE4 OP:66 0f 38 21 /r PMOVSXBD xmm1, xmm2/m32
INST_NAME("PMOVSXBD Vdq,Mq"); //Packed Move with Sign Extend
//INST_NAME("PMOVSXBD Vdq,Udq");
/****************************************************
DEST[31:0] ←SignExtend(SRC[7:0]);
DEST[63:32] ←SignExtend(SRC[15:8]);
DEST[95:64] ←SignExtend(SRC[23:16]);
DEST[127:96] ←SignExtend(SRC[31:24]);
****************************************************/
nextop = F8;
GETGX(v0); // ModRM:reg
GETEX(v1); // ModRM:r/m
q0 = fpu_get_scratch_quad(dyn);
VMOVL_S8(q0, v1);
VMOVL_S8(v0, q0);
break;
And some interpreter versions for which I don't know how to write the dynarec version, like this:
case 0x10://Tony SSE4 PBLENDVB xmm1, xmm2/m128, <XMM0>
nextop = F8;
GET_EX;
eax1 = emu->xmm[0];//xmm0
for (int i=0; i<16; ++i)
{
// if(eax1.sb[i]&0x80)
// {
// GX.sb[i] = (EX->sb[i]);
// }
GX.sb[i] = (EX->sb[i])*((eax1.sb[i]>>7)&0x1);
}
break;
Are they right? Thank you!
For the first block: the 2nd VMOVL_S8 should be VMOVL_S16. Also, in the memory case, GETEX(v1) will read the full 128 bits, where only 32 bits are needed. I'm not sure whether that can be a problem.
case 0x21://Tony SSE4 OP:66 0f 38 21 /r PMOVSXBD xmm1, xmm2/m32
INST_NAME("PMOVSXBD Vdq,Mq"); //Packed Move with Sign Extend
//INST_NAME("PMOVSXBD Vdq,Udq");
/****************************************************
DEST[31:0] ←SignExtend(SRC[7:0]);
DEST[63:32] ←SignExtend(SRC[15:8]);
DEST[95:64] ←SignExtend(SRC[23:16]);
DEST[127:96] ←SignExtend(SRC[31:24]);
****************************************************/
nextop = F8;
GETGX(v0); // ModRM:reg
GETEX(v1); // ModRM:r/m
q0 = fpu_get_scratch_quad(dyn);
VMOVL_S8(q0, v1);
VMOVL_S16(v0, q0);
break;
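As a reference to test against, here is a minimal scalar sketch of PMOVSXBD in plain C, following the pseudocode in the comment block above. The function name is hypothetical. It mirrors the two widening steps of the corrected sequence: the first widens bytes to 16 bits, the second must widen 16-bit lanes to 32 bits, which is why a second byte-wise widening would not match.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of PMOVSXBD: sign-extend the 4 low bytes of the source
   into four 32-bit lanes, in two steps like the NEON sequence. */
void model_pmovsxbd(int32_t dest[4], const int8_t src[4])
{
    int16_t mid[4];
    for (int i = 0; i < 4; ++i)
        mid[i] = src[i];     /* 8 -> 16 bits, like VMOVL_S8 */
    for (int i = 0; i < 4; ++i)
        dest[i] = mid[i];    /* 16 -> 32 bits, like VMOVL_S16 */
}

/* Negative bytes must stay negative after both steps. */
void pmovsxbd_example(void)
{
    const int8_t src[4] = {-1, 127, -128, 0};
    int32_t dest[4];
    model_pmovsxbd(dest, src);
    assert(dest[0] == -1 && dest[2] == -128);
}
```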
For the second block, my understanding is that the opcode should be indeed
if(eax1.sb[i]&0x80)
{
GX.sb[i] = (EX->sb[i]);
}
but again, that cannot be replaced by the simple multiply you wrote.
If you absolutely want a multiply, it would be more like GX.sb[i] = (EX->sb[i])*((eax1.sb[i]>>7)&0x1) + (GX.sb[i])*(1-((eax1.sb[i]>>7)&0x1));
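The two formulations can be checked against each other with a small standalone C sketch. The arrays stand in for GX, EX and XMM0; the function names are hypothetical and the code is independent of the box86 emulator structures.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Conditional form: copy the source byte only where mask bit 7 is set. */
void pblendvb_select(uint8_t gx[16], const uint8_t ex[16], const uint8_t mask[16])
{
    for (int i = 0; i < 16; ++i)
        if (mask[i] & 0x80)
            gx[i] = ex[i];
}

/* Arithmetic form: bit is 0 or 1, so exactly one product survives. */
void pblendvb_multiply(uint8_t gx[16], const uint8_t ex[16], const uint8_t mask[16])
{
    for (int i = 0; i < 16; ++i) {
        int bit = (mask[i] >> 7) & 1;
        gx[i] = (uint8_t)(ex[i] * bit + gx[i] * (1 - bit));
    }
}

/* Both forms should produce the same destination on the same input. */
void pblendvb_example(void)
{
    uint8_t a[16], b[16], ex[16], mask[16];
    for (int i = 0; i < 16; ++i) {
        a[i] = b[i] = (uint8_t)(i * 3);
        ex[i] = (uint8_t)(200 + i);
        mask[i] = (i & 1) ? 0x80 : 0x7F;
    }
    pblendvb_select(a, ex, mask);
    pblendvb_multiply(b, ex, mask);
    assert(memcmp(a, b, 16) == 0);
}
```

The conditional form is the more readable of the two, and it makes it obvious that the destination byte is left untouched when the mask bit is clear.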
@ptitSeb Thank you! I've fixed my mistake. Now, I've been confused about CRC32 for several days. Like this: F2 0F 38 F1 C1 8B 8E A8 00 ..... Can you help me?
Help you with what?
This opcode uses general registers, so in the interpreter you need to use the GET_ED macro. So you start with
nextop = F8;
GET_ED;
After that, GD.dword[0] is the "destination", and ED->dword[0] is the source.
The crc32 opcode itself is pretty complex, with some function to write for the BIT_REFLECT32 part (according to https://www.felixcloutier.com/x86/crc32 ), so writing a utility function to compute the result is probably needed there.
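Such a utility function could look like the sketch below. It is a bitwise software CRC-32C update in plain C, under the assumption that the instruction is equivalent to the usual bit-reflected update with no initial or final inversion (the caller applies those). The function names and the macro are hypothetical, and the code is not taken from box86.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit-reflected form of the CRC-32C (Castagnoli) polynomial. */
#define CRC32C_POLY_REFLECTED 0x82F63B78u

/* Accumulate one byte into the running CRC, one bit at a time. */
uint32_t crc32c_accumulate_u8(uint32_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int bit = 0; bit < 8; ++bit)
        crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED & (0u - (crc & 1u)));
    return crc;
}

/* Accumulate a 32-bit value, lowest byte first (little-endian order). */
uint32_t crc32c_accumulate_u32(uint32_t crc, uint32_t value)
{
    for (int i = 0; i < 4; ++i)
        crc = crc32c_accumulate_u8(crc, (uint8_t)(value >> (8 * i)));
    return crc;
}

/* Standard CRC-32C check: "123456789" with inverted start and end value. */
void crc32c_example(void)
{
    const char *msg = "123456789";
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < 9; ++i)
        crc = crc32c_accumulate_u8(crc, (uint8_t)msg[i]);
    assert((crc ^ 0xFFFFFFFFu) == 0xE3069283u);
}
```

A table-driven version would be faster, but the bitwise loop is easier to compare against the pseudocode in the reference when validating the interpreter.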
Hi @ptitSeb, sorry, I may not have described it clearly. I am confused about how to distinguish m16 and m32 for the same opcode, like this:
| Opcode/Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description |
|---|---|---|---|---|
| F2 0F 38 F1 /r CRC32 r32, r/m16 | RM | Valid | Valid | Accumulate CRC32 on r/m16. |
| F2 0F 38 F1 /r CRC32 r32, r/m32 | RM | Valid | Valid | Accumulate CRC32 on r/m32. |
Another problem is that there are crc32c instructions on ARM. There seems to be a difference between the ARM crc32c instruction and the x86 instruction: polynomial 0x11EDC6F41 versus 0x1EDC6F41. What is the difference between them? Thank you!
Whether it's m16 or m32 depends on the type of segment the code is actually running from, and on the prefix. We are running from a 32-bit segment here, so it's m32, unless a 66 or 67 prefix is used.
The ARM crc32c is not available on every processor, so a test must be done before using it. Also, if it's different, it's different: don't use it.
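On the two constants from the question: numerically, they differ only in whether the x^32 term is written out, and bit-reflecting the 32-bit form gives the constant commonly used in software CRC-32C loops. The sketch below only checks that arithmetic; whether the ARM instruction then behaves identically to the x86 one in every respect is not established in this thread. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the order of the 32 bits of v. */
uint32_t bit_reflect32(uint32_t v)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; ++i) {
        r = (r << 1) | (v & 1u);
        v >>= 1;
    }
    return r;
}

/* Compare the two ways of writing the polynomial. */
void polynomial_example(void)
{
    const uint64_t with_top_bit = 0x11EDC6F41ull;  /* includes the x^32 term */
    const uint32_t without_top  = 0x1EDC6F41u;     /* x^32 term left implicit */
    assert((uint32_t)(with_top_bit & 0xFFFFFFFFull) == without_top);
    assert(bit_reflect32(without_top) == 0x82F63B78u);
}
```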
| Opcode/Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description |
|---|---|---|---|---|
| F2 0F 38 F0 /r CRC32 r32, r/m8 | RM | Valid | Valid | Accumulate CRC32 on r/m8. |
So, is this instruction m8 or m32?
m8 stays m8, no matter the size of the segment selector or the 66/67 prefix.
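The size rules from this exchange can be summarised in a tiny helper. This is a simplified sketch under stated assumptions: code runs from a 32-bit segment, only the 66 prefix is modelled as the size override, and the function name and parameters are hypothetical rather than box86 code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: source operand width in bits for CRC32,
   given the last opcode byte (F0 or F1) and whether a 66 prefix was seen. */
int crc32_source_bits(uint8_t opcode, int has_66_prefix)
{
    if (opcode == 0xF0)
        return 8;                      /* r/m8 stays 8 bits regardless of prefix */
    return has_66_prefix ? 16 : 32;    /* F1: r/m32 by default in a 32-bit segment */
}

/* The cases discussed above. */
void crc32_size_example(void)
{
    assert(crc32_source_bits(0xF0, 0) == 8);
    assert(crc32_source_bits(0xF1, 0) == 32);
}
```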
Thank you!
Converting this to a discussion.
Since many programs need to use SSE4, I want to add some operations. Like this:
How do I check the value of V0? And how do I build an immediate value myself, or otherwise assign a value to V1? Thank you!