ptitSeb / box64

Box64 - Linux Userspace x86_64 Emulator with a twist, targeted at ARM64 Linux devices
https://box86.org
MIT License

Error time to purge ymm #1759

Closed Da1L8-X closed 1 month ago

Da1L8-X commented 1 month ago

When I try to translate the following code:

  8f6088:   c5 f4 58 08             vaddps (%rax),%ymm1,%ymm1
  8f608c:   48 83 c0 20             add    $0x20,%rax
  8f6090:   48 39 f8                cmp    %rdi,%rax
  8f6093:   75 f3                   jne    8f6088 <_ZN3sumclIN5xsimd3avxEfEET0_T_PKS3_j+0x28>

I get the result:

0x1008f6088: C5 F4 58 08  VADDPS Gx, Vx, Ex
0xffff863328a0: 5 emitted opcodes, inst=11, barrier=0 state=0/1(1), set=0/0, use=0, need=0/0, sm=0(0/0), pred=10/14 Q1:XMM1 Q8:YMM1 (Change: V8:->YMM1) ymmUsed=0002 ymm0=(0000/0000+0000-0002=0000) purgeYmm=0002
3dc00158        LDR Q24, [xRAX]
4e38d421        VFADD V1.4S, V1.4S, V24.4S
3dc06c08        LDR Q8, [xEmu, 0x1b0]
3dc00558        LDR Q24, [xRAX, 0x10]
4e38d508        VFADD V8.4S, V8.4S, V24.4S
New Instruction x64:0x1008f608c, native:0xffff863328b4
0x1008f608c: 48 83 C0 20  ADD Ed, Ib
0xffff863328b4: 1 emitted opcodes, inst=12, barrier=0 state=3/1(1), set=3F/0, use=0, need=0/0, sm=0(0/0), pred=11 Q1:XMM1 Q8:YMM1
9100814a        ADD xRAX, xRAX, 0x20
New Instruction x64:0x1008f6090, native:0xffff863328b8
0x1008f6090: 48 39 F8  CMP Ed, Gd
0xffff863328b8: 3 emitted opcodes, inst=13, barrier=0 state=3/1(1), set=3F/8, use=0, need=0/8, sm=0(0/0), pred=12 Q1:XMM1 Q8:YMM1
eb110145        SUBS x5, xRAX, xRDI
1a9f17e4        CSET w4,cEQ
331a009a        BFI wFlags, w4, 6, 1
New Instruction x64:0x1008f6093, native:0xffff863328c4
0x1008f6093: 75 F3  JNZ ib
0xffff863328c4: 4 emitted opcodes, inst=14, barrier=0 state=0/1(1), set=0/0, use=8, need=8/0, sm=0(0/0), pred=13, jmp=11 Q1:XMM1 (Change: V8:YMM1->)
721a035f        TST wFlags, 0x40
54fffea0          B.cEQ #+-11i ; 0x400034a42b28
Purge YMM mask=0002 --------
3d806c08        STR Q8, [xEmu, 0x1b0]
---------- Purge YMM

So my point is: the JNZ doesn't store the upper half of YMM1 (Q8) before branching back, so the next iteration loads Q8 from its original address again and the accumulated value is lost!

ptitSeb commented 1 month ago

Mmm, yeah, you are correct. I need to check that. I guess the unwind there doesn't detect that YMM1 is new and fetched at this instruction. Q8 should be "purged" at the JNZ. The issue is probably that only the upper part of YMM1 is new; the lower part is already in the cache.

(I edited your post to put the logs in code marks, for readability)
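To make the failure mode concrete, here is a minimal C sketch of the loop from the log. It is not box64 code: `ymm1_hi_t`, `run_loop` and the single `float` lane standing in for the upper 128 bits of YMM1 are all hypothetical names chosen for illustration. The only thing it models is when the cached value is written back relative to the backward branch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: one float lane stands in for the upper half of YMM1. */
typedef struct {
    float mem_hi; /* backing slot in the emulated state, like [xEmu, 0x1b0] */
    float reg_hi; /* host register caching that slot, like Q8 */
} ymm1_hi_t;

/* Runs the translated loop body n times, adding data[i] on each pass.
 * purge_before_branch selects whether the cached value is stored back
 * before the backward jump, or only on the final fall-through path. */
static float run_loop(const float *data, int n, bool purge_before_branch)
{
    assert(n > 0);
    ymm1_hi_t s = { 0.0f, 0.0f };

    for (int i = 0; i < n; ++i) {
        s.reg_hi = s.mem_hi;   /* LDR Q8, [xEmu, 0x1b0] at the loop head */
        s.reg_hi += data[i];   /* VFADD V8.4S, V8.4S, V24.4S */

        /* STR Q8, [xEmu, 0x1b0]: in the reported log this store sits after
         * the branch, so it only runs once the loop is left. */
        if (purge_before_branch || i == n - 1)
            s.mem_hi = s.reg_hi;
    }
    return s.mem_hi;
}
```

With inputs 1, 2, 3, the variant that purges before the branch accumulates all three values, while the variant that purges only after the branch keeps just the last one, because every pass restarts from the stale memory slot.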

Da1L8-X commented 1 month ago

As an additional point: the vaddps is preceded by a vxorps. Only the low half is cleared, but according to x86 semantics the high half is cleared as well; however, I see no operation on the high register in box64's output. I don't know if this needs to be modified.

  8f6081:   c5 f0 57 c9             vxorps %xmm1,%xmm1,%xmm1
  8f6085:   0f 1f 00                nopl   (%rax)
  8f6088:   c5 f4 58 08             vaddps (%rax),%ymm1,%ymm1
0x1008f6081: C5 F0 57 C9  VXORPS Gx, Vx, Ex
0xffff86332890: 2 emitted opcodes, inst=9, barrier=0 state=0/1(1), set=0/0, use=0, need=0/0, sm=0(0/0), pred=8, last_ip=0x1008f6060 Q1:XMM1 (Change: V1:->XMM1) ymm0=(0000/0000+0002-0000=0002)
        3dc02c01        LDR Q1, [xEmu, 0xb0]
        6e211c21        VEOR Q1, Q1, Q1
New Instruction x64:0x1008f6085, native:0xffff86332898
0x1008f6085: 0F 1F 00  NOP (multibyte)
0xffff86332898: 2 emitted opcodes, inst=10, barrier=0 state=0/1(1), set=0/0, use=0, need=0/0, sm=0(0/0), pred=9, last_ip=0x1008f6060 Q1:XMM1 ymm0=(0002/0002+0000-0000=0002)
Purge YMM mask=0002 --------
        91068001        ADD x1, xEmu, 0x1a0
        a9017c3f        STP xZR, xZR, [x1, 0x10]
---------- Purge YMM

ptitSeb commented 1 month ago

This part is correctly handled. The high parts are often zeroed, so box64 uses a cache system to avoid doing that every time. It's the ymm0 part of the state: a simple mask that records which of the 16 upper YMM halves are zero. And as you can see, the next opcode "purges" this information to memory with the STP xZR, xZR....
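For readers unfamiliar with the idea, a rough C sketch of such a zero mask follows. It is an illustration under assumptions, not box64's actual implementation: `emu_state_t`, `mark_upper_zero` and `purge_ymm0` are hypothetical names, and the memory layout is invented. It only shows the principle of recording "upper half is zero" in a 16-bit mask and deferring the real stores until a purge.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical emulated state: names and layout are assumptions. */
typedef struct {
    uint64_t ymm_hi[16][2]; /* upper 128 bits of each YMM register in memory */
    uint16_t ymm0_mask;     /* bit i set: upper half of YMMi is known to be
                               zero but has not been written to memory yet */
} emu_state_t;

/* Record that an instruction zeroed the upper half of YMM<reg>.
 * No store is emitted here; only the mask changes. */
static void mark_upper_zero(emu_state_t *s, int reg)
{
    assert(reg >= 0 && reg < 16);
    s->ymm0_mask |= (uint16_t)(1u << reg);
}

/* Write every pending zero to memory and clear the mask.
 * Returns the number of upper halves actually stored. */
static int purge_ymm0(emu_state_t *s)
{
    int stores = 0;
    for (int reg = 0; reg < 16; ++reg) {
        if (s->ymm0_mask & (1u << reg)) {
            s->ymm_hi[reg][0] = 0; /* the pair of stores plays the role */
            s->ymm_hi[reg][1] = 0; /* of the STP xZR, xZR in the log    */
            ++stores;
        }
    }
    s->ymm0_mask = 0;
    return stores;
}
```

In this model the vxorps would only call `mark_upper_zero(&state, 1)`, giving a mask of 0002 as in the log, and the memory slot is cleared later by a single `purge_ymm0(&state)`, however many times the register was zeroed in between.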

ptitSeb commented 1 month ago

I pushed something, can you check if it solved the issue on your side?

Da1L8-X commented 1 month ago

yes, it works, thx!

ptitSeb commented 1 month ago

And thx for the ticket, that was some good debug info :D !