[RV64_DYNAREC] Added vector SEW cache

ptitSeb / box64

Box64 - Linux Userspace x86_64 Emulator with a twist, targeted at ARM64 Linux devices

https://box86.org

MIT License


[RV64_DYNAREC] Added vector SEW cache #1698

Closed ksco closed 2 months ago

ptitSeb commented 2 months ago

I'm not sure "reset" is handled correctly (so dyn->sew should be taken from dyn->insts[reset].sew)? Apart from that, it looks good.

ksco commented 2 months ago

I'm not sure what "reset" is here. When do we need to do a reset?

ksco commented 2 months ago

Okay, I think you're referring to the insts[reset_n]: if there is a "reset" instruction for current insts[ninst], we need to stop being smart and always generate a vsetvli instruction, right?

ptitSeb commented 2 months ago

> Okay, I think you're referring to the insts[reset_n]: if there is a "reset" instruction for current insts[ninst], we need to stop being smart and always generate a vsetvli instruction, right?

Not really. I mean, if reset_n == -2 then yes, but for other values it's different.

It's handled in void fpu_reset_cache(dynarec_rv64_t* dyn, int ninst, int reset_n) from the rv64_dynarec_helper.c file, and it's used in the blocks 85-105 in dynarec_native_pass.c

This is to handle non-linear execution flow...

ksco commented 2 months ago

I understand we need this for fpu cache management, but for sew (which is simpler), why do we need to care about reset? Can you give an example?

ptitSeb commented 2 months ago

> I understand we need this for fpu cache management, but for sew (which is simpler), why do we need to care about reset? Can you give an example?

It's a non-linear execution flow. That means the instruction's predecessor is NOT the previous opcode. See something like this:

    xxxx JZ 1f
    xxxx something
    xxxx JMP 2b
    1f: something else

At 1f, the flow is not linear: execution arrives from the JZ 1f, so all current state needs to be reset from there, not from the previous opcode.

ksco commented 2 months ago

But we did the sew change at JZ 1f, before the jump, in CacheTransform. So when the execution flow reached 1f, the SEW state was good.

ptitSeb commented 2 months ago

> But we did the sew change at JZ 1f, before the jump, in CacheTransform. So when the execution flow reached 1f, the SEW state was good.

Not really. You need to think about it across the multiple passes. This is a look into the future: 1f is a forward jump, so you need a previous pass to set the correct values.

You can just reset to nothing, no matter what, but that is a missed optimisation opportunity, which is what the reset_n scheme is there to avoid.

ksco commented 2 months ago

OK, so

for reset_n == -2, just give up and reset it no matter what (I guess -2 is mapped to BARRIER_FULL, where the predecessors are too complex to optimize?)
for reset_n >= 0, do sew = insts[reset_n].sew in pass0 and sew = insts[ninst].sew in the other passes.

ptitSeb commented 2 months ago

reset == -2 is also used for CALLRET optimisation, to be sure the return opcode doesn't have any optimisation left.

For the other case, copy what the ARM64 backend is doing:

    #if STEP > 1
    // for STEP 2 & 3, just need to refresh with current, and undo the changes (push & swap)
    dyn->n = dyn->insts[ninst].n;
    dyn->ymm_zero = dyn->insts[ninst].ymm0_in;
    neoncacheUnwind(&dyn->n);
    #else
    dyn->n = dyn->insts[reset_n].n;
    dyn->ymm_zero = dyn->insts[reset_n].ymm0_out;
    #endif