mupen64plus / mupen64plus-core

Core module of the Mupen64Plus project

Proper Cycle Counting #953

Open Meerkov opened 2 years ago

Meerkov commented 2 years ago

It was pointed out that currently cycle counting is only an estimate.

I didn't hear any reason not to use proper opcode cycle counts, so that every game doesn't need to guesstimate it. These cycle count estimates cause unpredictable crashes, in my experience usually around 10-15 minutes after play begins.

See, for example, https://github.com/mupen64plus/mupen64plus-core/issues/952, https://github.com/mupen64plus/mupen64plus-core/issues/951, and https://github.com/mupen64plus/mupen64plus-core/issues/209. By my count, 79 issues reference CountPerOp, which would make it something like 10% of all bugs reported to this project (likely more, since some reports might not mention that exact variable name).

If you point me where the cycle count is adjusted, I'll see if I can fix it.

mudlord commented 2 years ago

If you want to do that, also work on RDP freeze bit emulation.

Meerkov commented 2 years ago

@mudlord Unless it's a joke meant to discourage someone from working on this, you should file a different bug to track that issue.

mudlord commented 2 years ago

Certainly not a joke :). RDP freeze bit emulation is another thing needed for accurate timing. But it can certainly go in a separate bug report.

Meerkov commented 2 years ago

According to the documentation (https://hack64.net/docs/VR43XX.pdf, page 49), each instruction finishes in 1 cycle:

The VR4300 has a 5-stage instruction pipeline. This pipeline is used for floating-point operations as well as for integer operations. In a normal environment, the pipeline executes one instruction in 1 cycle.

However, the default setting is 2 cycles per opcode. That default seems strange to me.
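To make the setting concrete, here is a minimal sketch of what a fixed per-opcode estimate amounts to. It is not taken from mupen64plus-core: the `cpu_timing` struct and the `advance_fixed` function are hypothetical names, and only the idea of charging a constant number of cycles for every instruction comes from the discussion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical timing state for illustration; not the real core layout. */
struct cpu_timing {
    uint32_t count;        /* emulated cycle counter */
    uint32_t count_per_op; /* fixed estimate charged per instruction */
};

/* Charge the same fixed cost for every executed instruction,
 * regardless of what the instruction actually was. */
static void advance_fixed(struct cpu_timing *t, uint32_t instructions)
{
    assert(t != NULL);
    /* Unsigned arithmetic wraps modulo 2^32, like a 32-bit counter. */
    t->count += instructions * t->count_per_op;
}
```

With `count_per_op` set to 2, running 100 instructions advances the counter by 200, whether those instructions were simple adds or long divides. That uniformity is what makes the value an estimate.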

loganmc10 commented 2 years ago

That is because you're not taking into account data transfers, stalls, cache misses, etc. It's a very complex thing. Trust me, if it were easy, Project64, Mupen64plus, 1964, and almost every other N64 emulator throughout the years wouldn't be using this exact same hack.

CEN64 emulates the cycles accurately; if you're looking to work on a cycle-accurate emulator, it's a good place to look. It's also very slow.

loganmc10 commented 2 years ago

There is at least 1 test ROM that I know of that is supposed to verify the timings: https://github.com/PeterLemon/N64/tree/master/CPUTest/CPU/TIMINGNTSC

If you've got an idea, you can always try that ROM and see if it passes.

Meerkov commented 2 years ago

Thanks for the pointer! That seems very useful for this kind of thing.

Yup, I'm under no illusion that it will be trivial. But about 79/320 games have reported bugs related to this estimate (and surely that doesn't mean the others are immune to unpredictable crashes), so my hunch is that getting closer to the true cycle count would make games progressively more stable.

Zapeth commented 2 years ago

There is at least 1 test ROM that I know of that is supposed to verify the timings: https://github.com/PeterLemon/N64/tree/master/CPUTest/CPU/TIMINGNTSC

Just wanted to say that if someone can execute this on real hardware, screenshots of the results would be appreciated. Also maybe run it multiple times, to ensure that there isn't a variance in the results (or if there is, the results could be used to estimate a range).

See also https://github.com/PeterLemon/N64/issues/17

m4xw commented 2 years ago

I am currently working on this task and have already completed most of the base work for the cached interpreter.

fraser125 commented 1 year ago

Hopefully I understand the problem correctly. The variance in the CountPerOp is related to how often the Scalar Multiply and Divide instructions are used. This can change depending on the section of the game being played, so picking a fixed value that works is a matter of getting lucky.

I don't understand the codebase well enough to make the changes, especially with the recompiler etc., so I'll explain my proposal. The Count Register is an unsigned int. Every instruction cycle increases it by 1. When a multiply or divide instruction is executed, increment the count based on the value in the datasheet:

MULT: 5 (or 4 if it's already incremented)
MULTU: 5 (4, see above)
DIV: 37 (36, see above)
DIVU: 37 (36, see above)
DMULT: 8 (7, see above)
DMULTU: 8 (7, see above)
DDIV: 69 (68, see above)
DDIVU: 69 (68, see above)

If the per-instruction increment and the multiply/divide adjustment happen in different places in the code, simply subtract one from the values above and add the result (shown in parentheses) in the appropriate functions.
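The proposal above can be sketched as a small lookup table. This is only an illustration under stated assumptions: the `muldiv_op` enum, the `op_cycles` table, and `charge_cycles` are hypothetical names that do not come from the mupen64plus-core source, and every instruction not listed in the comment is assumed to cost 1 cycle.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical instruction classes; OP_OTHER stands for every
 * instruction that the proposal treats as a single cycle. */
enum muldiv_op {
    OP_MULT, OP_MULTU, OP_DIV, OP_DIVU,
    OP_DMULT, OP_DMULTU, OP_DDIV, OP_DDIVU,
    OP_OTHER,
    OP_KIND_COUNT
};

/* Total cycles per instruction, using the values quoted in the comment. */
static const uint8_t op_cycles[OP_KIND_COUNT] = {
    [OP_MULT]  = 5,  [OP_MULTU]  = 5,
    [OP_DIV]   = 37, [OP_DIVU]   = 37,
    [OP_DMULT] = 8,  [OP_DMULTU] = 8,
    [OP_DDIV]  = 69, [OP_DDIVU]  = 69,
    [OP_OTHER] = 1
};

/* Return the cycle counter after executing one instruction of kind `op`. */
static uint32_t charge_cycles(uint32_t cycles, enum muldiv_op op)
{
    assert((unsigned)op < OP_KIND_COUNT);
    return cycles + op_cycles[op];
}
```

Charging the full cost in one place avoids the "subtract one" bookkeeping described above. If the base increment of 1 already happens elsewhere, the table entries would instead hold the values in parentheses, with 0 for `OP_OTHER`.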

Then the last step: when the COUNT register is read, return a copy shifted right by 1 bit (because it only counts every other clock cycle).
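A minimal sketch of that read path follows. The function name `read_count` and the choice of a 64-bit internal counter are assumptions made for this example, not something stated in the thread or taken from the emulator's code.

```c
#include <stdint.h>

/* `pclock_cycles` is a hypothetical internal counter that advances once
 * per CPU clock cycle. The value visible to the guest advances at half
 * that rate, so a read returns the counter shifted right by one bit,
 * truncated to the 32-bit register width. */
static uint32_t read_count(uint64_t pclock_cycles)
{
    return (uint32_t)(pclock_cycles >> 1);
}
```

The internal counter is kept wider than 32 bits here on purpose. If it were itself a 32-bit value, the shifted result could never exceed 2^31 - 1, so the guest would see the register wrap at half its real range. A write to the register would need the inverse adjustment (a left shift), which this sketch does not cover.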

I don't want to downplay how much work this could really be; it's highly dependent on the codebase and the developer. I only suggest this since it seems to impact about 25% of the total N64 library.

NOTE: For validation testing you might be able to just change the multiply and divide functions to increment the correct value and use CountPerOp = 1. If it proves to be a good solution you can remove some of the extra code for managing that config value.

@Meerkov: Listed above are the instructions that officially take more than 1 cycle. They are on page 76 of the datasheet you linked to.

The good news is that an emulator can create a dream world for the emulated system to run inside, where "data transfers, stalls, cache misses, etc." don't actually happen. I'm sure someone can argue these should be implemented for "accuracy", but I'm not that person.

fraser125 commented 1 year ago

I just checked, and floating-point instructions have more variable timings. See Table 7-14 on page 233 of the previously linked datasheet: https://hack64.net/docs/VR43XX.pdf

It is interesting to note that the Floating Point Multiply and Divide are faster by a few clock cycles.