google / gematria

Machine learning for machine code.
Apache License 2.0

semantics of inverse_throughput_cycles and prefix_inverse_throughput_cycles #179

Closed: ronghongbo closed this issue 2 months ago

ronghongbo commented 2 months ago

Hello, is it accurate to say that inverse_throughput_cycles refers to CPI (cycles per instruction)?

I do not understand the description of prefix_inverse_throughput_cycles in throughput.py: "The number of cycles of inverse throughput of the prefixes of the basic block.". There can be prefixes like REP for instructions, but what are the prefixes of a basic block?

Thanks! Hongbo

boomanaiden154 commented 2 months ago

It's not correct to say that inverse_throughput_cycles refers to CPI. It refers to the number of cycles needed per execution of the whole block (the reciprocal of throughput), measured in steady state. I think the BHive/UICA papers have the exact formula (essentially two measurements, one twice as long as the other, whose difference removes startup effects).
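
For illustration, here's a minimal sketch of that differencing approach in Python. `measure_cycles` is a hypothetical helper (not part of Gematria) that runs the block unrolled `reps` times and returns the total cycle count:

    from typing import Callable, Sequence

    def inverse_throughput(
        measure_cycles: Callable[[Sequence[str], int], float],
        block: Sequence[str],
        reps: int = 100,
    ) -> float:
        """Steady-state cycles per execution of `block`, startup effects removed."""
        short = measure_cycles(block, reps)     # startup cost + reps executions
        long = measure_cycles(block, 2 * reps)  # same startup cost, twice the work
        # The shared startup cost cancels in the difference, leaving only the
        # cycles spent on the extra `reps` steady-state executions.
        return (long - short) / reps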

The definition in the code could probably use a bit of improving though.

I'm not sure what prefix_inverse_throughput_cycles is used for. I believe it is referring to a previous block. @ondrasej will have more context on that. To me it seems like support for an experiment that didn't end up panning out (it doesn't show up in any of the importers), but I could be wrong there.

ronghongbo commented 2 months ago

I see. Thanks, Aiden!

ondrasej commented 2 months ago

I'm sorry for the late response.

I do not understand the description of prefix_inverse_throughput_cycles in throughput.py: "The number of cycles of inverse throughput of the prefixes of the basic block.". There can be prefixes like REP for instructions, but what are the prefixes of a basic block?

The basic block prefixes have nothing to do with instruction prefixes (REP, LOCK, ...). At one point, we tested the idea that the models would be easier to train if we had per-instruction contributions for each basic block and could use all of them at the same time during training. So, instead of measuring only

        lea     eax, [rdi-1]
        imul    eax, edi
        shr     eax

and training on the block as a whole, we'd also measure

        lea     eax, [rdi-1]
        imul    eax, edi

and

        lea     eax, [rdi-1]

and take the increment from N instructions to N+1 as the inverse throughput contributed by the one additional instruction. We'd then train the model to predict the block's inverse throughput as the sum of these increments.
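
To make this concrete, here's a small sketch (mine, not code from the repository) of turning such per-prefix measurements into per-instruction increments; `prefix_cycles[i]` is assumed to hold the measured inverse throughput of the first i+1 instructions:

    def per_instruction_increments(prefix_cycles: list[float]) -> list[float]:
        """Inverse-throughput increment contributed by each added instruction."""
        increments = [prefix_cycles[0]]  # full cost of the one-instruction prefix
        for shorter, longer in zip(prefix_cycles, prefix_cycles[1:]):
            increments.append(longer - shorter)
        return increments

By construction, the increments sum to prefix_cycles[-1], the inverse throughput of the whole block, which is what makes the sum a natural training target.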

This was more interesting for the LSTM-based (Ithemal-like) models because, in theory, it could allow them to generalize to much longer instruction sequences than they can handle now. But in the end, it didn't bring the expected improvements: in practice, models trained only with the sum performed just as well and were faster to train. We didn't remove the code, though, and it's still there for people to experiment with.

For the graph models (GRANITE), this approach (computing per-instruction increments but training only on their sum) also proved to work best.
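
As a rough illustration of that sum-only objective (again my sketch, not the GRANITE code), the loss constrains only the total of the per-instruction predictions:

    def sum_only_loss(predicted_increments: list[float],
                      measured_block_cycles: float) -> float:
        """Squared error between the summed per-instruction predictions and
        the measured inverse throughput of the whole block."""
        return (sum(predicted_increments) - measured_block_cycles) ** 2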

By the way, I've opened the Discussions section here in the repository. Feel free to use it for any other questions.

ronghongbo commented 2 months ago

Yes, this makes a lot of sense. Computing the delta contribution of each instruction should enable longer sequences and more precise predictions. I guess this could also be applied to identifying performance bottlenecks, e.g., building a critical path.