Closed · ronghongbo closed this issue 2 months ago
It's not correct to say that inverse_throughput_cycles refers to CPI. It refers to the number of cycles needed per execution of the block (the reciprocal of throughput), measured in steady state. I think the BHive/uiCA papers have the exact formula (essentially two measurements, one twice as long as the other, with the difference taken to remove startup effects).
The definition in the code could probably use a bit of improvement, though.
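Roughly, the differential measurement works like this (a minimal sketch only; measure_cycles is a hypothetical stand-in for an actual benchmarking harness, not something from this repository):

def inverse_throughput_cycles(block_code: bytes, measure_cycles, unroll: int = 100) -> float:
  """Estimates steady-state cycles per one execution of the block.

  Times the block unrolled `unroll` times and `2 * unroll` times; the
  difference cancels the fixed startup/teardown cost, leaving only the
  steady-state cost of the `unroll` extra copies of the block.
  """
  cycles_short = measure_cycles(block_code * unroll)
  cycles_long = measure_cycles(block_code * (2 * unroll))
  return (cycles_long - cycles_short) / unroll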
Not sure what prefix_inverse_throughput_cycles is used for. I believe it is referring to a previous block; @ondrasej will have more context on that. To me it kind of seems like support for an experiment at some point that didn't end up panning out (it doesn't show up in any of the importers), but I could be wrong there.
I see. Thanks, Aiden!
I'm sorry for the late response.
I do not understand the description of prefix_inverse_throughput_cycles in throughput.py: "The number of cycles of inverse throughput of the prefixes of the basic block." There can be prefixes like REP for instructions, but what are the prefixes of a basic block?
The basic block prefixes have nothing to do with instruction prefixes (REP, LOCK, ...). At one point, we tested the idea that the models would be easier to train if we had per-instruction contributions for each basic block and could use them all at the same time during the training process. So, instead of measuring only
lea eax, [rdi-1]
imul eax, edi
shr eax
and training on the block as a whole, we'd also measure
lea eax, [rdi-1]
imul eax, edi
and
lea eax, [rdi-1]
and take the increment from N instructions to N+1 as the inverse throughput contributed by the one additional instruction. We'd then train the model to predict the inverse throughput as a sum of these increments.
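In rough pseudocode, the per-instruction increments would be computed along these lines (sketch only; measure_inverse_throughput is a placeholder for the real benchmarking pipeline):

def per_instruction_increments(instructions, measure_inverse_throughput):
  """Returns the inverse-throughput delta contributed by each instruction."""
  increments = []
  previous = 0.0
  for end in range(1, len(instructions) + 1):
    prefix = instructions[:end]  # The first `end` instructions of the block.
    current = measure_inverse_throughput(prefix)
    increments.append(current - previous)  # Cost added by instructions[end - 1].
    previous = current
  return increments

# E.g. for the block above: per_instruction_increments(
#     ["lea eax, [rdi-1]", "imul eax, edi", "shr eax"], measure_inverse_throughput)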
This was more interesting for the LSTM-based (Ithemal-like) models, because in theory it could allow them to generalize to much longer sequences of instructions than they can handle now. But in the end, it didn't bring the expected improvements: in practice, models trained only with the sum performed just as well and were faster to train. We didn't remove the code, though, so it's still there for people to experiment with.
For the graph models (GRANITE), this approach (computing per-instruction increments, but training only with their sum) also proved to work best.
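In other words, the model still emits one contribution per instruction, but the loss only ever sees their sum; schematically (not the actual GRANITE training code):

def block_level_loss(per_instruction_contributions, measured_inverse_throughput):
  """Squared error between the sum of per-instruction predictions and the block label."""
  predicted = sum(per_instruction_contributions)
  return (predicted - measured_inverse_throughput) ** 2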
Btw, I've opened the Discussions section here in the repository. Feel free to use it for any other questions.
Yes, this makes a lot of sense. Computing the delta contribution of each instruction should enable longer sequences and more precise predictions. I guess this might also be applied to identifying performance bottlenecks, e.g. by building a critical path.
Hello, is it accurate to say that inverse_throughput_cycles refers to CPI (cycles per instruction)?
I do not understand the description of prefix_inverse_throughput_cycles in throughput.py: "The number of cycles of inverse throughput of the prefixes of the basic block." There can be prefixes like REP for instructions, but what are the prefixes of a basic block?
Thanks! Hongbo