google / gematria

Machine learning for machine code.
Apache License 2.0

how to generate llvm_mnemonic and memory alias group ID? #124

Open ronghongbo opened 2 weeks ago

ronghongbo commented 2 weeks ago

Hello, for a sequence of instructions executed on an emulator, how can I generate the right llvm_mnemonic values and memory alias group IDs? In the emulator, I can see the opcode, operands, and memory locations accessed.

Also, can we get rid of llvm_mnemonic entirely? The information it carries should already be expressed by the opcode and operands. In basic_block.cc, Instruction::AddTokensToList() does not seem to treat llvm_mnemonic as a token either, so getting rid of it should not affect the sonnet.Embed layer, I guess?

Thanks! Hongbo

boomanaiden154 commented 1 week ago

If you have the opcode, you should be able to get the LLVM mnemonic through the LLVM APIs. I'm not sure that would be a supported use case upstream, though, given that we don't have access to any (accurate) emulators out in the open.

It seems like we can remove llvm_mnemonic. I can't find anywhere it's actually used in the code, and there is this comment:

// The LLVM mnemonic of the instruction. Note that the LLVM mnemonics tend to
// change with LLVM versions, and we do not recommend using it in models.

@ondrasej would have more context. Based on a quick glance, I would estimate that the removal would be pretty mechanical, but it would require touching quite a few files, as llvm_mnemonic is fairly pervasive.
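
On the memory alias group ID part of the original question, here is a minimal sketch of one possible way to derive IDs from the memory locations an emulator reports. It is not taken from the Gematria code base. It assumes that each access is available as an (address, size) pair and that the only requirement is that accesses which may touch the same bytes share an ID. The function name `assign_alias_group_ids` and the choice to number groups from 1 are hypothetical. Note that a trace only shows the overlaps observed in that particular run, so the result is an approximation of "may alias".

```python
from typing import List, Tuple


def assign_alias_group_ids(accesses: List[Tuple[int, int]]) -> List[int]:
    """Assigns one group ID per access (hypothetical helper, not a Gematria API).

    accesses: (address, size_in_bytes) pairs in trace order.
    Accesses whose byte ranges overlap, directly or transitively, get the
    same ID. IDs are dense and start at 1 (an arbitrary choice).
    """
    parent = list(range(len(accesses)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Sweep over the accesses sorted by start address and merge every access
    # that begins before the end of the current cluster of overlapping ranges.
    order = sorted(range(len(accesses)), key=lambda i: accesses[i][0])
    cluster_rep = None
    cluster_end = None
    for i in order:
        start, size = accesses[i]
        if cluster_end is not None and start < cluster_end:
            parent[find(i)] = find(cluster_rep)
            cluster_end = max(cluster_end, start + size)
        else:
            cluster_rep = i
            cluster_end = start + size

    # Number the groups in order of first appearance in the trace.
    ids = {}
    result = []
    for i in range(len(accesses)):
        root = find(i)
        if root not in ids:
            ids[root] = len(ids) + 1
        result.append(ids[root])
    return result


if __name__ == "__main__":
    trace = [(0x1000, 8), (0x2000, 4), (0x1004, 4), (0x2004, 4)]
    print(assign_alias_group_ids(trace))
```

In this example the first and third accesses overlap and share a group, while the two accesses at 0x2000 and 0x2004 are adjacent but disjoint and end up in separate groups. A more conservative variant could merge accesses that fall on the same cache line or page.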