mrob27 commented 4 years ago

br instructions have an argument telling what label(s) they are branching to, and phi instructions have arguments telling what labels they might have arrived from, but there is no way to match up these labels with the blocks they belong to.

I am trying to do control-flow analysis, including determining how many different paths are possible through any given block, and which blocks are located inside loops. Code sample:

int sample(int arg) {
   for(int i=0; i<10; i++) {
      arg = (2*arg + 3*i) % 71;
   }
   return arg;
}

Compile this with -O1 and generate LL output, then use llvm.parse_assembly(). In the i32 @sample(i32) function there will be at least three blocks and a phi statement that "comes from" two of the blocks. Task: write a Python program that uses llvmlite to answer the question: "Which two blocks are the ones that might branch to that phi statement?"

Disclaimer: I don't know how to do this within the "real" LLVM API, and I am a new to Python3 programming. I don't see the answer in the (rather limited) documentation of the ValueRef class.

I am using Python 3.6.8, llvmlite 0.33.0, and clang 9.0.1 on a CentOS system. I built the clang from source but I don't know how to build python3 or llvmlite.

esc commented 4 years ago

@mrob27 thanks for submitting this! I have labelled it as a question for now.

stuartarchibald commented 4 years ago

I don't think that there's an API exposed in llvmlite to obtain the CFG as a structure for further parsing/manipulation, though it's possible to view the CFG like this:

Take your code sample and do e.g. clang -O1 -S -emit-llvm llvmlite_603.c -o llvmlite_603.ll to get some LLVM IR as text.

Do this:


from llvmlite import binding as ll
from ctypes import CFUNCTYPE, c_int

import llvmlite.binding as llvm

with open('llvmlite_603.ll') as f: llvm_ir = f.read()

All these initializations are required for code generation!

llvm.initialize() llvm.initialize_native_target() llvm.initialize_native_asmprinter() # yes, even this one

def create_execution_engine(): """ Create an ExecutionEngine suitable for JIT code generation on the host CPU. The engine is reusable for an arbitrary number of modules. """

Create a target machine representing the host

target = llvm.Target.from_default_triple()
target_machine = target.create_target_machine()
# And an execution engine with an empty backing module
backing_mod = llvm.parse_assembly("")
engine = llvm.create_mcjit_compiler(backing_mod, target_machine)
return engine

def compile_ir(engine, llvm_ir): """ Compile the LLVM IR string with the given engine. The compiled module object is returned. """

Create a LLVM module object from the IR

mod = llvm.parse_assembly(llvm_ir)
mod.verify()
# Now add the module and make sure it is ready for execution
engine.add_module(mod)
engine.finalize_object()
engine.run_static_constructors()
return mod

engine = create_execution_engine() mod = compile_ir(engine, llvm_ir)

fname = "sample" func_ptr = engine.get_function_address(fname)

Run the function via ctypes

cfunc = CFUNCTYPE(c_int, c_int)(func_ptr) res = cfunc(109) print("%s(...) =" % fname, res)

fn = mod.get_function(fname) CFG = ll.get_function_cfg(fn) ll.view_dot_graph(CFG).view()



RE:
> I am using Python 3.6.8, llvmlite 0.33.0, and clang 9.0.1 on a CentOS system. I built the clang from source but I don't know how to build python3 or llvmlite.

Installation options are highlighted here: http://llvmlite.pydata.org/en/latest/admin-guide/install.html#installation

Hope this helps?

mrob27 commented 4 years ago

Thanks for the extensive reply -- but it is not answering my question... but I think perhaps the approach you're suggesting wouldn't work.

Since I don't have graphics ability (I am working through a text-only connection) the llvmlite.binding.view_dot_graph(graph,...) method doesn't do anything. But looking at your program, it appears that it is executing the function with an argument (in this case 109). Though that value works with the sample function I gave, in general my program will not know what argument values might be typical or valid. I must treat the function as a black box. Some functions will have very slow execution times, and some may have multiple branches that are executed only in certain cases. Dynamic analysis is not an option, I need to use static analysis.

Returning to my question, I have a program that accepts an LL file as input; it uses parse_assembly, which gives a ValueRef that is a module; it iterates over its functions; for each function it iterates over its blocks; and for each block it can look at the first instruction (seeing if its opcode is "phi") and if that is the case then I want it to be able to figure out which of the function's blocks are the ones that might branch to that phi. I think labels are the way to do that, as that's what is in the actual LL file. Thus, the question was asking how to find out the labels of each block and how to find out the label(s) to which a "br" instruction branches.

robertmuth commented 4 years ago

I am in the same boat. Could the "name" field of ValueRef be repurposed to hold the block label?

robertmuth commented 4 years ago

some of the examples actually suggest that BBL name contains something meaningful but in my experiments it always returns ""

https://github.com/numba/llvmlite/blob/1f25fa723e419e77c325a135e10acfb9b6112e8f/examples/llvmir_iter.py#L39

lwerdna commented 2 years ago

I also want this feature.

Until then, you can parse the first line of the string representation of a block:

def get_block_label(block):
    lines = str(block).split('\n')
    try:
        i = 1 if lines[0] == '' else 0
        if m := re.match(r'^(\d+):', lines[i]):
            return m.group(1)
    except:
        return None

numba / llvmlite

A way to discover the label of a block (in ValueRef) #603

All these initializations are required for code generation!

Create a target machine representing the host

Create a LLVM module object from the IR

Run the function via ctypes