radareorg / radare2-r2pipe

Access radare2 via pipe from any programming language!

Running r2pipe Python in batch #123

Closed by zavalyshyn 3 years ago

zavalyshyn commented 3 years ago

Describe the issue

I'm using r2pipe to extract call graph information from every binary in a given folder. For each binary, I open it, run the "aaa" command, and then extract the call graph in r2 commands format with the "agC*" command. There is no specific bug per se: r2pipe works as intended, but it takes quite a long time to run through all the binaries.

I've checked the examples folder for how to use r2pipe in batch, but the code there is somewhat simplified. What would you suggest to improve the runtime? For instance, do I really need to quit r2 after each file?

How to reproduce?

Here is my code:

import hashlib
import os
from multiprocessing import Pool

import r2pipe

binaries_dir = "binaries"  # placeholder: the original snippet does not show where this is set
binaries_list = os.listdir(binaries_dir)
batchsize = 1000  # process files in batches of 1000
total_count = len(binaries_list)
hash_db = set()  # call graph hashes seen so far (not defined in the original snippet)

def parseglobalcallgraph(filename):
    filepath = os.path.join(binaries_dir, filename)
    r2 = r2pipe.open(filepath, ["-e io.cache=true"])
    r2.cmd('aaa')  # run r2's auto-analysis on the binary
    gcg = r2.cmd("agC*")  # extract global call graph in r2 commands format
    r2.quit()
    hash_value = hashlib.md5(gcg.encode()).hexdigest()
    return {'hash': hash_value, 'filename': filename}

if __name__ == "__main__":
    for i in range(0, len(binaries_list), batchsize):
        batch = binaries_list[i:i+batchsize]
        with Pool(processes=10) as pool:
            # print each call graph hash the first time it is seen
            for res in pool.imap(parseglobalcallgraph, batch):
                if res['hash'] not in hash_db:
                    hash_db.add(res['hash'])
                    print(res['hash'])

Expected behavior

I'd expect it to be much faster, but it seems like I'm missing something.

trufae commented 3 years ago

r2pipe is slow, partly because of Python and partly because of the way it reads data from the pipe. You can use the native r2pipe by prefixing the filepath with ccall://, so that it uses dlopen(r_core) and makes direct C API calls. That will make the script at least 10 times faster.

You can help by improving the r2pipe module and profiling that issue. Other languages don't have this problem.
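
For reference, a minimal sketch of what the ccall:// suggestion could look like when applied to the function from the question. The helper names `native_uri` and `callgraph_hash` and the `opener` parameter are made up for illustration; in real use `opener` would be `r2pipe.open`. It assumes the r_core shared library can be found at runtime, and it passes no extra flags because it is unclear whether they apply in native mode.

```python
import hashlib


def native_uri(filepath):
    """Prefix the path with ccall:// so r2pipe uses direct C API calls."""
    return "ccall://" + filepath


def callgraph_hash(filepath, opener):
    """Return the MD5 of the global call graph, as in the question.

    `opener` is a hypothetical parameter: pass r2pipe.open in real use.
    """
    r2 = opener(native_uri(filepath))
    try:
        r2.cmd("aaa")          # run r2's auto-analysis on the binary
        gcg = r2.cmd("agC*")   # global call graph in r2 commands format
    finally:
        r2.quit()              # always release the session
    return hashlib.md5(gcg.encode()).hexdigest()
```

Usage would be something like `callgraph_hash(filepath, r2pipe.open)` inside the worker function. Taking the opener as an argument keeps the sketch easy to try out with a stub before pointing it at real binaries.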

zavalyshyn commented 3 years ago

Many thanks! I didn't know you could do that with prefixes.