spcl / ncc

Neural Code Comprehension: A Learnable Representation of Code Semantics
BSD 3-Clause "New" or "Revised" License

llvm ir of linux kernel #15

Closed Baumanar closed 5 years ago

Baumanar commented 5 years ago

Hello, and thanks for your interesting paper and for releasing its code. This issue is similar to #1. I'm currently working on ways to generate the LLVM IR files of the Linux kernel. I compiled the kernel using Clang, and then used a Python script I found on GitHub (https://github.com/ClangBuiltLinux/linux/blob/master/scripts/gen_compile_commands.py) that parses the .cmd files generated during compilation and produces a JSON file of the Clang commands, with the correct linkers, that were run to compile the kernel. Here is an example of one of these commands:

/usr/bin/clang-9 -Wp,-MD,fs/.pnode.o.d -nostdinc -isystem /usr/lib/llvm-9/lib/clang/9.0.0/include -I./arch/x86/include -I./arch/x86/include/generated -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h -D__KERNEL__ -Qunused-arguments -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -fshort-wchar -Werror-implicit-function-declaration -Wno-format-security -std=gnu89 -Wno-unused-variable -Wno-format-invalid-specifier -Wno-gnu -Wno-address-of-packed-member -Wno-tautological-compare -mno-global-merge -no-integrated-as -fno-PIE -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -m64 -mno-80387 -mstack-alignment=8 -mtune=generic -mno-red-zone -mcmodel=kernel -funit-at-a-time -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -DCONFIG_AS_CFI_SECTIONS=1 -DCONFIG_AS_FXSAVEQ=1 -DCONFIG_AS_SSSE3=1 -DCONFIG_AS_CRC32=1 -DCONFIG_AS_AVX=1 -DCONFIG_AS_AVX2=1 -DCONFIG_AS_AVX512=1 -DCONFIG_AS_SHA1_NI=1 -DCONFIG_AS_SHA256_NI=1 -pipe -Wno-sign-compare -fno-asynchronous-unwind-tables -fno-delete-null-pointer-checks -O2 --param=allow-store-data-races=0 -DCC_HAVE_ASM_GOTO -Wframe-larger-than=2048 -fno-stack-protector -fomit-frame-pointer -Wdeclaration-after-statement -Wno-pointer-sign -fno-strict-overflow -Werror=implicit-int -Werror=strict-prototypes -Werror=date-time -Werror=incompatible-pointer-types -Wno-initializer-overrides -Wno-unused-value -Wno-format -Wno-sign-compare -Wno-format-zero-length -Wno-uninitialized -DKBUILD_BASENAME='\"pnode\"' -DKBUILD_MODNAME='\"pnode\"' -c -o fs/pnode.o fs/pnode.c

To build the LLVM IR files (.ll), I replaced the end of the command, -c -o fs/pnode.o fs/pnode.c, with -S -emit-llvm fs/pnode.c -o llvm-ir/fs_pnode.ll. This way I managed to create 2334 LLVM IR files with version 4.15.1 of the Linux kernel.
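The replacement described above could be scripted roughly as follows. This is only a sketch, not the script actually used here: the function names `to_llvm_ir_args` and `emit_all` are made up for illustration, and it assumes that every entry in the JSON file has `command` and `directory` keys and that every command ends with `-c -o <obj> <src>`, as in the example.

```python
import json
import shlex
import subprocess
from pathlib import Path


def to_llvm_ir_args(command, out_dir="llvm-ir"):
    """Rewrite '... -c -o <obj> <src>' so that clang emits textual LLVM IR.

    Hypothetical helper; assumes the command ends exactly with
    '-c -o <object file> <source file>'.
    """
    args = shlex.split(command)
    if len(args) < 5 or args[-4:-2] != ["-c", "-o"]:
        raise ValueError("command does not end with '-c -o <obj> <src>'")
    src = args[-1]
    # fs/pnode.c -> fs_pnode.ll (path components joined with underscores)
    ll_name = "_".join(Path(src).with_suffix(".ll").parts)
    return args[:-4] + ["-S", "-emit-llvm", src, "-o", str(Path(out_dir) / ll_name)]


def emit_all(compile_commands_path, out_dir="llvm-ir"):
    """Run the rewritten command for every entry; return how many succeeded."""
    entries = json.loads(Path(compile_commands_path).read_text())
    succeeded = 0
    for entry in entries:
        try:
            args = to_llvm_ir_args(entry["command"], out_dir)
        except ValueError:
            continue  # skip commands that do not match the assumed shape
        # Relative paths in the command are resolved against the build directory.
        (Path(entry["directory"]) / out_dir).mkdir(parents=True, exist_ok=True)
        if subprocess.run(args, cwd=entry["directory"]).returncode == 0:
            succeeded += 1
    return succeeded
```

Calling `emit_all("compile_commands.json")` would then return the number of commands that ran successfully, which can be compared with the number of .ll files produced.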

My questions are:

tbennun commented 5 years ago

The compilation to LLVM IR is project-specific, due to makefiles and compiler compatibility. The way we generated Linux kernel LLVM IR files was very similar to yours - we switched the compiler to clang, and when building the kernel we logged all commands (verbose mode). Following that, we swapped every -o <file.o> with -S -emit-llvm -o <file.ll> in clang build commands.
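As a rough illustration of that swap on logged build lines (a sketch only, not the script we actually used; `swap_object_output` and `rewrite_log_line` are hypothetical names, and a line is treated as a clang command simply when its first token's basename starts with `clang`):

```python
import os
import shlex


def swap_object_output(args):
    """Replace every '-o <file.o>' pair with '-S -emit-llvm -o <file.ll>'.

    All other arguments are left untouched.
    """
    out = []
    i = 0
    while i < len(args):
        if args[i] == "-o" and i + 1 < len(args) and args[i + 1].endswith(".o"):
            out += ["-S", "-emit-llvm", "-o", args[i + 1][:-2] + ".ll"]
            i += 2
        else:
            out.append(args[i])
            i += 1
    return out


def rewrite_log_line(line):
    """Return the rewritten argument list, or None if the line is not a clang command."""
    try:
        args = shlex.split(line)
    except ValueError:
        return None  # unbalanced quotes: not a plain shell command line
    if not args or not os.path.basename(args[0]).startswith("clang"):
        return None
    return swap_object_output(args)
```

Lines that are not clang invocations, and `-o` targets that are not .o files, pass through unchanged, so the same filter can be run over a whole verbose build log.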

I do not know exactly why you are able to generate more LLVM IR files. Perhaps the script you found is better at detecting build commands or, more likely in my opinion, your kernel configuration included additional kernel modules that we did not compile.