Open williballenthin opened 6 months ago
on first brief glance...
39a91796fafe9d2efc2cea0de239179a3a2d406ea482af310710e6f5fed00083 hangs early:
...
DEBUG:viv_utils.flirt:found library function: 0x10481ff0: ?
DEBUG:viv_utils.flirt:found library function: 0x10482000: ?
DEBUG:viv_utils.flirt:found library function: 0x1049ae50: ?
and it's similar for 359f1f07a9d037c5d4ab95e56285d46c0c106a970235bbbcacdf06851626fabd
39a91796fafe9d2efc2cea0de239179a3a2d406ea482af310710e6f5fed00083 avfilter-7.dll
Size 6.77 MB
like @mr-tz mentioned, loading the workspace is taking a long time:
stuck here for seconds/minutes.
note that this is not a dedicated FLIRT matching phase; FLIRT matching happens while the workspace is loaded, and the stack trace below shows it's not an issue with python-flirt.
CPU is pegged and RAM is growing:
stack trace at time of kill:
^CTraceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "...capa/capa/main.py", line 965, in <module>
sys.exit(main())
^^^^^^
File "...capa/capa/main.py", line 852, in main
extractor = get_extractor_from_cli(args, input_format, backend)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "...capa/capa/main.py", line 755, in get_extractor_from_cli
return capa.loader.get_extractor(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "...capa/capa/loader.py", line 254, in get_extractor
vw = get_workspace(input_path, input_format, sigpaths)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "...capa/capa/loader.py", line 160, in get_workspace
vw.analyze()
File "...capa/.direnv/python-3.11/lib/python3.11/site-packages/vivisect/__init__.py", line 819, in analyze
mod.analyze(self)
File "...capa/.direnv/python-3.11/lib/python3.11/site-packages/vivisect/analysis/generic/relocations.py", line 18, in analyze
vw.makePointer(va, follow=True)
File "...capa/.direnv/python-3.11/lib/python3.11/site-packages/vivisect/__init__.py", line 2107, in makePointer
self.followPointer(tova)
File "...capa/.direnv/python-3.11/lib/python3.11/site-packages/vivisect/__init__.py", line 780, in followPointer
self.makeFunction(va, arch=arch)
File "...capa/.direnv/python-3.11/lib/python3.11/site-packages/vivisect/__init__.py", line 1552, in makeFunction
realfva = self.cfctx.addEntryPoint(va, arch=arch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "...capa/.direnv/python-3.11/lib/python3.11/site-packages/envi/codeflow.py", line 294, in addEntryPoint
self._cb_function(va, {'CallsFrom': calls_from})
File "...capa/.direnv/python-3.11/lib/python3.11/site-packages/vivisect/base.py", line 819, in _cb_function
vw.analyzeFunction(fva)
File "...capa/.direnv/python-3.11/lib/python3.11/site-packages/vivisect/__init__.py", line 832, in analyzeFunction
fmod.analyzeFunction(self, fva)
File "...capa/.direnv/python-3.11/lib/python3.11/site-packages/vivisect/analysis/i386/calling.py", line 137, in analyzeFunction
emu.runFunction(fva, maxhit=1)
File "...capa/.direnv/python-3.11/lib/python3.11/site-packages/vivisect/impemu/emulator.py", line 491, in runFunction
self.executeOpcode(op)
File "...capa/.direnv/python-3.11/lib/python3.11/site-packages/envi/archs/i386/emu.py", line 255, in executeOpcode
newpc = meth(op)
^^^^^^^^
File "...capa/.direnv/python-3.11/lib/python3.11/site-packages/envi/archs/i386/emu.py", line 722, in i_call
self.doPush(saved)
File "...capa/.direnv/python-3.11/lib/python3.11/site-packages/envi/archs/i386/emu.py", line 407, in doPush
esp = self.getRegister(REG_ESP)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "...capa/.direnv/python-3.11/lib/python3.11/site-packages/vivisect/impemu/platarch/i386.py", line 28, in getRegister
rval = value = e_i386.IntelEmulator.getRegister(self, index)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "...capa/.direnv/python-3.11/lib/python3.11/site-packages/envi/registers.py", line 295, in getRegister
def getRegister(self, index):
It looks to me like viv is taking a really long time to analyze this sample. If there are MBs of code, then this is a reasonable outcome.
Binary Ninja takes 208 seconds to find 12,344 functions over 0x4C0E00 code (about 4.9MB, a lot).
takeaways:
https://www.virustotal.com/gui/file/a0ca23f56230fc857f1246a5f8e9cb4742e90ce78122f7393de00a017028cbbd DaVinci_Deluxe.exe Size 15.74 MB (!!!)
loads pretty quickly in Binary Ninja, but there are only two local functions.
size of code is 0x583000, which is very large.
the two huge sections have entropy 8, so this seems mostly encrypted:
all sections are RWX:
so in summary, there's almost nothing usable here, but viv probably thinks it needs to disassemble 10MB or more.
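the "entropy 8" observation above can be reproduced with a few lines of Python. this is only a minimal sketch: the function names and the 7.2 cutoff are hypothetical choices for illustration, not anything capa or viv currently does.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0.0 (constant data) to 8.0 (uniform bytes)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_packed(data: bytes, threshold: float = 7.2) -> bool:
    """Hypothetical heuristic: flag section contents that are close to random."""
    return shannon_entropy(data) >= threshold
```

a loader could compute this per section and skip (or warn about) code sections that are almost certainly encrypted or compressed, rather than trying to disassemble them.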
takeaways:
https://www.virustotal.com/gui/file/a4f906f671f02b2cec47a8706e8b042f3cea0739dad15f24b92449a932203972 amd64 ELF for Android Size 729.45 KB
Binary Ninja loads in about 5 seconds. 408 functions, although I think a lot of analysis is missing.
viv is taking a long time to load the workspace:
...and mem:
initially spends a lot of time (many seconds) running cxxfilt to demangle names, but during this time, CPU/mem usage is low:
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "capa/capa/main.py", line 965, in <module>
sys.exit(main())
^^^^^^
File "capa/capa/main.py", line 852, in main
extractor = get_extractor_from_cli(args, input_format, backend)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "capa/capa/main.py", line 755, in get_extractor_from_cli
return capa.loader.get_extractor(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "capa/capa/loader.py", line 254, in get_extractor
vw = get_workspace(input_path, input_format, sigpaths)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "capa/capa/loader.py", line 149, in get_workspace
vw = viv_utils.getWorkspace(str(path), analyze=False, should_save=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "capa/.direnv/python-3.11/lib/python3.11/site-packages/viv_utils/__init__.py", line 117, in getWorkspace
vw.loadFromFile(fp)
File "capa/.direnv/python-3.11/lib/python3.11/site-packages/vivisect/__init__.py", line 2824, in loadFromFile
fname = mod.parseFile(self, filename=filename, baseaddr=baseaddr)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "capa/.direnv/python-3.11/lib/python3.11/site-packages/vivisect/parsers/elf.py", line 32, in parseFile
return loadElfIntoWorkspace(vw, elf, filename=filename, baseaddr=baseaddr)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "capa/.direnv/python-3.11/lib/python3.11/site-packages/vivisect/parsers/elf.py", line 494, in loadElfIntoWorkspace
postfix = applyRelocs(elf, vw, addbase, baseoff)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "capa/.direnv/python-3.11/lib/python3.11/site-packages/vivisect/parsers/elf.py", line 728, in applyRelocs
dmglname = demangle(name)
^^^^^^^^^^^^^^
File "capa/.direnv/python-3.11/lib/python3.11/site-packages/vivisect/parsers/elf.py", line 973, in demangle
import cxxfilt
File "capa/.direnv/python-3.11/lib/python3.11/site-packages/cxxfilt/__init__.py", line 39, in <module>
libc = ctypes.CDLL(find_any_library('c'))
^^^^^^^^^^^^^^^^^^^^^
File "capa/.direnv/python-3.11/lib/python3.11/site-packages/cxxfilt/__init__.py", line 33, in find_any_library
lib = ctypes.util.find_library(choice)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nix/store/s31jwk4jsiqczzkrd8rcnjrhiyk2z4kf-devshell-dir/lib/python3.11/ctypes/util.py", line 257, in find_library
_get_soname(_findLib_gcc(name)) or _get_soname(_findLib_ld(name))
^^^^^^^^^^^^^^^^^
File "/nix/store/s31jwk4jsiqczzkrd8rcnjrhiyk2z4kf-devshell-dir/lib/python3.11/ctypes/util.py", line 241, in _findLib_ld
out, _ = p.communicate()
^^^^^^^^^^^^^^^
File "/nix/store/s31jwk4jsiqczzkrd8rcnjrhiyk2z4kf-devshell-dir/lib/python3.11/subprocess.py", line 1207, in communicate
stdout, stderr = self._communicate(input, endtime, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nix/store/s31jwk4jsiqczzkrd8rcnjrhiyk2z4kf-devshell-dir/lib/python3.11/subprocess.py", line 2075, in _communicate
ready = selector.select(timeout)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/nix/store/s31jwk4jsiqczzkrd8rcnjrhiyk2z4kf-devshell-dir/lib/python3.11/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt
when viv is allocating all that memory (which spikes up and down, up to around 100GB at least), the program doesn't respond to ctrl-c, so i don't have a stack trace yet.
can use py-spy to show the stack trace at this point:
so it seems symboliks is taking a lot of memory?
after following the stacktrace a bit, it seems that there are either very complex or very many symbolic expressions being tracked, and this eats time and CPU.
if this is a prevalent bug, then we can look into disabling symboliks. or, we can rely on the user/system to kill capa when it takes too many resources. i don't immediately see any tricks to guessing this will happen.
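one way to "rely on the system" without waiting for memory to reach 100GB is an opt-in cap on the process address space. the sketch below assumes a Unix-like system (Python's `resource` module is not available on Windows, and enforcement of `RLIMIT_AS` varies by platform); the helper names are hypothetical and this is not something capa does today.

```python
import resource  # Unix-only standard library module


def clamp_soft_limit(requested: int, hard: int) -> int:
    """The soft limit may not exceed the hard limit, unless the hard limit is unlimited."""
    if hard == resource.RLIM_INFINITY:
        return requested
    return min(requested, hard)


def limit_address_space(max_bytes: int) -> int:
    """Cap this process's virtual address space; returns the soft limit that was applied."""
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    soft = clamp_soft_limit(max_bytes, hard)
    # only the soft limit is changed, so it can be raised again later
    resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
    return soft
```

calling something like `limit_address_space(8 * 1024**3)` before loading the workspace should make oversized allocations fail (typically as a `MemoryError`) instead of letting the process grow without bound.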
looks like it's: analyzeFunction (vivisect/analysis/generic/symswitchcase.py:1251)
which is here: https://github.com/vivisect/vivisect/blob/9534f164954bd417767b6a5ac0a6185fd16ed942/vivisect/analysis/generic/symswitchcase.py#L374
looks like this is enabled for ELF, but not PE: https://github.com/vivisect/vivisect/blob/9534f164954bd417767b6a5ac0a6185fd16ed942/vivisect/analysis/__init__.py#L136
which we could disable with delFuncAnalysisModule: https://github.com/vivisect/vivisect/blob/9534f164954bd417767b6a5ac0a6185fd16ed942/vivisect/__init__.py#L581
when this is disabled, analysis completes in a reasonable amount of time.
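a minimal sketch of what disabling it could look like, assuming the module is removed after the file is loaded but before `vw.analyze()` runs. `viv_utils.getWorkspace(..., analyze=False, should_save=False)` and `delFuncAnalysisModule` appear in the traces and links above; the two helper names are hypothetical, and the assumption that viv raises a plain `Exception` for an unregistered module should be checked against the viv version in use.

```python
# name of the function analysis module discussed above
SYMSWITCHCASE = "vivisect.analysis.generic.symswitchcase"


def disable_func_analysis_module(vw, modname=SYMSWITCHCASE):
    """
    Try to unregister a function analysis module from a viv workspace.

    Returns True if it was removed, False if the workspace rejected the
    request (for example, a PE workspace where the module isn't registered).
    """
    try:
        vw.delFuncAnalysisModule(modname)
    except Exception:
        # assumption: viv signals an unknown module with a plain Exception
        return False
    return True


def get_workspace_without_symswitchcase(path):
    """Hypothetical helper: load the file, drop the module, then analyze."""
    import viv_utils  # imported lazily so the helper above has no dependencies

    vw = viv_utils.getWorkspace(str(path), analyze=False, should_save=False)
    disable_func_analysis_module(vw)
    vw.analyze()
    return vw
```

the trade-off is possibly worse switch-case recovery in ELF files in exchange for avoiding the worst-case time and memory behavior seen here.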
takeaways:
vivisect.analysis.generic.symswitchcase function analysis module
https://www.virustotal.com/gui/file/a1c3dcb87b243005ed3bb2b88998adfb54b2cba01d92b401afd99f2027b7ef1e 64-bit DLL Size 447.62 KB
Binary Ninja takes only a few seconds to load.
no imports or exports. section names seem weird (after .reloc). i'm guessing this is a corrupt PE.
oh look at this section:
that's about 900 MB. and note that the subsequent sections overlap, so it's definitely corrupt. and, if a naive PE loader tries to map this, it will create that 900MB section.
sure enough capa tries to allocate a large amount of memory:
takeaways:
https://www.virustotal.com/gui/file/359f1f07a9d037c5d4ab95e56285d46c0c106a970235bbbcacdf06851626fabd Size 92.00 KB 32-bit EXE
there's a weird initial section that (1) overlaps and is therefore invalid, and (2) is huge (1.4GB).
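both of the corrupt samples above share the same two symptoms: a section with an enormous virtual size, and sections whose address ranges overlap. here is a rough sketch of a pre-load sanity check for that. the `Section` type, the function name, and the 256 MB threshold are all hypothetical; in practice the address and size values would come from whatever PE parser is in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    name: str
    virtual_address: int
    virtual_size: int

    @property
    def end(self) -> int:
        return self.virtual_address + self.virtual_size


# hypothetical threshold: anything larger is treated as suspicious
MAX_SECTION_SIZE = 256 * 1024 * 1024


def find_suspicious_sections(sections, max_size=MAX_SECTION_SIZE):
    """Return human-readable reasons to distrust the section table (empty if none)."""
    problems = []
    for s in sections:
        if s.virtual_size > max_size:
            problems.append(f"{s.name}: virtual size {s.virtual_size:#x} exceeds {max_size:#x}")

    # walk sections by start address, remembering the one that reaches furthest,
    # so that a huge section is compared against everything that starts inside it
    furthest = None
    for cur in sorted(sections, key=lambda s: s.virtual_address):
        if furthest is not None and cur.virtual_address < furthest.end:
            problems.append(f"{cur.name}: overlaps {furthest.name}")
        if furthest is None or cur.end > furthest.end:
            furthest = cur
    return problems
```

a loader could raise an exception (or require an opt-in flag) when this returns anything, rather than trying to map a 900MB or 1.4GB section.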
takeaways:
@williballenthin @mr-tz Hi maintainers,
I'm having a hard time using capa with relatively large binary files (a few MBs to a few tens of MBs) because it will take a very long time to analyze them, or even get stuck. I'm not sure if this is a problem or a feature, so I'm attaching my confusion here rather than a new issue.
Does capa have any tips or advances in analyzing large binaries?
We are aware of issues with analyzing large (and complex) binaries. Oftentimes the underlying analysis framework of the standalone tool takes a long time to process such samples. If you have the opportunity, you could try an alternative backend such as IDA, Ghidra, or Binary Ninja. Of course, these can also take a while when processing very large samples.
If you're able to share a few samples we could also take a look at the details of the issues you're encountering.
@mr-tz Thanks for your explanation very much! I will try the alternative solution. (Fortunately, large samples make up only a small part of our test set)
By the way, may I ask about the time or computational complexity of capa, given a sample with file size M and number of functions N (or does capa work at a smaller granularity)?
@QGrain a few more details:
In #1950 we are developing a way for capa to use BinExport2 files, which are an intermediate representation of a disassembler's output. As mentioned, you could use IDA/Ghidra/Binary Ninja to produce the BinExport2 and then pass that to capa. Then you can see if it's the disassembler or the capa matching that is taking a long time. The default built-in disassembler within capa is pretty slow, especially for large programs.
The core work that capa does, matching rule logic against a program's features, is fairly optimized, as far as one can get within the bounds of Python. Its complexity is around O((#functions * #rules * A) + (#basic blocks * #rules * B) + (#instructions * #rules * C))
where A, B, and C are very small (0.01 or so). Ultimately, capa does do some work for every instruction in the program, and there's no getting around this.
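To make the shape of that estimate concrete, here is a small sketch that just evaluates the expression. The function name is hypothetical and the default constants are the rough "0.01 or so" figure quoted above, so the result is a relative unit of work, not a time in seconds.

```python
def estimate_match_work(n_rules, n_functions, n_basic_blocks, n_instructions,
                        a=0.01, b=0.01, c=0.01):
    """Evaluate (#functions * #rules * A) + (#basic blocks * #rules * B) + (#instructions * #rules * C)."""
    return n_rules * (a * n_functions + b * n_basic_blocks + c * n_instructions)
```

Because every term is multiplied by the rule count and the instruction count is usually the largest of the three, the estimate grows roughly linearly with program size for a fixed rule set.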
Got it. Thank you very much!
Investigate CPU and memory usage for the following samples. If it's something we're doing wrong, let's optimize that behavior. If it's an issue with viv or another dependency, perhaps we can introduce heuristics to detect difficult samples and bail early (opt-in).
consolidated takeaways:
vivisect.analysis.generic.symswitchcase function analysis module
in #1499 and #1500 we discuss adding a section scope and associated features. these could be used to match the first three points above. or, we could hardcode the logic into the viv workspace loader and have it raise an exception.