mandiant / capa

The FLARE team's open-source tool to identify capabilities in executable files.
https://mandiant.github.io/capa/
Apache License 2.0

python version memory leak #736

Closed SigmaStar closed 2 years ago

SigmaStar commented 3 years ago

Description

Repeated calls to capa.main.main(argv=args) cause a memory leak: gc.collect() reclaims most of the garbage, but roughly 10 MB of RSS memory remains unreleased after every call, so repeated runs eventually exhaust all physical memory.

Steps to Reproduce

Call capa.main.main() repeatedly in the same process.
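For example, something along these lines (a minimal sketch; the sample path, rules path, and exact CLI flags are placeholders for my local setup):

```python
# minimal reproduction sketch: call capa.main.main() repeatedly in one process
# and watch the process RSS (e.g. in Task Manager or top) climb with each run.
import capa.main

# placeholders: any PE sample and a local clone of the capa-rules repository
argv = ["-r", "C:/capa-rules", "C:/samples/sample.exe"]

for _ in range(100):
    capa.main.main(argv=argv)
```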

Expected behavior:

All memory allocated during a capa.main.main() call should be released after the call returns.

Actual behavior:

RSS memory consumption increases significantly with each call.

Versions

capa 2.0.0

Additional Information

Nope

williballenthin commented 3 years ago

thanks for reporting this @SigmaStar!

it will take a little bit of work to profile the memory usage and i think it's a worthwhile effort. we'll update here with the results of our triage and potential fixes.

williballenthin commented 3 years ago

recently when running the capa tests locally they've been getting OOM killed. i wonder if this is related.

[screenshot]

williballenthin commented 3 years ago

i'm unable to reproduce the memory leak, at least as recorded by tracemalloc:

[screenshot: tracemalloc output]

this output is generated by the script here: https://github.com/fireeye/capa/blob/master/scripts/profile-memory.py. it runs capa.main.main() repeatedly and then shows the memory usage right at the end of the program. the output indicates that running capa.main.main() once and ten times results in approximately the same memory usage.
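roughly, the measurement looks like this (a sketch of the approach, not the exact script; paths are placeholders):

```python
# sketch: trace allocations with tracemalloc, run capa.main.main() n times,
# force a gc sweep, and compare how much memory is still live for n=1 vs n=10.
import gc
import tracemalloc

import capa.main

def measure(n, argv):
    tracemalloc.start()
    for _ in range(n):
        capa.main.main(argv=argv)
    gc.collect()
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return current, peak

argv = ["-r", "/path/to/capa-rules", "/path/to/sample.exe"]  # placeholders
print("1 run: ", measure(1, argv))
print("10 runs:", measure(10, argv))
```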

@SigmaStar can you provide any additional detail that would help me triage this potential memory leak?

williballenthin commented 3 years ago

I suppose the profile runs above show the memory usage from a python perspective, while @SigmaStar records memory via OS/RSS numbers. if the memory is fragmented and cannot be returned to the OS, then RSS might continue to grow even though python-level usage stays flat. the outlook here would probably not be good, so let's find a way to (dis)prove this theory.

williballenthin commented 3 years ago

re-running a tweaked script that displays RSS and VMS along the way. 100 iterations. will have results in a bit...

[screenshot]
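the tweak is essentially this pattern (a sketch using psutil; the actual change may differ):

```python
# sketch: log OS-level RSS/VMS after each iteration, since tracemalloc only
# sees python-level allocations and misses fragmentation or native leaks.
import gc

import psutil

import capa.main

argv = ["-r", "/path/to/capa-rules", "/path/to/sample.exe"]  # placeholders
proc = psutil.Process()
for i in range(100):
    capa.main.main(argv=argv)
    gc.collect()
    mem = proc.memory_info()
    print(f"iter {i:3d}  rss={mem.rss >> 20} MiB  vms={mem.vms >> 20} MiB")
```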

SigmaStar commented 3 years ago

Yes, of course. I think some rules or signatures may lead to that memory leak. I tested my code on both Windows 10 1903 and Ubuntu, and on both platforms RSS memory keeps increasing. We use the public rules (they can be cloned from GitHub). Right now I use Python's subprocess module instead of calling capa.main directly, which forces the operating system to reclaim all the allocated memory. I notice that each capa.main run uses about 700 MB of memory, so ten iterations would eventually consume at least 6 GB.
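The workaround looks roughly like this (a sketch; the "capa" console-script invocation and the -j flag are assumptions about how capa is invoked, and the paths are placeholders):

```python
# workaround sketch: run capa in a child process so the OS reclaims all of its
# memory when the process exits, instead of calling capa.main.main() in-process.
# assumes the "capa" console script is on PATH; adjust the command line as needed.
import subprocess

RULES = "/path/to/capa-rules"  # placeholder

def run_capa(sample):
    result = subprocess.run(
        ["capa", "-r", RULES, "-j", sample],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout  # JSON report

for sample in ["sample1.exe", "sample2.exe"]:  # placeholder sample list
    report = run_capa(sample)
```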


SigmaStar commented 3 years ago

It's odd. My laptop has an i7-10875H CPU at 4.5 GHz. I specified the rules directory and each iteration takes approximately 5 minutes! So I think it must be some rules that prevent capa from releasing the allocated memory.


williballenthin commented 3 years ago

If you have a few minutes, can you try to reproduce the behavior of the profile-memory.py script?

Once this run of 100 is done, then I'll also try under Windows (though I don't expect the results to be substantially different).

williballenthin commented 3 years ago

You can also try to copy the memory profiling code into your harness and see if it highlights the lines causing the most allocations.

https://github.com/fireeye/capa/blob/33c3c7e106e945a8a633f2dec06eba936a1e9cc9/scripts/profile-memory.py#L69
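the relevant pattern is roughly this (a sketch, not the exact script; paths are placeholders):

```python
# sketch: after the runs, take a tracemalloc snapshot and print the source
# lines responsible for the most still-live allocations.
import gc
import tracemalloc

import capa.main

tracemalloc.start(25)  # keep 25 frames so tracebacks stay useful

argv = ["-r", "/path/to/capa-rules", "/path/to/sample.exe"]  # placeholders
for _ in range(10):
    capa.main.main(argv=argv)

gc.collect()
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)
```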

SigmaStar commented 3 years ago

Yes, of course. But where can I download profile-memory.py?


williballenthin commented 3 years ago

Incidentally, we also run capa as a subprocess in our clusters. This makes it easier to kill long-running jobs without affecting other runs, and to track memory, CPU, etc.

That being said, it's not an excuse for memory leaks, so I'd like to address this one if we can find it.
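for example, something along these lines (a sketch; the command line and timeout are placeholders, not our actual harness):

```python
# sketch: give each capa run a hard time budget so a stuck job can be killed
# without affecting other runs. assumes the "capa" console script is on PATH.
import subprocess

try:
    subprocess.run(
        ["capa", "-r", "/path/to/capa-rules", "/path/to/sample.exe"],  # placeholders
        timeout=600,  # seconds; the child process is killed if it runs too long
        check=True,
    )
except subprocess.TimeoutExpired:
    print("capa run exceeded the time budget and was killed")
```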

williballenthin commented 3 years ago

Yes, of course. But where can I download profile-memory.py?

https://github.com/fireeye/capa/blob/master/scripts/profile-memory.py#L69

SigmaStar commented 3 years ago

Ok, I will do that later and report the memory profiling as soon as possible.


williballenthin commented 3 years ago

run of 100 iterations:

[screenshots: memory usage over 100 iterations]

while there's some variation, it's not clear if it's an upward trend of 10 MB over 100 runs or just noise. definitely not on the order of 10 MB/run.

SigmaStar commented 3 years ago

[screenshots: profiling output] This is the output on my laptop (Windows 10 1903 with Miniconda Python 3.8.5); you can see the RSS increase with each run.

williballenthin commented 3 years ago

ahhh, this is very interesting and useful. the top 10 lines make it very clear that part of vivisect (specifically vtrace and symbol parsing) is keeping a lot of memory around. i'll dig into what it's doing and if there's a way we can:

  1. fix this upstream in viv, and
  2. fix this locally, as well, maybe by manually clearing out some fields

thanks for collecting and sharing these results!

SigmaStar commented 3 years ago

By the way, I changed the script as shown: [screenshot of the modified script]. The rules are cloned from https://github.com/fireeye/capa-rules. syscall_monitor is just the sample I happened to use and does nothing special; you can use any other Windows executable.

williballenthin commented 3 years ago

yup, that makes sense and looks fine. i think the results are reliable and it was due to linux vs windows that i didn’t see the same thing. i’ll reproduce first thing tomorrow.


williballenthin commented 3 years ago

interestingly, i am not necessarily able to reproduce this: [screenshots: profiling output]

notably i don't see vtrace retaining memory after the gc sweep; however, i think this is likely because i might not have PDBs available. let me see if i can work with the trace you posted above to identify the leak.

williballenthin commented 3 years ago

on windows (with dbghelp available) running against calc.exe with calc.pdb i still don't see the vtrace leak:

[screenshot]

i'm using viv 1.0.4 so let me downgrade and try 1.0.3

williballenthin commented 3 years ago

downgrading to v1.0.3 didn't affect the profile results.

so, unfortunately, at this point i don't have enough information to reproduce this issue or triage it. some options:

  1. if you can dig into the traceback entries in your profile results and see what memory is leaked and why (see the sketch after this list), then we can open a fix upstream, or
  2. if you can reproduce the leak in a new virtualenv and share the precise steps and binaries then i can follow along and triage
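for option 1, the kind of drill-down i mean looks roughly like this (a sketch; paths are placeholders):

```python
# sketch: print full allocation tracebacks for the top offenders so we can see
# exactly where the still-live memory was allocated.
import gc
import tracemalloc

import capa.main

tracemalloc.start(25)
argv = ["-r", "/path/to/capa-rules", "/path/to/sample.exe"]  # placeholders
for _ in range(10):
    capa.main.main(argv=argv)
gc.collect()

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("traceback")[:3]:
    print(f"{stat.size / 1024:.1f} KiB in {stat.count} blocks")
    for line in stat.traceback.format():
        print(line)
```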

otherwise, i don't have any ideas for how i can track down the leak. do you?

SigmaStar commented 3 years ago

Sorry, it's 3 a.m. here. I still suspect the problem is that some rules are poorly written: they invoke specific capa functions in a special way and eventually create cross-references, so gc.collect() cannot reclaim that part of the memory. I will do a binary-search test: disable half of the rules, enable the other half at random, and run the memory profiling. That cuts the runtime a lot, so I can run it 100 times, and then repeat the process on the half that causes the memory leak. I believe that for most rules Python can reclaim the memory. If I find a culprit I will report it here immediately.
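Roughly what I have in mind (a sketch; paths are placeholders, and rules that depend on other rules may need to stay in the same half so the subset still loads):

```python
# sketch: copy a random half of the rules into a temp directory, measure RSS
# growth against that subset, and recurse into whichever half still leaks.
import random
import shutil
import tempfile
from pathlib import Path

import psutil

import capa.main

RULES = Path("/path/to/capa-rules")  # placeholder
SAMPLE = "/path/to/sample.exe"       # placeholder

def rss_growth_mib(rules_dir, runs=10):
    """run capa `runs` times against `rules_dir` and return RSS growth in MiB."""
    proc = psutil.Process()
    before = proc.memory_info().rss
    for _ in range(runs):
        capa.main.main(argv=["-r", str(rules_dir), SAMPLE])
    return (proc.memory_info().rss - before) >> 20

rule_files = sorted(RULES.rglob("*.yml"))
random.shuffle(rule_files)
half = rule_files[: len(rule_files) // 2]

with tempfile.TemporaryDirectory() as tmp:
    for rule in half:
        shutil.copy(rule, tmp)  # note: flat copy; rules with unmet dependencies may not load
    print(f"RSS growth with this half: {rss_growth_mib(tmp)} MiB")
```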


SigmaStar commented 3 years ago

I found something very interesting. Using that binary-search method, I quickly found that the rules in load-code/pe/ significantly slow down the run and drive up memory consumption, even though these rules are very simple. Is it true that capa runs a virtual machine for these two rules (sorry, I haven't inspected the capa source)? Maybe the problem is caused by these rules.

SigmaStar commented 3 years ago

I finally got the capa memory-profile output, and this time it's a little bit different: report.txt. I ran capa.main 50 times; this is the resulting report.

williballenthin commented 3 years ago

I believe these are the key lines in the report:

#1: D:\Software\Miniconda\lib\site-packages\vtrace\platforms\win32.py:1167: 164311.1 KiB
    class TI_FINDCHILDREN_PARAMS(Structure):
#2: D:\Software\Miniconda\lib\site-packages\vtrace\platforms\win32.py:2061: 145286.7 KiB
    self.symGetTypeInfo(typeIndex, TI_FINDCHILDREN, pointer(tif))
#3: D:\Software\Miniconda\lib\site-packages\vtrace\platforms\win32.py:1168: 17506.1 KiB
    _fields_ = [ ('Count', c_ulong), ('Start', c_ulong), ("Children",c_ulong * count),]

They indicate a memory leak in vivisect vtrace related to symbol parsing.

However, I cannot reproduce this behavior or these results. As noted above, we need more information: either precise steps to rebuild the test environment, or triage from you into the code to find the leak.

SigmaStar commented 3 years ago

Yes. My Windows environment: Windows 10 10.0.19042.1165, Python 3.8.5, capa 2.0.0; rules and signatures are the same as described previously.

My Linux environment: kernel 4.15.0-154-generic, Ubuntu 18.04 LTS, Python 3.6.9, gcc 7.5.0.

williballenthin commented 2 years ago

we are not able to reproduce this issue, so we won't be able to fix it. if we become able to reproduce the issue, then this will be a high priority item to address.