projectdiscovery / subfinder

Fast passive subdomain enumeration tool.
https://projectdiscovery.io
MIT License

Improve sources memory consumption #279

Closed: vzamanillo closed this issue 4 years ago

vzamanillo commented 4 years ago

While doing some memory profiling with pprof I've discovered that some sources increase the memory footprint of subfinder excessively, e.g. waybackarchive.

[screenshot: wayback1, pprof memory profile]

This is because the response body is very large and we read all of it into memory with ioutil.ReadAll(pagesResp.Body).

After changing the source to read the response stream with bufio.NewReader(pagesResp.Body), memory consumption is drastically reduced.

[screenshot: wayback2, pprof memory profile after the change]
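
A minimal sketch of the streaming approach, assuming an illustrative streamLines helper, example URL, and handle callback (this is not the actual subfinder source code):

package main

import (
    "bufio"
    "fmt"
    "io"
    "net/http"
)

// streamLines reads the response body line by line instead of buffering the
// whole payload with ioutil.ReadAll, so memory stays roughly constant even
// for very large responses such as waybackarchive's.
func streamLines(url string, handle func(line string)) error {
    resp, err := http.Get(url)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    reader := bufio.NewReader(resp.Body)
    for {
        line, err := reader.ReadString('\n')
        if line != "" {
            handle(line) // e.g. run the subdomain regexp on this single line
        }
        if err == io.EOF {
            return nil
        }
        if err != nil {
            return err
        }
    }
}

func main() {
    _ = streamLines("https://example.com", func(line string) {
        fmt.Print(line)
    })
}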

It happens in other sources too, especially those that return JSON but don't use a decoder to process it: all the content is put in memory with ioutil.ReadAll(pagesResp.Body) and subdomainExtractor then matches subdomains with a regexp (e.g. threatminer, threatcrowd...).

It would be nice to avoid ioutil.ReadAll(pagesResp.Body) wherever possible and to review the remaining sources so they consume their JSON responses properly.
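
For the JSON sources, a sketch of decoding straight from the stream with encoding/json instead of ReadAll plus a regexp; the threatResponse struct and the URL below are hypothetical, each real source has its own response schema:

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
)

// threatResponse is a hypothetical response shape; real sources such as
// threatcrowd or threatminer define their own schemas.
type threatResponse struct {
    Subdomains []string `json:"subdomains"`
}

// decodeSubdomains decodes the JSON response directly from the body stream,
// avoiding the ioutil.ReadAll buffer and the regexp-based subdomainExtractor.
func decodeSubdomains(url string) ([]string, error) {
    resp, err := http.Get(url)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    var data threatResponse
    if err := json.NewDecoder(resp.Body).Decode(&data); err != nil {
        return nil, err
    }
    return data.Subdomains, nil
}

func main() {
    subdomains, err := decodeSubdomains("https://api.example.org/subdomains/example.com")
    if err != nil {
        fmt.Println(err)
        return
    }
    for _, s := range subdomains {
        fmt.Println(s)
    }
}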

We could do this after merging #278, or introduce the changes directly in that branch.

vzamanillo commented 4 years ago

First results after some rework. I have excluded github because it takes a long time to finish, but it increases consumption by only about 5 MB and keeps it constant until it finishes.

[screenshot: mem-test, memory consumption results after the rework]

ehsandeep commented 4 years ago

Hey @vzamanillo, we didn't focus on memory profiling in the past because subfinder is not something we run all the time; mostly it's a one-time run before you start with your target. But this is definitely one of the things to improve to make it more mature.

Apart from the memory consumption improvement, do you also notice an improvement in overall run time (as we can see in the above PoC)? Is that a result of the linting work on your side, or a small improvement because of the better memory management?

vzamanillo commented 4 years ago

Hi @bauthard, there is no significant improvement in overall run time. There is in some cases, but the difference is not that important. In fact, for sources with large response data, such as commoncrawl or waybackarchive, it is a few milliseconds slower because the content of the responses is iterated line by line instead of putting everything in memory and processing it at once.

These memory consumption improvements are not in the branch of pull request #278; they are changes I have made on top of that branch, and I have them ready to merge once #278 lands. (I think it is not the time to introduce them in #278, both to keep the review manageable and because the scope of these changes is different from the changes we are talking about there.)

vzamanillo commented 4 years ago

Step-by-step guide to profiling Go CPU / memory usage.

Add the profile package to the main.go imports and defer the profiler start in main():

import (
    "context"

    "github.com/pkg/profile"
    "github.com/projectdiscovery/gologger"
    "github.com/projectdiscovery/subfinder/pkg/runner"
)

func main() {
    defer profile.Start().Stop() // CPU profiling (default)
    // defer profile.Start(profile.MemProfile).Stop() // Memory profiling

    // ... (rest of main unchanged)
}

Run main.go:

# go run main.go -d uber.com -sources alienvault

After it finishes you will see the following message:

2020/07/27 13:46:24 profile: cpu profiling disabled, /tmp/profile978571390/cpu.pprof

Run pprof and inspect the results (it will open a new browser window):

go tool pprof -http=:8080 /tmp/profile093511175/cpu.pprof

[screenshot: pprofui, pprof web UI showing the captured profile]
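
To profile memory instead of CPU with the same setup, switch the deferred call in main() and point pprof at the resulting file. This assumes the default behaviour of github.com/pkg/profile, which writes mem.pprof to a temporary directory and prints its path when the run ends (the XXXXXXXXX placeholder below stands for that generated directory name):

    defer profile.Start(profile.MemProfile).Stop() // replaces the CPU profiling line

# go run main.go -d uber.com -sources alienvault
go tool pprof -http=:8080 /tmp/profileXXXXXXXXX/mem.pprof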

freeCodeCamp pprof guide: https://www.freecodecamp.org/news/how-i-investigated-memory-leaks-in-go-using-pprof-on-a-large-codebase-4bec4325e192/