laluka / bypass-url-parser

bypass-url-parser
https://linktr.ee/TheLaluka
GNU Affero General Public License v3.0

Source code performance #13

Closed jtof-fap closed 2 years ago

jtof-fap commented 2 years ago

This pull request solves the awful performance problem at program start-up, when generating curl requests.

After some investigation and a little profiling, the problem appears to come from storing the generated curl_items in a list:

for header in xxx:
    for internal in yyy:
        …
        if item not in self.curl_items:
            self.curl_items.append(item)
        …

To check whether a CurlItem object is already in the list, CurlItem's __eq__() method is called. It compares the attributes of the two objects, which the __attrs() method returns as a tuple:

def __attrs(self):
    return (self.target_url, self.target_ip, self.bypass_mode, self.curl_base_options,
                  self.request_curl_cmd, self.response_raw_output)

def __eq__(self, other) -> bool:
    return isinstance(other, self.__class__) and self.__attrs() == other.__attrs()

For each item added to the curl_items list, the candidate object is compared, one by one, with the items already in the list (if item not in self.curl_items). So, to generate all payloads, __eq__() is called 13001247 times and __attrs() 26002494 times. Damn!

 time ./bypass_url_parse.py --dump-payloads -u http://127.0.0.1:8000/foo/bar -v > /dev/null

real    0m25.489s
user    0m25.318s
sys     0m0.104s
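
To make the cost concrete, here is a minimal sketch of the same pattern. The `Item` class, its attributes, and the `dedupe_with_list` helper are hypothetical stand-ins and not the project's actual code. It only counts how often `__eq__` runs when a list is used for de-duplication:

```python
class Item:
    """Hypothetical stand-in for CurlItem: equality compares a tuple of attributes."""

    eq_calls = 0  # class-level counter, for illustration only

    def __init__(self, url, mode):
        self.url = url
        self.mode = mode

    def __eq__(self, other):
        Item.eq_calls += 1
        return isinstance(other, Item) and (self.url, self.mode) == (other.url, other.mode)


def dedupe_with_list(items):
    """Same pattern as the loop above: a linear scan before every append."""
    unique = []
    for item in items:
        if item not in unique:  # compares against stored items until a match is found
            unique.append(item)
    return unique


n = 200
items = [Item(f"http://127.0.0.1/{i}", "demo") for i in range(n)]
unique = dedupe_with_list(items)

# With n distinct items, the scan performs 0 + 1 + ... + (n - 1) comparisons.
print(len(unique), Item.eq_calls)  # 200 19900
```

With n distinct items the number of comparisons is n(n-1)/2, so it grows quadratically with the number of payloads.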

A first step (first commit) is to refactor __eq__() to drop the unnecessary calls to __attrs() and use the built-in __dict__ attribute instead, which already contains everything required to compare two objects:

def __eq__(self, other: object) -> bool:
    return other.__class__ == CurlItem and self.__dict__ == other.__dict__

With 26002494 fewer tuples to build, the code is already faster:

 time ./bypass_url_parse.py --dump-payloads -u http://127.0.0.1:8000/foo/bar -v > /dev/null

real    0m8.741s
user    0m8.662s
sys     0m0.052s
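
As a rough illustration of the `__dict__`-based comparison, the sketch below uses a hypothetical two-attribute `Item` class (the real CurlItem has more fields). It shows that two objects with identical instance attributes compare equal, and that anything of another class does not:

```python
class Item:
    """Hypothetical stand-in for CurlItem, reduced to two attributes."""

    def __init__(self, target_url, bypass_mode):
        self.target_url = target_url
        self.bypass_mode = bypass_mode

    def __eq__(self, other):
        # Exact class match first, then one dict comparison of all instance attributes.
        return other.__class__ is Item and self.__dict__ == other.__dict__


a = Item("http://127.0.0.1:8000/foo/bar", "demo_mode")
b = Item("http://127.0.0.1:8000/foo/bar", "demo_mode")
c = Item("http://127.0.0.1:8000/foo/bar", "other_mode")

print(a == b)        # True: same attributes
print(a == c)        # False: bypass_mode differs
print(a == "a str")  # False: not an Item
```

One consequence of this design is that every instance attribute takes part in the comparison automatically, including any attribute added to the class later.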

That is still not enough. The best solution is to abandon the list in favor of a set, which is much better suited here. A set uses __hash__() to locate an element and only falls back to __eq__() for elements with the same hash, so a membership test no longer scans the whole collection.

The initial CurlItem __hash__() function looks like:

def __hash__(self) -> int:
    return hash(self.__attrs())

No, not __attrs() again :-( So we move to:

def __hash__(self) -> int:
    return hash(str(self.__dict__))

And the result speaks for itself:

 time ./bypass_url_parse.py --dump-payloads -u http://127.0.0.1:8000/foo/bar -v > /dev/null

real    0m0.364s
user    0m0.315s
sys     0m0.048s
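
The set-based approach can be sketched as follows. As before, `Item` and its attributes are hypothetical stand-ins for CurlItem, and the URLs are made up for the demonstration:

```python
class Item:
    """Hypothetical stand-in for CurlItem, reduced to two attributes."""

    def __init__(self, target_url, bypass_mode):
        self.target_url = target_url
        self.bypass_mode = bypass_mode

    def __eq__(self, other):
        return other.__class__ is Item and self.__dict__ == other.__dict__

    def __hash__(self):
        # Equal objects must hash equally. Attributes are always assigned in the
        # same order in __init__, so str(__dict__) is identical for equal items.
        return hash(str(self.__dict__))


seen = set()
for i in range(1000):
    # 1000 candidates, but only 100 distinct URLs
    seen.add(Item(f"http://127.0.0.1:8000/{i % 100}", "demo_mode"))

print(len(seen))  # 100
```

Two caveats apply to this sketch: an item must not be mutated after it has been added to the set, since its hash would change, and a set does not preserve insertion order. If first-seen order matters, `dict.fromkeys(items)` is one way to de-duplicate hashable items while keeping it.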