mboot-github / WhoisDomain

lookup whois data and format the response in a standardized way
https://mboot-github.github.io/WhoisDomain/
MIT License

whoisdomain.query() - memory leaks #30

Open michalselma opened 7 months ago

michalselma commented 7 months ago

Describe the bug

I am running thousands of whoisdomain.query() calls across multiple processes or threads. After a few hours I noticed OS memory consumption increase from 4-5 GB to 30-35 GB. After digging into my code and setting up stricter garbage collection, I came to the conclusion that whoisdomain might be the source of the leak. Running on Windows with the default Sysinternals whois.exe.

To Reproduce

Python code:


import whoisdomain
import gc
from memory_profiler import profile

# Instead of a result from the db, any list of domain names will do.
# (db is the reporter's own database helper and is not shown here.)
sql_select = 'SELECT domain FROM three_letter_com LIMIT 50'
result = db.execute_single(sql_select, '')

@profile
def check():
    for item in result:
        print(f'Checking domain: {item[0]}')
        try:
            whoisdomain.query(item[0])
        except whoisdomain.WhoisPrivateRegistry as exc:
            print(exc)
        except whoisdomain.WhoisCommandFailed as exc:
            print(exc)
        except whoisdomain.WhoisQuotaExceeded as exc:
            print(exc)
        except whoisdomain.FailedParsingWhoisOutput as exc:
            print(exc)
        except whoisdomain.UnknownTld as exc:
            print(exc)
        except whoisdomain.UnknownDateFormat as exc:
            print(exc)
        except whoisdomain.WhoisCommandTimeout as exc:
            print(exc)
    gc.collect()

check()

Outputs

Run_01 - 10 domains

michalselma commented 7 months ago

To better visualize the growth in memory use, you can use this code:

import whoisdomain
import gc
from memory_profiler import profile
domains = ['google.com', 'microsoft.com', 'apple.com', 'dell.com', 'hp.com', 'ab.com', 'xy.com', 'tld.com',
           'samsung.com', 'ibm.com', 'lg.com', 'python.com', 'git.com', 'netflix.com', 'cisco.com', 'kfc.com',
           'nasa.com', 'esa.com', 'amazon.com', 'meta.com', 'godaddy.com', 'ovh.com', 'uber.com', 'siemens.com']

def check():
    for item in domains:
        print(f'Checking domain: {item}')
        whoisdomain_call(item)
    gc.collect()

@profile
def whoisdomain_call(domain):
    try:
        whoisdomain.query(domain)
    except whoisdomain.WhoisPrivateRegistry as exc:
        return
    except whoisdomain.WhoisCommandFailed as exc:
        return
    except whoisdomain.WhoisQuotaExceeded as exc:
        return
    except whoisdomain.FailedParsingWhoisOutput as exc:
        return
    except whoisdomain.UnknownTld as exc:
        return
    except whoisdomain.UnknownDateFormat as exc:
        return
    except whoisdomain.WhoisCommandTimeout as exc:
        return

check()

mboot-github commented 7 months ago

Thanks I will investigate, this is very helpful

mboot-github commented 7 months ago

Preliminary investigations show a steady increase in objects of type <class 're.Pattern'>. This is most likely a side effect of the new function-based regex patterns used in the TLD regex dict.
I will need to investigate further whether I can either cache them or drop them after use.

see ./tests/memtest.py and ./tests/typescript
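
One way to check for this kind of growth using only the standard library is to count live objects per type before and after a batch of queries. The sketch below is not the contents of ./tests/memtest.py; the helper name count_live_objects is made up for illustration. Note that gc.get_objects() only reports objects tracked by the collector, so some types may not show up.

```python
import gc
from collections import Counter


def count_live_objects() -> Counter:
    """Count gc-tracked objects per fully qualified type name."""
    gc.collect()  # drop unreachable objects first so only live ones are counted
    return Counter(
        f"{type(obj).__module__}.{type(obj).__qualname__}"
        for obj in gc.get_objects()
    )


if __name__ == "__main__":
    before = count_live_objects()
    # ... run a batch of whoisdomain.query() calls here ...
    after = count_live_objects()
    # Counter subtraction keeps only the types whose count went up
    for name, grown in (after - before).most_common(10):
        print(f"{name}: +{grown}")
```

If a type such as re.Pattern keeps climbing across batches that reuse the same TLDs, that points at a real leak; if it only climbs when new TLDs appear, it is more likely a cache filling up.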

Misiu commented 6 months ago

Hi there, any updates on this issue? I recently noticed that Home Assistant is using your library (https://github.com/home-assistant/core/blob/dev/homeassistant/components/whois/manifest.json#L10) but in version 0.9.27. I'd like to refactor that integration to use the newest version because the old one isn't returning information about some of the domains I own. Sadly with this memory leak issue, I'm sure my PR to Home Assistant won't get approved.

mboot-github commented 6 months ago

Currently this is not a priority for me as I'm low on time.

Preliminary investigations reveal no real memory leak, other than memory increasing as we use previously unused TLDs, which is expected.

I see the whois integration uses cloud polling, so it looks like memory would not be an issue in that case. (If the whois component is not permanently loaded, memory is released at the end of the program.)

Misiu commented 6 months ago

@mboot-github thank you for the reply. cloud_polling means the integration requires internet access (Home Assistant can work 100% offline).

The library is loaded into memory (https://github.com/home-assistant/core/blob/dev/homeassistant/components/whois/__init__.py#L26) and constantly used; the query is done once every 24 hours (https://github.com/home-assistant/core/blob/dev/homeassistant/components/whois/__init__.py#L38, https://github.com/home-assistant/core/blob/dev/homeassistant/components/whois/const.py#L15).

I'll try to update the integration to the newest version and we'll see if my domains return the correct info. Sadly, right now (with the old version of the library) I get no info about .pl domains.

mboot-github commented 6 months ago

An experimental fix is available in https://github.com/mboot-github/WhoisDomain/blob/master/testProc.py

It runs whoisdomain.q2(domain=domain, pc=pc) in a separate process and restarts that process after a specified number of calls (N).
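
The general idea (a worker process that is replaced after N calls) can also be sketched with the standard library's multiprocessing.Pool and its maxtasksperchild argument. This is not the code in testProc.py and it does not use whoisdomain.q2(); run_recycled and query_domain are hypothetical names, and the worker assumes that the result of whoisdomain.query() exposes its fields through __dict__.

```python
import multiprocessing


def query_domain(domain: str):
    """Worker that runs in the child process (hypothetical helper)."""
    import whoisdomain  # imported inside the worker so the parent need not load it

    result = whoisdomain.query(domain)
    # Assumption: the result object keeps its fields in __dict__;
    # returning a plain dict keeps the value easy to pickle back.
    return None if result is None else dict(vars(result))


def run_recycled(func, items, calls_per_process: int):
    """Call func(item) for each item in a single worker process that is
    replaced by a fresh one after calls_per_process calls."""
    with multiprocessing.Pool(processes=1, maxtasksperchild=calls_per_process) as pool:
        # chunksize=1 makes every call count as one task for maxtasksperchild
        return pool.map(func, items, chunksize=1)
```

A call such as run_recycled(query_domain, domains, calls_per_process=100) should sit under an `if __name__ == "__main__":` guard, which multiprocessing requires on Windows. Whatever the library caches in the worker is released each time that worker is replaced.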

So far, all I can see is that the "memory leak" is caused by the default caching of any newly queried TLD.
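
As a rough illustration of why a per-TLD cache can look like a leak at first and then level off, here is a minimal sketch. It is not the library's actual code; compiled_patterns_for and the two patterns are invented for the example. Each new TLD adds an entry to the cache once, while repeated queries for a known TLD reuse it.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def compiled_patterns_for(tld: str) -> dict[str, re.Pattern]:
    """Compile the (made-up) patterns for one TLD; cached per TLD."""
    return {
        "domain_name": re.compile(
            rf"Domain Name:\s*(\S+\.{re.escape(tld)})", re.IGNORECASE
        ),
        "registrar": re.compile(r"Registrar:\s*(.+)", re.IGNORECASE),
    }


if __name__ == "__main__":
    for tld in ["com", "com", "net", "com", "pl"]:
        compiled_patterns_for(tld)
    # currsize grows only with the number of distinct TLDs seen so far
    print(compiled_patterns_for.cache_info())
```

With maxsize=None the cache grows with the number of distinct TLDs queried; a finite maxsize would bound it, at the cost of recompiling patterns that were evicted.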