pirxthepilot / wtfis

Passive hostname, domain and IP lookup tool for non-robots
MIT License

Would it be a good idea to use a filesystem-backed persistent cache to minimize API usage? #72

Open zbalkan opened 8 months ago

zbalkan commented 8 months ago

I used this solution in my wtfis-Wazuh integration and it works smoothly.

import json
import os
from typing import Optional

import diskcache

# Config, Resolver, is_private and __debug are helpers defined in my
# integration code; Resolver and Config are described below.

def __query_with_cache(target: str, config: Config, cache_dir: str = './') -> Optional[dict]:

    # Skip lookups for targets in private IP ranges
    if is_private(target=target):
        __debug(f"The target IP is in private range: {target}")
        return None

    # Create the cache directory if it does not exist
    if not os.path.exists(cache_dir):
        os.makedirs(cache_dir, 0o700)

    __debug("Opening cache")
    with diskcache.Cache(directory=cache_dir) as cache:

        # Enable stats if not enabled on the first run
        cache.stats(enable=True)
        # Expire old items first
        cache.expire()

        __debug("Checking cache")
        cache_result: Optional[str] = cache.get(target)  # type: ignore

        if cache_result:
            __debug("Found the value in cache")
            return dict(json.loads(cache_result))

        __debug("Cache miss. Querying APIs...")

        # Initiate resolver
        resolver = Resolver(target, config)

        # Fetch data
        resolver.fetch()

        # Get result
        export = resolver.export()

        if not export:
            return None

        __debug("Adding the response to cache")
        cache.add(target, json.dumps(export, sort_keys=True))
        return export

To make the code above understandable, I need to give some context. For that integration I had to turn wtfis into a library that outputs JSON results, so the external script can simply call the library's methods. I first stripped away all the UI-related code, then created a wrapper class called Resolver, which contains the generate_entity_handler method. A fetch and an export method were then added as the main interface to the library.
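Roughly, the wrapper looks like the sketch below. This is simplified: the real fetch() builds the entity handler through generate_entity_handler and runs the actual API lookups, and Config is sketched a bit further down.

from typing import Any, Optional


class Resolver:
    """UI-free facade: takes a target and a Config, hands back plain dicts."""

    def __init__(self, target: str, config: Any) -> None:
        self.target = target
        self.config = config          # Config instance, sketched below
        self._results: Optional[dict] = None

    def fetch(self) -> None:
        # In the real wrapper this calls generate_entity_handler() to pick a
        # domain or IP handler and runs the API lookups through it; the
        # placeholder below only marks where those results end up.
        self._results = {"target": self.target}

    def export(self) -> Optional[dict]:
        # Everything fetch() collected, as a JSON-serializable dict
        return self._results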


In wtfis you use environment variables stored in the .env.wtfis file. To integrate smoothly, I created a class called Config that I can pass to Resolver. The Config instance can be created any way you like; in my case, Wazuh launches the Python script through a bash script and passes arguments to it, so I read those arguments, build the Config instance, and pass it to the Resolver together with the target IP or domain name.
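A rough sketch of that glue code follows; the flag names and Config fields are mine and purely illustrative, not wtfis's.

import argparse
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Config:
    # Field names are illustrative; upstream wtfis reads the equivalent
    # values from the .env.wtfis environment file instead.
    virustotal_api_key: str
    shodan_api_key: str = ''
    max_resolutions: int = 3


def parse_config() -> Tuple[str, Config]:
    # Wazuh's bash wrapper passes these as command-line arguments
    parser = argparse.ArgumentParser()
    parser.add_argument('target')
    parser.add_argument('--vt-key', required=True)
    parser.add_argument('--shodan-key', default='')
    args = parser.parse_args()
    return args.target, Config(virustotal_api_key=args.vt_key,
                               shodan_api_key=args.shodan_key)

With that in place, the integration boils down to target, config = parse_config() followed by __query_with_cache(target, config).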


These two methods are the interface of the wtfis library. Everything else was moved under wtfis.internal.

The code above then uses the SQLite-backed cache that diskcache provides. I am using the default cache settings, but it is possible to customize the parameters, choose a different eviction strategy, and set a shorter lifetime for cached entries.
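For example, the defaults could be tightened roughly like this; the numbers are only illustrative and should follow your API quotas and disk budget.

import diskcache

# Illustrative settings: cap the cache at 64 MB, evict least-recently-used
# entries first, and expire each result after 24 hours instead of keeping it.
cache = diskcache.Cache(
    directory='./wtfis-cache',
    size_limit=64 * 1024 * 1024,           # bytes
    eviction_policy='least-recently-used',
)

cache.add('example.com', '{"example": "payload"}', expire=24 * 60 * 60)  # seconds
cache.close()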

The idea is to minimize API usage; it may help in the long term.

pirxthepilot commented 8 months ago

Hey @zbalkan, this is awesome! Really cool how you were able to repurpose wtfis! :)

I think this is a good idea, but my concern is mostly the additional overhead in maintaining this feature. Some questions that come to mind:

Thanks!

zbalkan commented 8 months ago
pirxthepilot commented 8 months ago

Thanks @zbalkan!