tenable / pyTenable

Python Library for interfacing into Tenable's platform APIs
https://pytenable.readthedocs.io
MIT License

Feature Request w/Patch: Support simple hook/callback for long-running network operations (scan downloads) #305

Closed mzpqnxow closed 3 years ago

mzpqnxow commented 3 years ago

Problem

When exporting/downloading very large scans (implemented in the export() method of tenable.io.ScansAPI here), it can be frustrating to wait for the download to complete without any simple, customizable way to gauge and report its progress, whether to the console or to some other resource on the network or the local system. The download happens in a very tight loop, so there is no simple way to track progress, let alone perform custom actions as the content streams down

My application regularly uses pyTenable to download scans that exceed 1GB, which can take upwards of 10 minutes to complete. This leaves a user at the console with no idea of an ETA, and it also prevents any supervising application from tracking progress or capturing granular metrics. As far as a process monitor is concerned, it's effectively a 10+ minute period of idle time. For a user, it's either a blank screen or very ugly (in my opinion) logging module output, depending on the log level set for the various loggers

Possible Solutions

To provide a real-time status update to the user at the console (or to a remote logging solution) the options that I am aware of are as follows:

  1. Stick with whatever pyTenable provides - pyTenable makes use of Python's native logging system for its own loggers (tenable.io.scans, for example) and, indirectly, for logging third-party activity - in this case, urllib3, which is used by requests. I concede that it is simple to tune the log level for these loggers and make do with the information they provide, but that information is often very poorly formatted, too verbose, or not verbose enough, even after adjusting the log levels. It is possible to add a custom log formatter, adapter, etc. as well, but this is clunky and limited, as the Python logging module is not well-suited for stateful progress reporting
  2. Monkey patching or overriding class methods or attributes - Monkey patching is a bad idea in general. While overriding certain methods is often a possible solution, at first glance it does not seem that this particular class/class method was designed to be extended in any simple way
  3. Change ScansAPI.export() - make a small change to export() to accept an optional callable as a keyword argument, permitting the caller to request that the two-line response.iter_content() loop be delegated to their own custom function. In my opinion this works well here because it is very lightweight and there is very little logic in the transfer loop, so it can be done very simply

Ideal Solution

I came to the conclusion that the third option is the simplest and most flexible. The callable provided by the user will need to iterate over the stream using requests.Response.iter_content(), but this is very simple and compact logic. The user only needs to write each chunk to the fobj stream and is then free to add any small action after it, executed on each iteration (or every n iterations, if they prefer). It also gives them the ability to customize the chunk size, though I'm not sure the current fixed chunk size is a serious limitation as-is.

This commit is only a few lines and takes this approach. It adds an optional kwarg (stream_hook: Optional[Callable] = None) to the export() method in tenable.io.ScansAPI. If set, the stream iteration loop is delegated to the user-specified function. Because the changes are minimal and optional, they are very unlikely to impact stability or performance for current users in any non-negligible way, as they only introduce one additional comparison
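
For context, the core of the change could look roughly like the sketch below. This is a minimal illustration rather than the exact patch; the helper name _write_stream and the local names response and fobj are only stand-ins for export()'s internals:

from io import BytesIO
from typing import Callable, Optional

import requests


def _write_stream(response: requests.Response, fobj: BytesIO,
                  stream_hook: Optional[Callable] = None) -> BytesIO:
    # Illustrative only: the same two-line loop export() uses today,
    # handed off to the caller's function when stream_hook is provided
    if stream_hook is not None:
        return stream_hook(response, fobj)
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:
            fobj.write(chunk)
    return fobj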

Caller Implementation

This is a very minimalist example of a stream_hook function that a user could implement in their application and pass to export(). It just emits a simple tqdm progress meter that shows an ETA and a transfer speed:

from io import BytesIO

import requests
from tqdm import tqdm

...
def my_stream_hook(response: requests.Response, fobj: BytesIO) -> BytesIO:
    # Size reported by the server; 0 if the content-length header is absent
    total_size = int(response.headers.get('content-length', 0))
    progress_bar = tqdm(total=total_size, unit='iB', unit_scale=True)
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:  # skip keep-alive chunks
            progress_bar.update(len(chunk))
            fobj.write(chunk)
    progress_bar.close()
    return fobj

The developer would only need to add stream_hook to their call to TenableIO.scans.export():

- fobj = tio.scans.export(scan_id, history_id=scan_history_id)
+ fobj = tio.scans.export(scan_id, history_id=scan_history_id, stream_hook=my_stream_hook)

Alternate Uses

Besides serving as a basic progress meter for scan/report downloads for a user at the console, the general idea of providing simple, non-invasive hooks like this in key places has other use cases I can think of, though I'm only really interested in this particular spot. I'm not suggesting that every use of requests.Response.iter_content() be replaced with this sort of hook, but I would be happy to apply this sort of change anywhere else it is determined to be useful to other users

As an example of a more substantive use, this allows the application to provide real-time status information to a separate application or service that may be acting as a supervisor or a metrics collector
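
As a minimal sketch of that idea (the logger name and reporting interval are arbitrary choices for illustration, not anything pyTenable provides), a hook could periodically log the bytes transferred so an external supervisor or metrics collector can pick it up:

import logging
from io import BytesIO

import requests

log = logging.getLogger('scan_download')  # illustrative logger name


def metrics_stream_hook(response: requests.Response, fobj: BytesIO) -> BytesIO:
    total = int(response.headers.get('content-length', 0))
    transferred = 0
    next_report = 10 * 1024 * 1024  # emit a progress record roughly every 10 MiB
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:
            fobj.write(chunk)
            transferred += len(chunk)
            if transferred >= next_report:
                log.info('scan download progress: %d/%d bytes', transferred, total)
                next_report += 10 * 1024 * 1024
    return fobj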

Conclusion

If a PR for this is something that the maintainers would be open to, I would like to receive some feedback/suggestions on the details of what the signature of the callable should be, assuming it is not acceptable as-is. It is probably best to send the chunk_size parameter to the callable, so the developer doesn't need to specify an arbitrary value for it themselves. The default value (1024) could be passed through to it to ensure consistent behavior:

def stream_hook(response: requests.Response, fobj: BytesIO, chunk_size: int) -> BytesIO:
    ...

If it was desired to leave room for providing the callable with additional context or other information in the future, the recommended signature could be something like this:

def stream_hook(response: requests.Response, fobj: BytesIO, chunk_size: int, **kwargs) -> BytesIO:
    ...
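
For illustration, the tqdm example from earlier could be adapted to that extended signature by honoring the chunk_size it is handed and ignoring any extra context it doesn't need (again, just a sketch of the idea):

from io import BytesIO

import requests
from tqdm import tqdm


def my_stream_hook(response: requests.Response, fobj: BytesIO,
                   chunk_size: int, **kwargs) -> BytesIO:
    total_size = int(response.headers.get('content-length', 0))
    progress_bar = tqdm(total=total_size, unit='iB', unit_scale=True)
    # Use the chunk_size passed in by export() rather than hard-coding one
    for chunk in response.iter_content(chunk_size=chunk_size):
        if chunk:
            progress_bar.update(len(chunk))
            fobj.write(chunk)
    progress_bar.close()
    return fobj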

A completely different implementation would be something "heavier" - a true adapter-style class, with a newly written base class provided by pyTenable. I don't see any value in the additional complexity required for that approach, so I'm not suggesting it

If there is absolutely no interest in accepting an enhancement like this, I have no problem maintaining this on my own fork/branch, but it would be very nice for it to be considered as an upstream PR, especially considering the low likelihood of it negatively impacting current users of pyTenable

The branch for the above-referenced change can be found here. I do not intend to submit it as a PR unless/until there is some interest from the maintainers in accepting it

Thank you, I appreciate any feedback on this and also appreciate the time and effort you put into maintaining this library. It's one of the most robust and well-documented libraries when compared with those provided by other security vendors' products and services. I won't mention any names here, but I've found many vendors provide nothing more than a very basic test suite and some stale or out-of-date documentation

mzpqnxow commented 3 years ago

I have added PR #311 for this as a proposal

mzpqnxow commented 3 years ago

I have created a new PR for this, #332. Rather than rebase the original one (which already had a bunch of clunky commits in it) I found it easier to do it this way. Please see #332 instead of #311

Thanks!

levwais commented 3 years ago

change merged, will be part of release 1.3.1

mzpqnxow commented 3 years ago

> change merged, will be part of release 1.3.1

Thanks!