internetarchive / warcprox

WARC writing MITM HTTP/S proxy
379 stars · 54 forks

warcproxy context manager? #18

Open trifle opened 8 years ago

trifle commented 8 years ago

Hi,

I've used warcprox indirectly through the perma project, which (as you probably know) does the phantomjs + warcprox dance to create archives.

While reading and modifying the code I noticed that the proxy usage pattern almost perfectly matches the use case of context managers:

  • set up background scaffolding (the proxy)
  • hand over a handle to the relevant context variables (a class instance or at least the CA file location and the ip:port address)
  • tear down everything once finished (join the threads)

Would you consider adding such a context manager to the warcprox project? Adding it here seems like the best fit, in case the class API needs to be modified.

PS: cc @jcushman since their code might benefit from this (hope you don't mind the ping)

A rough sketch of the idea looks like this (pasted together from perma.cc code and simplified, not actually runnable):

import queue
import threading
from contextlib import contextmanager

@contextmanager
def warc_proxy(*args, **kwargs):
    """
    Context manager for warcprox
    """
    # Set up proxy instance, using kwargs with default arguments
    recorded_url_q = queue.Queue()
    proxy = WarcProxy(
        server_address=('127.0.0.1', kwargs.get('port', 27500)),
        recorded_url_q=recorded_url_q,
        )
    writer_thread = WarcWriterThread(recorded_url_q=recorded_url_q)
    proxy.warcprox_controller = WarcproxController(proxy, writer_thread)
    proxy.warcprox_thread = threading.Thread(
        target=proxy.warcprox_controller.run_until_shutdown)
    proxy.warcprox_thread.start()

    try:
        # whatever we are yielding needs to carry all relevant data,
        # e.g. the threads as instance attributes
        yield proxy
    finally:
        # tear down
        proxy.warcprox_controller.stop.set()
        proxy.warcprox_thread.join()

edit: Ah, and here is a simple usage example:

with warc_proxy(port=5000) as proxy:
    browser = setup_browser(ca=proxy.ca.ca_file, address=proxy.server_address)
    browser.do_stuff()
# proxy with all threads disappears at scope exit

Now if that's not tidy I don't know what is!

justinlittman commented 8 years ago

We've written a warcprox context manager for Social Feed Manager: https://github.com/gwu-libraries/sfm-utils/blob/master/sfmutils/warcprox.py

In our case, we instantiate warcprox as a separate process rather than a separate thread.
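For illustration, here is a minimal sketch of what a process-based variant can look like. The names (`warc_proxy_process`, `_run_proxy`) are made up for this sketch, and the child-process body is a placeholder; the real sfm-utils manager wires up actual warcprox objects inside the child process.

```python
import multiprocessing
from contextlib import contextmanager

def _run_proxy(port, stop_event):
    # placeholder for the real entry point: a real version would build a
    # WarcproxController here and run it until stop_event is set
    stop_event.wait()

@contextmanager
def warc_proxy_process(port=27500):
    """Run the proxy in a child process instead of a thread."""
    stop_event = multiprocessing.Event()
    proc = multiprocessing.Process(target=_run_proxy, args=(port, stop_event))
    proc.start()
    try:
        yield proc
    finally:
        stop_event.set()
        proc.join(timeout=10)
        if proc.is_alive():
            proc.terminate()  # last resort if the proxy did not shut down
```

One upside of a separate process is that a hung proxy can be terminated outright at scope exit, which is not possible with a thread.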

On Sat, May 7, 2016 at 4:34 PM, Pascal Jürgens notifications@github.com wrote:

[…]

trifle commented 8 years ago

@justinlittman Thanks, that's pretty much like what I had in mind!

By the way, I had read about twarc but didn't know about SFM. Looks pretty cool! That said, I think the fact that almost everyone seems to need to fork warcprox for their project is a sign that it might benefit from integrating those changes back into the original project - at least that's what I'd love to see.

justinlittman commented 8 years ago

Agree.

I recently noticed that @nlevitt has a mess of changes underway in https://github.com/internetarchive/warcprox/pull/17. @nlevitt -- care to comment on the roadmap for 2.x?

nlevitt commented 8 years ago

This is great, thanks for the suggestion. I'll need to take a closer look to see where in the code it would live most comfortably.

For the questions about integrating outside changes, and 2.x, I opened #19. I'll comment more over there.

trifle commented 8 years ago

Great! Would you like a pull request written against #17, @nlevitt ?

ikreymer commented 8 years ago

@trifle I am curious about the use case with the context manager. I am working on a generalized component architecture for web archiving which will include a recording proxy, and it would be great to understand your particular use case with the context manager. (A screenshot creation workflow is something that I'd like to include especially).

I think the traditional approach is to start the proxy running in the background and have it record into a WARC (or several WARCs) over a period of time. When is it necessary to create a new proxy, wrapped in a context manager, for each request? Is it to create a new WARC for each request? Is it necessary to turn off the proxy for some other reason? Or is it just for a one-off task?

I'm guessing that it is to have more control over which WARC a request is recorded to, but perhaps there are other reasons.

justinlittman commented 8 years ago

In the case of Social Feed Manager, it is for control over which WARC a request is recorded to.

trifle commented 8 years ago

@ikreymer I'd certainly love to see such an architecture! (see #19)

Yes, a context manager would be for creating WARC files for a very small number of requests.

I guess the single-shot WARCs are a question of scope: whereas in web archiving your base units might be sites/domains that are crawled in one go, some people require control, error handling, and access at the page level.

Projects such as perma will create one WARC per single webpage, since they archive individual articles and do one at a time. I'm a mass communication researcher (probably the same crowd that @justinlittman works with), which means that I routinely collect large batches (100s to 10,000s) of articles spanning many domains. In most cases, the discovery process is not a crawl with a somewhat predictable frontier but rather external batch or stream events (think twitter).

Now, in such a situation it's often quite inconvenient to produce large WARCs: the grouping of incoming URLs, in terms of time and order, is unpredictable (bursty) at capture time and probably differs a lot at access time. That means that if I bundle records by domain while recording them but want to query across domains later on, things get complicated.

nlevitt commented 8 years ago

@trifle a pull request against 2.x would be welcome. A pull request against master would also be welcome. Whichever or both. :)

nlevitt commented 8 years ago

@trifle @ilya 2.x supports a special request header called "warcprox-meta", which, among a whole bunch of other things, lets you specify the name of the warc file (the prefix, actually; warcprox will add a serial number). That way you can write many small warcs using one long-running warcprox process.
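From the client side, that header is just JSON sent through the proxy. A rough sketch of what that could look like (the port, prefix, and use of requests are illustrative assumptions, not warcprox's own docs):

```python
import json
import requests

# Warcprox-Meta is a JSON request header; "warc-prefix" names the warc
# series this capture should be written into (warcprox appends a serial
# number to the actual filename)
session = requests.Session()
session.proxies = {"http": "http://127.0.0.1:27500",
                   "https": "http://127.0.0.1:27500"}
session.headers["Warcprox-Meta"] = json.dumps({"warc-prefix": "article-1234"})

# session.get("http://example.com/", verify=ca_file) would now be recorded
# into warcs named article-1234-* by one long-running warcprox process
```

This way each logical capture job can pick its own warc series without restarting the proxy.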

TheTechRobo commented 1 year ago

In my case I have an automated setup where third-party software performs the actual archival. I'm using warcprox mainly for convenience, so I don't have to write my own WARC support for that software. A context manager would be nice so that I don't have to spawn an additional process for the proxy and could just include it in my (Python-based) code.