Open trifle opened 8 years ago
We've written a warcprox context manager for Social Feed Manager: https://github.com/gwu-libraries/sfm-utils/blob/master/sfmutils/warcprox.py
In our case, we instantiate warcprox as a separate process rather than a separate thread.
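For readers comparing the two approaches, here is a minimal sketch of the process-based variant. It is not the SFM implementation linked above: `background_process` and `stop_timeout` are hypothetical names, and the child command is a stand-in, so the actual warcprox command line would have to be substituted.

```python
import subprocess
import sys
from contextlib import contextmanager


@contextmanager
def background_process(cmd, stop_timeout=10):
    """Run `cmd` as a child process for the duration of the with-block."""
    proc = subprocess.Popen(cmd)
    try:
        # Hand the Popen handle to the caller (pid, poll(), ...)
        yield proc
    finally:
        # Ask the child to exit, then escalate if it does not comply in time
        proc.terminate()
        try:
            proc.wait(timeout=stop_timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()


if __name__ == "__main__":
    # Stand-in child; in practice this list would hold the warcprox command line
    stand_in = [sys.executable, "-c", "import time; time.sleep(60)"]
    with background_process(stand_in) as proc:
        print("child running:", proc.poll() is None)
    print("child stopped:", proc.returncode is not None)
```

Compared with a thread, a separate process keeps the proxy's state isolated from the caller, at the cost of having to pass the address and CA file location around explicitly.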
On Sat, May 7, 2016 at 4:34 PM, Pascal Jürgens notifications@github.com wrote:
Hi,
I've used warcproxy indirectly through the perma project who (as you probably know) do the phantomjs + warcproxy dance to create archives.
While reading and modifying the code I noticed that the proxy usage pattern almost perfectly matches the use case of context managers:
- set up background scaffolding (the proxy)
- hand over a handle to the relevant context variables (a class instance or at least the CA file location and the ip:port address)
- pull down everything once finished (join the threads)
Would you consider adding such a context manager to the warcproxy project? Adding it here should be the best fit, in case the class API would need to be modified.
PS: cc @jcushman https://github.com/jcushman since their code might benefit from this (hope you don't mind the ping)
A rough sketch of the idea looks like this (pasted together from perma.cc code and simplified, not actually runnable):

    import threading
    from contextlib import contextmanager

    @contextmanager
    def warc_proxy(*args, **kwargs):
        """ Context manager for warcproxy """

        # Set up proxy instance
        # use kwargs with default arguments
        # (some_q is the queue shared by proxy and writer; not defined in this sketch)
        proxy = WarcProxy(
            server_address=('127.0.0.1', kwargs.get('port', 27500)),
            recorded_url_q=some_q,
        )
        writer_thread = WarcWriterThread(recorded_url_q=some_q)
        proxy.warcprox_controller = WarcproxController(proxy, writer_thread)
        proxy.warcprox_thread = threading.Thread(
            target=proxy.warcprox_controller.run_until_shutdown)
        proxy.warcprox_thread.start()

        try:
            # whatever we are yielding would need to carry all relevant data
            # such as adding the threads as instance attributes
            yield proxy
        finally:
            # tear down
            proxy.warcprox_controller.stop.set()
            proxy.warcprox_thread.join()

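Since the sketch above is not runnable as pasted, the same three steps (set up, hand over a handle, tear down) can be tried out with a standard-library TCP server standing in for the proxy. This is only an illustration of the pattern: `background_server` and `_EchoHandler` are made-up names, and nothing here uses the warcprox API.

```python
import socketserver
import threading
from contextlib import contextmanager


class _EchoHandler(socketserver.BaseRequestHandler):
    """Placeholder handler; a real proxy would record traffic here."""

    def handle(self):
        self.request.sendall(self.request.recv(1024))


@contextmanager
def background_server(host="127.0.0.1", port=0):
    # Set up: bind the server (port 0 lets the OS pick a free port)
    server = socketserver.TCPServer((host, port), _EchoHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        # Hand over: attach the thread so the caller gets everything in one object
        server.thread = thread
        yield server
    finally:
        # Tear down: stop the serve loop, wait for the thread, release the socket
        server.shutdown()
        thread.join()
        server.server_close()


if __name__ == "__main__":
    with background_server() as srv:
        print("listening on", srv.server_address)
    print("thread alive after exit:", srv.thread.is_alive())
```

Because the teardown sits in `finally`, the thread is joined even if the body of the `with` block raises.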
@justinlittman Thanks, that's pretty much like what I had in mind!
By the way, I had read about twarc but didn't know about SFM. Looks pretty cool! That said, I think the fact that almost everyone seems to need to fork warcproxy for their project is a sign that it might benefit from integrating changes back into the original project - at least that's what I'd love to see.
Agree.
I recently noticed that @nlevitt has a mess of changes underway in https://github.com/internetarchive/warcprox/pull/17. @nlevitt -- care to comment on the roadmap for 2?
This is great, thanks for the suggestion. I'll need to take a closer look to see where in the code it would live most comfortably.
For the questions about integrating outside changes, and 2.x, I opened #19. I'll comment more over there.
Great! Would you like a pull request written against #17, @nlevitt ?
@trifle I am curious about the use case with the context manager. I am working on a generalized component architecture for web archiving which will include a recording proxy, and it would be great to understand your particular use case with the context manager. (A screenshot creation workflow is something that I'd like to include especially).
I think the traditional approach is to start the proxy running in the background and have it record into a WARC (or several WARCs) over a period of time. When is it necessary to create a new proxy, wrapped in a context manager, for each request? Is it to create a new WARC for each request? Is it necessary to turn off the proxy for some other reasons? Or is it just for a one-off task?
I'm guessing that it is to have more control over which WARC a request is recorded to, but perhaps there are other reasons.
In the case of Social Feed Manager, it is for control over which WARC a request is recorded to.
@ikreymer I'd certainly love to see such an architecture! (see #19)
Yes, a context manager would be for creating WARC files for a very small number of requests.
I guess the single-shot WARCs are a question of your scope: whereas in web archiving your base units might be sites/domains that are crawled in one go, some people require control, error handling and access on a page level.
Projects such as perma will create one WARC per single webpage, since they archive individual articles and do one at a time. I'm a mass communication researcher (probably the same crowd that @justinlittman works with), which means that I routinely collect large (100s to 10000s) batches of articles spanning many domains. In most cases, the discovery process is not a crawl with a somewhat predictable frontier but rather external batch or stream events (think Twitter).
Now, in such a situation it's often quite inconvenient to produce large WARCs: The grouping in terms of time and order of incoming URLs at capture and at access time is probably unpredictable (bursty) and differs a lot. Which means that if I bundle records by domain while recording them but want to query across domains later on, that's going to be complicated.
@trifle a pull request against 2.x would be welcome. A pull request against master would also be welcome. Whichever or both. :)
@trifle @ilya 2.x supports a special request header called "warcprox-meta", which, among a whole bunch of other things, lets you specify the name of the WARC file (a prefix actually; warcprox will add a serial number). That way you can write many small WARCs using one long-running warcprox process.
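A rough sketch of what sending that header could look like from Python, using only the standard library. Several things here are assumptions to check against the warcprox 2.x documentation: that the header value is a JSON object, that the key is spelled `warc-prefix`, and the proxy address `127.0.0.1:8000`, which is a placeholder. The helper names are hypothetical, and only plain-HTTP URLs are routed through the proxy (HTTPS would additionally need the proxy's CA certificate).

```python
import json
import urllib.request


def warcprox_meta_header(warc_prefix):
    """Build the request header asking the proxy for a specific WARC prefix."""
    # "warc-prefix" is an assumed key name; verify it against the warcprox docs
    return {"Warcprox-Meta": json.dumps({"warc-prefix": warc_prefix})}


def fetch_via_proxy(url, warc_prefix, proxy="127.0.0.1:8000"):
    """Fetch `url` through a running proxy, tagging the request with a prefix."""
    request = urllib.request.Request(url, headers=warcprox_meta_header(warc_prefix))
    # Route plain-HTTP traffic through the (placeholder) proxy address
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": "http://" + proxy})
    )
    with opener.open(request, timeout=30) as response:
        return response.read()
```

With something along these lines, each batch or article could get its own prefix per request, without starting and stopping the proxy.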
In my case I have an automated setting with third-party software to perform the actual archival. I'm using warcprox mainly for convenience so I don't have to write my own WARC addition to the software. A context manager would be nice so that I don't have to spawn an additional process for the proxy and could just include it in my (Python-based) code.
edit: Ah, and here is a simple usage example:
Now if that's not tidy I don't know what is!