ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Add viaHeritrix download option to WrenderProcessor #5

Closed by anjackson 5 years ago

anjackson commented 7 years ago

The current WrenderProcessor expects warcprox to be used to capture the rendered resources. Although the quality may suffer, it may be useful to add a mode that lets Heritrix3 (re)download the resources, to make (initial?) deployment simpler.

It would just be a case of enqueueing the other URLs in the request-response entries as E links, and handing processing along the chain rather than skipping the rest of the processors.
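
A rough sketch of the link-gathering step, in Python for brevity (the actual processor is Java, so this is illustrative only). It assumes the renderer output follows the usual HAR layout, with the captured requests under `log.entries[*].request.url`; the function name `embed_links` is hypothetical and not part of any existing module.

```python
# Hop letter used for embedded resources (the "E links" mentioned above).
EMBED_HOP = "E"


def embed_links(har, page_url):
    """Return (url, hop) pairs for each captured request except the page itself.

    Assumes a HAR-style dict: har["log"]["entries"][i]["request"]["url"].
    """
    seen = set()
    links = []
    for entry in har.get("log", {}).get("entries", []):
        url = entry.get("request", {}).get("url")
        # Skip missing URLs, the rendered page itself, and duplicates.
        if not url or url == page_url or url in seen:
            continue
        # Only HTTP(S) URLs can be re-downloaded; drop data:, about:, etc.
        if not url.startswith(("http://", "https://")):
            continue
        seen.add(url)
        links.append((url, EMBED_HOP))
    return links
```

Each returned pair would then be enqueued as an outlink of the rendered page, and the URI handed on to the rest of the processor chain instead of being marked as finished.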

Should also gobble up all those delicious cookies...

    "pages": [
      {
        "cookies": [
          {
            "domain": ".bl.uk", 
            "expires": "Fri, 17 Nov 2017 20:38:34 GMT", 
            "expiry": 1510951114, 
            "httponly": false, 
            "name": "__qca", 
            "path": "/", 
            "secure": false, 
            "value": "P0-1147487100-1476995914446"
          }, 
        ...
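
For the cookies, something along these lines might do as a first pass. It is a sketch only: it takes the `pages` list as shown above, relies on the `expiry` field being seconds since the Unix epoch (which matches the `expires` string in the sample), and skips anything already expired. The name `live_cookies` is made up for illustration, and how the result gets into the crawler's cookie store is left open.

```python
def live_cookies(pages, now):
    """Collect unexpired cookies from a list of rendered-page records.

    pages: list of dicts, each with an optional "cookies" list as in the
           sample output above.
    now:   current time in seconds since the Unix epoch.
    """
    cookies = []
    for page in pages:
        for c in page.get("cookies", []):
            expiry = c.get("expiry")
            # Session cookies carry no expiry; keep those, drop expired ones.
            if expiry is not None and expiry <= now:
                continue
            cookies.append({
                "domain": c["domain"],
                "path": c.get("path", "/"),
                "name": c["name"],
                "value": c["value"],
                "secure": bool(c.get("secure", False)),
                "expiry": expiry,
            })
    return cookies
```

With the `__qca` cookie from the sample, a call with `now` earlier than 1510951114 keeps the cookie and a later `now` drops it.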

anjackson commented 5 years ago

Capturing without warcprox is rather brittle, so I no longer intend to pursue this.