scrapinghub / splash

Lightweight, scriptable browser as a service with an HTTP API
BSD 3-Clause "New" or "Revised" License
4.09k stars, 513 forks

disable browser/webkit caching ? #203

Open nwohaibi opened 9 years ago

nwohaibi commented 9 years ago

Hi, thanks for the wonderful work on Splash.

I just wanted to know if there is any way to disable browser caching of files? Or maybe return all HTTP requests made in har/log/entries, not just the ones with a 200 HTTP status?

Thanks in advance.

kmike commented 9 years ago

Hi @nwohaibi,

Thanks!

> I just wanted to know if there is any way to disable browser caching of files?

There is a way to do it in QWebKit (see http://doc.qt.io/qt-4.8/qnetworkrequest.html#CacheLoadControl-enum), but currently this option is not exposed by Splash. It is a good feature to have, but we need to design a public API for it and implement it.

> Or maybe return all HTTP requests made in har/log/entries, not just the ones with 200 http status ?

HAR entries already contain all HTTP requests, not just the ones with a 200 HTTP status code. When a resource is served from the cache, its record may be missing because it is not requested at all. It should be possible to add such records to the output as well, but I haven't checked the details; the implementation may not be straightforward.
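To check which status codes actually appear in a HAR returned by Splash, one could tally the entries by response status. The following is a minimal sketch, assuming the standard HAR layout (`log.entries[].response.status`); the `count_statuses` helper and the sample data are hypothetical and only illustrate the idea.

```python
from collections import Counter


def count_statuses(har):
    """Count HAR entries by HTTP response status code.

    Assumes the standard HAR layout: har["log"]["entries"] is a list of
    entries, each with a "response" object holding a numeric "status".
    """
    entries = har.get("log", {}).get("entries", [])
    return Counter(entry["response"]["status"] for entry in entries)


# Hypothetical sample data, not real Splash output.
sample_har = {
    "log": {
        "entries": [
            {"request": {"url": "http://example.com/"},
             "response": {"status": 200}},
            {"request": {"url": "http://example.com/old"},
             "response": {"status": 301}},
            {"request": {"url": "http://example.com/missing.png"},
             "response": {"status": 404}},
        ]
    }
}

status_counts = count_statuses(sample_har)
print(dict(status_counts))
```

A resource that was served from the cache without any request being made would simply not show up in this tally, which matches the behaviour described above.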

nwohaibi commented 9 years ago

Thanks for taking the time to clarify :) Since I already have Splash in production, I might tackle the issue by modifying the Cache-Control headers in HTTP responses. This way, WebKit would treat all resources as non-cacheable.
Let me know if I can be of any help, and thanks again.
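The header-rewriting workaround could look roughly like the sketch below. It only shows the rewrite itself; where it would run (for example, in a proxy sitting in front of Splash) is an assumption, the `force_no_cache` helper is hypothetical, and nothing in this thread confirms that WebKit's caches honour these headers in every case.

```python
# Response headers that influence caching; matched case-insensitively.
CACHE_HEADERS = {"cache-control", "pragma", "expires", "etag", "last-modified"}


def force_no_cache(headers):
    """Return a copy of the response headers that forbids caching.

    Existing caching directives and validators are dropped, then explicit
    "do not cache" directives are added. The input dict is not modified.
    """
    cleaned = {
        name: value
        for name, value in headers.items()
        if name.lower() not in CACHE_HEADERS
    }
    cleaned["Cache-Control"] = "no-store, no-cache, must-revalidate, max-age=0"
    cleaned["Pragma"] = "no-cache"
    cleaned["Expires"] = "0"
    return cleaned


# Hypothetical response headers for illustration.
original = {
    "Content-Type": "text/css",
    "cache-control": "public, max-age=31536000",
    "ETag": '"abc123"',
}
rewritten = force_no_cache(original)
print(rewritten)
```

Dropping `ETag` and `Last-Modified` is a design choice here: without validators, the client has nothing to send in a conditional request, so it has to fetch the full resource each time.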

starrify commented 8 years ago

Hi @kmike

> There is a way to do it in QWebKit (see http://doc.qt.io/qt-4.8/qnetworkrequest.html#CacheLoadControl-enum), but currently this option is not exposed by Splash.

I used to believe that, and I even tried to make a PR that way. However, I later realized that it is not the case (confirmed by local testing).

The QNetworkRequest::CacheLoadControl attribute is set on individual request instances, and it is Qt's network manager that decides whether to use a disk cache. However, in the current implementation of Splash, caching in the network managers is not enabled at all (please check https://github.com/scrapinghub/splash/blob/master/splash/network_manager.py#L42).

WebKit also has its own in-memory cache (for scripts, stylesheets, images, etc.), and I believe that is the real cause. Some scenarios require strictly disabling every kind of caching, so I made PR #339 for this.