scrapinghub / splash

Lightweight, scriptable browser as a service with an HTTP API
BSD 3-Clause "New" or "Revised" License

localStorage doesn't work #221

Open kmike opened 9 years ago

kmike commented 9 years ago

We're trying to enable it by using

settings.setAttribute(QWebSettings.LocalStorageEnabled, True)

but this doesn't work because earlier we enable private mode:

settings.setAttribute(QWebSettings.PrivateBrowsingEnabled, True)

and when QWebKit is in private mode localStorage is disabled.

kfei commented 9 years ago

:+1:

kfei commented 9 years ago

I'm running an AngularJS-based website and want to programmatically take screenshots of it. The website uses local storage to store the authentication token, so it would help a lot if Splash could set different key-value pairs in local storage for each browsing session. Thanks a lot.

dvdbng commented 9 years ago

I tried to see if there was a way to enable it without disabling private browsing. Here is what I found (everything tested only on the qt4 branch): changing the quotas via QWebSettings::setOfflineStorageDefaultQuota(), enabling via QWebSettings::enablePersistentStorage(), or setting a per-origin rule via QWebSecurityOrigin::setDatabaseQuota() doesn't work. The code that checks if it can be used in private mode is here: https://github.com/qtproject/qtwebkit/blob/8b00fdada15a53c7764472435cffe04f22c3522f/Source/WebCore/storage/Storage.cpp#L159

Following the calls in that function, you can see that it could be enabled for any protocol by calling the WebKit function SchemeRegistry::registerURLSchemeAsAllowingDatabaseAccessInPrivateBrowsing. Unfortunately, Qt doesn't offer an API to call it.

Trying to overwrite the localStorage property of window, either using an injected script or QWebFrame::addToJavaScriptWindowObject, fails silently. I can't think of a simple way of enabling localStorage without disabling private browsing.

dvdbng commented 9 years ago

Today I tried again to overwrite the localStorage in the window and succeeded, by using __defineGetter__ .

I wrote a localStorage shim that works in splash: https://gist.github.com/Youwotma/17d9b05fddd5ee4d9aa5

The data is not persisted anywhere, but that can be done: simply add a line in the update() function to save the data somewhere:

document.body.setAttribute('data-local-storage', JSON.stringify(storage));

Then extract the data from the returned HTML, save it somehow and load it again in a new webpage:

splash:runjs([[
var storageData = (<put the extracted json here somehow>);
for(var k in storageData){
    localStorage[k] = storageData[k];
}
]])
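The "extract the data from the returned HTML" step could be sketched on the client side roughly as follows. This is only an illustration in Python, assuming the shim writes its JSON into the data-local-storage attribute of body as in the line above; the names BodyStorageParser, extract_storage and build_restore_js are hypothetical and not part of Splash or the gist.

```python
import json
from html.parser import HTMLParser


class BodyStorageParser(HTMLParser):
    """Collects the data-local-storage attribute from the first <body> tag."""

    def __init__(self):
        super().__init__()
        self.raw_storage = None

    def handle_starttag(self, tag, attrs):
        if tag == "body" and self.raw_storage is None:
            # HTMLParser passes attribute values with entities (&quot; etc.)
            # already unescaped.
            self.raw_storage = dict(attrs).get("data-local-storage")


def extract_storage(html):
    """Return the saved storage as a dict, or {} if the attribute is absent."""
    parser = BodyStorageParser()
    parser.feed(html)
    if parser.raw_storage is None:
        return {}
    return json.loads(parser.raw_storage)


def build_restore_js(storage):
    """Build the JavaScript that refills localStorage in a new page."""
    # json.dumps output (ASCII-escaped by default) is also a valid
    # JavaScript object literal, so it can be embedded directly.
    return (
        "var storageData = %s;\n"
        "for (var k in storageData) {\n"
        "    localStorage[k] = storageData[k];\n"
        "}" % json.dumps(storage)
    )
```

The string returned by build_restore_js would then take the place of the placeholder in the splash:runjs call above.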

EDIT: This way of overriding the localStorage object only works in Splash Qt4.

dvdbng commented 9 years ago

Wonder if it would be interesting to have a repository or folder to keep useful splash scripts.

kmike commented 9 years ago

@Youwotma it'd be nice to have some support for https://luarocks.org/

AlexIzydorczyk commented 9 years ago

Why is private browsing enabled in the first place? Just curious; I didn't know about this.

kmike commented 9 years ago

@AlexIzydorczyk I think it is enabled to prevent cookies, history, etc. from leaking between requests. I don't know how useful it is, and we should disable it to enable localStorage. To do that we should make sure nothing leaks without localStorage (better with --slots=1).

AlexIzydorczyk commented 9 years ago

@kmike, thanks, makes sense. I actually have a use case for this, so I'll try disabling private browsing and enabling local storage and see how it goes.

kmike commented 9 years ago

This ticket is important for https://github.com/scrapinghub/splash/pull/288 because it is not possible to override the window.localStorage object in qt5. So there is a workaround in qt4, but not in qt5.

dvdbng commented 9 years ago

My proposal to fix this:

The local storage path can be configured on a per-QWebPage basis, but the offlineStoragePath, offlineWebApplicationCachePath and IconDatabasePath are global.

By using /tmp/ or a tmpfs we make sure that the files are cleaned up when the container/computer restarts even if splash crashes.

I'm not sure how qtwebkit behaves if two different tabs have different persistent storage paths, since in a normal browser different tabs need to share data between them (but we should prevent this in splash). I'm going to run some tests to see if data is shared between tabs and update here.

Some of the data that we need to prevent from leaking between tabs:

dvdbng commented 9 years ago

Some tests by just disabling private mode and enabling localStorage:

| Mode | JS Cookies | LocalStorage | SessionStorage | History |
|---|---|---|---|---|
| Private mode on, simultaneous sessions | Leaks | :ok: | :ok: | :ok: |
| Private mode on, non simultaneous sessions | :ok: | :ok: | :ok: | :ok: |
| Private mode off, simultaneous sessions | Leaks | Leaks | :ok: | Leaks[1] |
| Private mode off, non simultaneous sessions | Leaks | Leaks | :ok: | Leaks[1] |

[1] - Leaks and it will make a difference when rendering (:visited styles applied to links), but it's not readable from javascript.

kmike commented 9 years ago

@Youwotma great analysis :+1: to clarify: did you check it with qt 5 or with qt4?

dvdbng commented 9 years ago

I checked with qt4

pawelmhm commented 8 years ago

What is the status of this after the update to Qt5? I need to handle a webpage that requires local storage and actually breaks without local storage enabled.

dvdbng commented 8 years ago

There is now the --disable-private-mode flag which will enable local storage, but the local storage data and other data like cookies and browser history will be kept and shared between different splash requests.

pawelmhm commented 8 years ago

Thanks @Youwotma. Does it make sense to add an option to enable local storage per request, or maybe per netloc? Something like splash:enable_local_storage() @kmike @Youwotma

I'm worried about disabling private mode for all requests going via Splash. I only need local storage for one website; the others don't need it and are OK with private_browsing.

pawelmhm commented 8 years ago

Can we close this, @kmike? With recent Splash master, local storage works OK if you disable private_mode.

javierfvargas commented 7 years ago

Guys, I experienced this very same problem. I was trying to render pages in private mode, but they failed because they relied on HTML5 Local Storage. I fixed it by forcing the storage to be enabled; that is, I searched for every line where you disable it according to the settings and forced it to be enabled.

https://github.com/scrapinghub/splash/search?utf8=%E2%9C%93&q=LocalStorageEnabled

Set to: settings.setAttribute(QWebSettings.LocalStorageEnabled, True)

Then, for every request I perform, I have to send JavaScript like this: js_source = "window.localStorage.clear();"

Sending the js_source is not a problem at all, but I have to patch all my instances of splash. Could you add an option to force LocalStorageEnabled to True even in private mode?
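For reference, sending that js_source with each request could look roughly like the sketch below. It only builds the request URL and assumes a Splash instance on localhost:8050 and the render.html endpoint with its url, wait and js_source arguments; the helper name render_url and the default wait value are made up for illustration.

```python
from urllib.parse import urlencode

# Assumption: Splash is reachable on localhost at port 8050.
SPLASH_URL = "http://localhost:8050"

# The cleanup script quoted in the comment above.
CLEAR_STORAGE_JS = "window.localStorage.clear();"


def render_url(page_url, splash_url=SPLASH_URL, wait=0.5):
    """Build a render.html URL that also runs the localStorage cleanup."""
    params = {
        "url": page_url,
        "wait": wait,
        "js_source": CLEAR_STORAGE_JS,
    }
    # urlencode percent-escapes the page URL and the JavaScript snippet.
    return "%s/render.html?%s" % (splash_url, urlencode(params))
```

The resulting URL can be fetched with any HTTP client. This does not remove the need for the patched instances described above; it only shows how the cleanup script travels with each request.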

pawelmhm commented 7 years ago

@javierfvargas there is a flag, private_mode_enabled, that should do what you need: splash.private_mode_enabled = false will enable local storage

javierfvargas commented 7 years ago

@pawelmhm, yes, I know about the flag, but as stated in the documentation, "if you disable private mode then browsing data such as cookies or items kept in localStorage may persist between requests". I don't want that behaviour, but I still want Local Storage to be enabled so that the page can be rendered.

An example of such a page would be this one: https://www.pcuonline2.org/pawtucketcredituniononline_40/uux.aspx#/login

Thanks.

danielnaab commented 7 years ago

FWIW, I just encountered this problem and was able to work around it by creating a Javascript profile that includes the Modernizr localStorage shim.

kennethkalmer commented 7 years ago

@danielnaab which shim did you use? The Modernizr site lists several different shims

danielnaab commented 7 years ago

@kennethkalmer Not sure what you mean - I only see one. Try this link and click "build": https://modernizr.com/download?localstorage-setclasses&q=local

It probably shouldn't matter, but I'm using version 3.3.1.

kennethkalmer commented 7 years ago

@danielnaab Modernizr only provides the test, then you need to decide which polyfill to include if Modernizr.localstorage === false. On the link you sent, when selecting "Local Storage", on the right it has a list of 4 polyfills. I've tried the main one and a ton of variations on different gists with no luck.

What I'm testing now is something I stumbled on in 6b1033a7840 in the tests for splash.private_mode_enabled:

function main(splash)
    -- turn private mode off so that localStorage becomes available
    splash.private_mode_enabled = false
    assert(splash.private_mode_enabled == false)
    assert(splash:go(splash.args.url))
    assert(splash:wait(splash.args.wait))
    local html = splash:html()
    -- switch private mode back on before returning
    splash.private_mode_enabled = true
    return html
end

It is "Good Enough", will need to do more testing to see if things are leaky though.
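A script like this one can be submitted through Splash's execute endpoint. The following is a rough client-side sketch, assuming a Splash instance on localhost:8050 and a JSON POST body whose extra keys (url, wait) show up in splash.args; the helper name build_execute_request is hypothetical, and the request is only built here, not sent.

```python
import json
from urllib.request import Request

# The Lua script from the comment above, without the test-only assertion.
LUA_SOURCE = """
function main(splash)
    splash.private_mode_enabled = false
    assert(splash:go(splash.args.url))
    assert(splash:wait(splash.args.wait))
    local html = splash:html()
    splash.private_mode_enabled = true
    return html
end
"""


def build_execute_request(page_url, wait=1.0,
                          splash_url="http://localhost:8050"):
    """Build (but do not send) a POST request for the execute endpoint."""
    # Keys other than lua_source are passed to the script as splash.args.
    payload = {"lua_source": LUA_SOURCE, "url": page_url, "wait": wait}
    return Request(
        splash_url + "/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it. Whether storage leaks between requests handled this way is exactly the open question raised above, so this is not a verified isolation mechanism.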

kmike commented 7 years ago

I thought it was fixed by upgrading to a more recent WebKit, but it needs to be re-checked.

dalepo commented 6 years ago

I still have problems with different sessions sharing local storage; this makes it impossible to scrape SPA sites concurrently. EDIT: I was running my environment with aquarium, and we were able to fix it by setting one slot per splash instance. I don't know how slots work, but it seems that they share resources somehow.