scrapinghub / splash

Lightweight, scriptable browser as a service with an HTTP API
BSD 3-Clause "New" or "Revised" License

Splash loses fragment part of URL (# hash sign part) after redirect #470

pawelmhm closed 5 years ago

pawelmhm commented 8 years ago

Today I stumbled on a bug caused by somewhat unusual behavior when a redirect involves a URL containing a "#" hash sign.

I have a URL, http://groceries.asda.com/asda-webstore/landing/home.shtml#search/ibuprofen/1/relevance_desc . When you request this URL without any cookies, the site responds with a redirect to https://groceries.asda.com/asda-webstore/landing/home.shtml (the same URL but over https; note that you need to view the traffic with a mitm proxy, as dev tools seem to hide the redirect from the first URL). After the redirect, my desktop browser (Chrome/51.0.2704.84) keeps the hash part of the URL. Splash seems to discard the hash part after the redirect, so the site is not rendered properly.
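The browser behavior described above matches RFC 7231, section 7.1.2: when the Location header has no fragment, the user agent treats the redirect target as inheriting the fragment of the original request URL. Below is a minimal Python sketch of that rule. The function name resolve_redirect is hypothetical, and this is not code from Splash or QtWebKit; it only illustrates the URL that Splash would be expected to report.

```python
from urllib.parse import urldefrag, urljoin


def resolve_redirect(request_url: str, location: str) -> str:
    """Return the URL a browser would end up on after a 3xx response.

    Hypothetical helper for illustration only; not part of Splash.
    """
    # Resolve a possibly relative Location header against the request URL.
    target = urljoin(request_url, location)
    # If the redirect target has no fragment of its own, inherit the
    # fragment of the original request (RFC 7231, section 7.1.2).
    if not urldefrag(target).fragment:
        original_fragment = urldefrag(request_url).fragment
        if original_fragment:
            target = f"{target}#{original_fragment}"
    return target


if __name__ == "__main__":
    print(resolve_redirect(
        "http://groceries.asda.com/asda-webstore/landing/home.shtml"
        "#search/ibuprofen/1/relevance_desc",
        "https://groceries.asda.com/asda-webstore/landing/home.shtml",
    ))
```

A fragment that is present in the Location header takes precedence, so the sketch only falls back to the original fragment when the target has none.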

I added tests on a branch, see https://github.com/scrapinghub/splash/commit/cdae7c490210b83a41031decae6d8312566bee57 . To reproduce, you simply need to:

# start mockserver
> python3 splash/tests/mockserver.py

Create the following Lua script:

function main(splash)
    assert(splash:go(splash.args.url))
    return splash:url()
end

Visit http://localhost:8998/redirect-hash#something in your browser. You will see that the hash is still in the URL shown in the address bar. This confirms that the test server is running and that your browser keeps the URL with the hash even after the redirect.

Now check Splash:

http://localhost:8050/execute?lua_source=function+main%28splash%29%0A++++assert%28splash%3Ago%28splash.args.url%29%29%0A++++return+splash%3Aurl%28%29%0Aend&url=http%3A%2F%2Flocalhost%3A8998%2Fredirect-hash%23something-bad

This returns http://localhost:8998/redirect-hash instead of http://localhost:8998/redirect-hash#something-bad.
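For anyone repeating the check, the long /execute request URL can be built rather than typed by hand. This is a small sketch using only the Python standard library; it assumes Splash listens on localhost:8050 as in the request above, and the helper name build_execute_url is made up for illustration.

```python
from urllib.parse import urlencode

# The Lua script from the reproduction steps.
LUA_SOURCE = """function main(splash)
    assert(splash:go(splash.args.url))
    return splash:url()
end"""


def build_execute_url(target_url: str,
                      splash_base: str = "http://localhost:8050") -> str:
    """Build a Splash /execute URL (hypothetical helper, not a Splash API)."""
    # urlencode percent-encodes the "#" in target_url as %23, so the
    # fragment reaches Splash as part of the url argument instead of
    # being treated as the fragment of the request URL itself.
    query = urlencode({"lua_source": LUA_SOURCE, "url": target_url})
    return f"{splash_base}/execute?{query}"


if __name__ == "__main__":
    print(build_execute_url("http://localhost:8998/redirect-hash#something-bad"))
```

Encoding the "#" matters here: left unencoded, the client would strip the fragment before the request ever reached Splash, which would hide the bug being reported.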

I found a workaround for this (luckily I can avoid the redirect by using the https URL and sending an extra cookie), but I'm opening this issue for future reference in case someone else stumbles on it. I see that our QWebPage has an extension method with an extensive comment about redirects. I'm not sure whether or how it matters here, but it may be a path for investigation.

kmike commented 6 years ago

It could be fixed when https://github.com/annulen/webkit/issues/623 is done.

kmike commented 5 years ago

This should be fixed by https://github.com/scrapinghub/splash/pull/928 (test works locally).