scrapy-plugins / scrapy-splash

Scrapy+Splash for JavaScript integration
BSD 3-Clause "New" or "Revised" License

how to get redirect urls with scrapy-splash #152

Open 3xp10it opened 6 years ago

3xp10it commented 6 years ago

I don't know how to get the redirect URLs with scrapy-splash; can you help me? E.g. http://xxx.xxx.xxx/1.php will redirect to http://xxx.xxx.xxx/index.php; how can I get http://xxx.xxx.xxx/index.php with scrapy-splash? Below is my code, which returns http://xxx.xxx.xxx/1.php instead of http://xxx.xxx.xxx/index.php:

    def parse_get(self, response):
        item = CrawlerItem()
        item['code'] = response.status
        item['current_url'] = response.url
        # this prints http://xxx.xxx.xxx/1.php, not the redirected URL
        print(response.url)

self.lua_script = """
        function main(splash, args)
          assert(splash:go{splash.args.url,
                           http_method=splash.args.http_method,
                           body=splash.args.body,
                           headers={['Cookie'] = '%s'}})
          assert(splash:wait(0.5))

          splash:on_request(function(request)
              request:set_proxy{
                  host = "%s",
                  port = %d
              }
          end)

          return {cookies = splash:get_cookies(), html = splash:html()}
        end
        """ % (self.cookie, a[0], a[1])

url='http://xxx.xxx.xxx/1.php'
SplashRequest(url, self.parse_get, endpoint='execute', magic_response=True, meta={'handle_httpstatus_all': True}, args={'lua_source': self.lua_script})
lopuhin commented 6 years ago

@3xp10it Splash handles redirects by itself, so the result you are getting is from the page it was redirected to. To get its URL, you can add url = splash:url() to the return values (see the example in the README under "Use a Lua script to get an HTML response with cookies, headers, body and method set to correct values"); after that, response.url should be the URL of the redirected page.
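
For reference, a minimal sketch of the change lopuhin describes, assuming a self-contained spider (the spider name, callback name, and placeholder URL are illustrative): returning a url key from the Lua script lets scrapy-splash, with magic_response enabled, set response.url to the post-redirect URL.

    import scrapy
    from scrapy_splash import SplashRequest

    # Lua script in the spirit of the README example: the 'url' key in the
    # returned table is what scrapy-splash uses (with magic_response=True)
    # to set response.url to the URL Splash ended up on after redirects.
    LUA_SCRIPT = """
    function main(splash, args)
      assert(splash:go(args.url))
      assert(splash:wait(0.5))
      return {
        url = splash:url(),          -- final (post-redirect) URL
        cookies = splash:get_cookies(),
        html = splash:html(),
      }
    end
    """

    class RedirectDemoSpider(scrapy.Spider):
        name = "redirect_demo"  # hypothetical name
        start_urls = ["http://xxx.xxx.xxx/1.php"]

        def start_requests(self):
            for url in self.start_urls:
                yield SplashRequest(url, self.parse_page, endpoint="execute",
                                    magic_response=True,
                                    args={"lua_source": LUA_SCRIPT})

        def parse_page(self, response):
            # With the 'url' key returned above, this should log the
            # redirected URL rather than the original one.
            self.logger.info("final url: %s", response.url)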

3xp10it commented 6 years ago

@lopuhin

In my code, http://192.168.93.139/dvwa/vulnerabilities/xss_r/?name=?name=?name=?name=?name= redirects to http://192.168.93.139/dvwa/vulnerabilities/xss_r/index.php. I tried adding url = splash:url(), but it still fails:

self.lua_script = """
        function main(splash, args)
          assert(splash:go{splash.args.url,
                           http_method=splash.args.http_method,
                           body=splash.args.body,
                           headers={['Cookie'] = '%s'}})
          assert(splash:wait(0.5))

          splash:on_request(function(request)
              request:set_proxy{
                  host = "%s",
                  port = %d
              }
          end)

          return { url = splash:url(), cookies = splash:get_cookies(), html = splash:html() }
        end
        """ % (self.cookie, a[0], a[1])

    def parse_get(self, response):
        input(44444444444444)
        item = CrawlerItem()
        item['code'] = response.status
        item['current_url'] = response.url
        print(response.url)
        input(3333333333)
        if response.url=="http://192.168.93.139/dvwa/vulnerabilities/xss_r/?name=?name=?name=?name=?name=":
            print('fail ....................')
        if response.url=="http://192.168.93.139/dvwa/vulnerabilities/xss_r/index.php":
            print('succeed .................')

Below is the result:

2222222222222
2017-11-29 17:26:26 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://192.168.93.139/robots.txt> (referer: None)
2017-11-29 17:26:26 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://192.168.89.190:8050/robots.txt> (referer: None)
2017-11-29 17:26:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://192.168.93.139/dvwa/vulnerabilities/xss_r/?name=?name=?name=?name=?name= via http://192.168.89.190:8050/execute> (referer: None)
44444444444444
http://192.168.93.139/dvwa/vulnerabilities/xss_r/?name=?name=?name=?name=?name=
3333333333
fail ....................
lopuhin commented 6 years ago

@3xp10it I see, that's not what I expected... Just to be sure, you are not turning off magic_response anywhere, and scrapy_splash.SplashMiddleware is used, right? Also, maybe you could try crawling https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F to check if it works for another domain?
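
A sketch of that check, assuming it is added as a method of the existing spider (the method name is illustrative); it reuses the spider's own Lua script against a URL that is known to redirect:

    from scrapy_splash import SplashRequest

    def request_redirect_check(self):
        # Known-redirecting URL, to rule out a site-specific issue.
        test_url = "https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F"
        # magic_response defaults to True; the point is to confirm nothing
        # (request meta, settings, a custom middleware) disables it here.
        return SplashRequest(test_url, self.parse_get, endpoint="execute",
                             magic_response=True,
                             args={"lua_source": self.lua_script})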

3xp10it commented 6 years ago

@lopuhin The URL you gave me works well; below is the result:

http://httpbin.org/redirect-to?url=http://example.com/                                                                                             
2222222222222                                                                                                                                                         
2017-11-29 17:38:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/robots.txt> (referer: None)                                     
2017-11-29 17:38:42 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://192.168.89.190:8050/robots.txt> (referer: None)                                                                                       
2017-11-29 17:38:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/redirect-to?url=http://example.com/ via http://192.168.89.190:8050/execute> (referer: None)
44444444444444                                                                      
http://example.com/                                                                                                                                                         
3333333333

It's a strange result; can you help me explain it?

lopuhin commented 6 years ago

@3xp10it In this case I would first check that the redirect is handled correctly by Splash using the Splash UI (visit the Splash URL in a browser and try loading the page you want crawled). If the redirect is handled differently by a browser and by Splash, this is a Splash problem. Do you know how the redirect is implemented? If it's done in JavaScript, maybe more wait time will help.
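
If the redirect does turn out to be JavaScript-driven, a hedged way to give it "more wait time" is to poll splash:url() instead of relying on a single fixed wait; the iteration count and the "index.php" marker below are illustrative assumptions.

    # Polls the current URL for up to ~5 seconds before giving up.
    LUA_WAIT_FOR_REDIRECT = """
    function main(splash, args)
      assert(splash:go(args.url))
      for i = 1, 10 do
        splash:wait(0.5)
        -- stop early once the JS redirect has landed on the target page
        if string.find(splash:url(), "index.php", 1, true) then
          break
        end
      end
      return {url = splash:url(), html = splash:html()}
    end
    """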

3xp10it commented 6 years ago

@lopuhin

The Splash UI works well and returns the right URL:

url: "http://192.168.93.139/dvwa/vulnerabilities/xss_r/index.php"

my script used in splash UI is:

function main(splash, args)
  assert(splash:go{args.url,
                   headers={['Cookie'] = 'security=impossible;PHPSESSID=q6ms9hf7sf8kingjhtespmpfu3;security=low'}})
  assert(splash:wait(0.5))
  return {
    url = splash:url(),
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end

That's to say, the redirect is handled differently by a browser and by Splash?

lopuhin commented 6 years ago

@3xp10it It's great that this works in the Splash UI; that means it's not a Splash problem. But to be honest, now I'm not even sure where the problem can be. One more check that might help to debug this would be to print response.data, which should be the dict returned by the Splash script. If the URL there is the redirected one, then the problem is in the scrapy-splash middleware or in how it is used. If the URL there is not what you want, then there is probably some difference in the way Splash is called between the Splash UI and the spider.
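
A sketch of that check, assuming it temporarily replaces the spider's existing parse_get: with endpoint='execute' the response is a SplashJsonResponse, and response.data holds the dict returned by the Lua script.

    def parse_get(self, response):
        # Everything the Lua script returned, as a Python dict.
        self.logger.info("splash returned keys: %s", list(response.data))
        # If this already shows the pre-redirect URL, the problem is on the
        # Splash side of the call; if it shows the redirected URL, the
        # problem is in how scrapy-splash builds the response.
        self.logger.info("url from splash: %s", response.data.get("url"))
        self.logger.info("response.url seen by scrapy: %s", response.url)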

3xp10it commented 6 years ago

@lopuhin The URL in response.data is not the redirected URL I want; below is the result:

2222222222222                                               
2017-11-29 18:08:22 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://192.168.93.139/robots.txt> (referer: None)
2017-11-29 18:08:22 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://192.168.89.190:8050/robots.txt> (referer: None)
2017-11-29 18:08:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://192.168.93.139/dvwa/vulnerabilities/xss_r/?name=?name=?name=?name=?name= via http://192.168.89.190:8050/execute> (referer: None)
44444444444444                                        
http://192.168.93.139/dvwa/vulnerabilities/xss_r/?name=?name=?name=?name=?name=
5555555555555                                               
{'url': 'http://192.168.93.139/dvwa/vulnerabilities/xss_r/?name=?name=?name=?name=?name=', 'html': '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head>\n\t\t<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n\n\t\t<title>Vulnerability: Reflected Cross Site Scripting (XSS) :: Damn Vulnerable Web A

Should I change my middleware setting? Below is my settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    #'crawler.middlewares.ProxyMiddleware': 843,
}
lopuhin commented 6 years ago

@3xp10it The middleware settings you provided look good. Since the URL in response.data is not what you want, the problem must not be in how the response is processed by scrapy-splash, but in how Splash is called. Maybe you can try using exactly the same script that works in the Splash UI in your spider?

3xp10it commented 6 years ago

@lopuhin I used exactly the same script that works in the Splash UI in my spider, and it doesn't work :(. Below is the script:

        function main(splash, args)
          assert(splash:go{splash.args.url,
                           http_method=splash.args.http_method,
                           body=splash.args.body,
                           headers={['Cookie'] = 'security=impossible;PHPSESSID=q6ms9hf7sf8kingjhtespmpfu3;security=low'}})
          assert(splash:wait(2))

          return { url = splash:url(), cookies = splash:get_cookies(), html = splash:html() }
        end
civanescu commented 5 years ago

So, is there any solution for seeing the redirected URL (the new one) inside scrapy-splash?

mutterkorn commented 5 years ago

I have the same problem and would be interested in a solution.

kai11 commented 5 years ago

I'm not looking for this solution myself, but just an idea: if it's possible to fetch HAR data using scrapy-splash, it can be used to figure out all the redirects. https://splash.readthedocs.io/en/stable/api.html#render-har
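
A hedged sketch of that idea (spider and script names are illustrative): return splash:har() from the Lua script and walk the HAR entries in the callback; 3xx entries carry the redirect target in their response.redirectURL field.

    import scrapy
    from scrapy_splash import SplashRequest

    HAR_SCRIPT = """
    function main(splash, args)
      assert(splash:go(args.url))
      assert(splash:wait(0.5))
      return {har = splash:har(), url = splash:url(), html = splash:html()}
    end
    """

    class HarRedirectSpider(scrapy.Spider):
        name = "har_redirects"
        start_urls = ["http://example.com/"]

        def start_requests(self):
            for url in self.start_urls:
                yield SplashRequest(url, self.parse_har, endpoint="execute",
                                    args={"lua_source": HAR_SCRIPT})

        def parse_har(self, response):
            # Each HAR entry describes one request/response pair made by Splash,
            # so the full redirect chain is visible even though Splash follows it.
            for entry in response.data["har"]["log"]["entries"]:
                status = entry["response"]["status"]
                if 300 <= status < 400:
                    self.logger.info("redirect: %s -> %s",
                                     entry["request"]["url"],
                                     entry["response"].get("redirectURL", ""))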

civanescu commented 5 years ago

I have the same problem and would be interested in a solution.

I'm sorry, I lost too much time trying to resolve this and switched to a different solution, pyppeteer. It doesn't generate a full HAR, but for my needs it's enough.

I recommend looking into it as well...

Gallaecio commented 4 years ago

This does not seem specific to scrapy-splash, shall we move this to https://github.com/scrapinghub/splash?