scrapy-plugins / scrapy-splash

Scrapy+Splash for JavaScript integration
BSD 3-Clause "New" or "Revised" License

Lua script does not work when using dynamic proxy #88

Closed playwolf719 closed 7 years ago

playwolf719 commented 7 years ago

[Image: qq 20161018235532]

In the first picture, I make the request through a dynamic proxy and the endpoint is execute.

[Image: qq 20161018235755]

In the second picture, I make the request through a dynamic proxy and the endpoint is render.html.

I'm pretty sure that my proxy is OK, so why does the Lua script not work when using a dynamic proxy? I hope you can help me, @kmike.

This is my code

for x in xrange(0,3):
            try:
                script = """
                function main(splash)

                    assert(splash:go(splash.args.url) )

                    splash:wait(2)

                    return splash:evaljs("document.title")
                    --return splash:evaljs([[
                        -- document.querySelector('#sf-item-list-data').innerText;
                    -- ]]);
                    -- return {html=splash:html()}

                end
                """
                agent = random.choice(agents)
                time.sleep(1)
                authHeader = self.getAuthHeader()
                headers={
                    "User-Agent":agent,
                    "Proxy-Authorization":authHeader,
                    # "Referer":"http://www.bttt99.com/",
                }
                splash_args = {
                    'wait': 1,
                    "http_method":"GET",
                    "images":0,
                    "render_all":1,
                    "headers":headers,
                    'lua_source': script,
                    "proxy":"http://101.200.153.236:8123",
                }
                yield SplashRequest(self.house_pc_index_url+"&page="+str(x+1), self.parse_result, endpoint='execute',
                                    args=splash_args,dont_filter=True)

kmike commented 7 years ago

@playwolf719 the /execute endpoint doesn't have special handling of the 'headers' parameter (see http://splash.readthedocs.io/en/stable/api.html#execute). That's probably the reason the proxy doesn't work: the auth is not correct. You need to handle it in your script, e.g. using splash:set_custom_headers.
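
A minimal sketch of that suggestion, assuming the headers are still passed in the request args and the script applies them itself before loading the page. The names LUA_SCRIPT and build_execute_args are hypothetical helpers for illustration, not part of scrapy-splash:

```python
# The Lua script applies the headers itself, because /execute does not
# process the 'headers' argument automatically.
LUA_SCRIPT = """
function main(splash)
    splash:set_custom_headers(splash.args.headers)
    assert(splash:go(splash.args.url))
    splash:wait(2)
    return {title = splash:evaljs("document.title")}
end
"""


def build_execute_args(user_agent, proxy_auth, proxy_url):
    """Build the args dict for a request to the execute endpoint.

    Hypothetical helper: the headers end up in splash.args.headers,
    where the Lua script above reads them.
    """
    headers = {
        "User-Agent": user_agent,
        "Proxy-Authorization": proxy_auth,
    }
    return {
        "lua_source": LUA_SCRIPT,
        "headers": headers,
        "proxy": proxy_url,
    }
```

The resulting dict would be passed the same way as in the question, e.g. SplashRequest(url, self.parse_result, endpoint='execute', args=build_execute_args(agent, authHeader, "http://101.200.153.236:8123"), dont_filter=True).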

kmike commented 7 years ago

By the way, instead of time.sleep(1) it can be better to use the DOWNLOAD_DELAY Scrapy option.
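
A sketch of what that could look like in the project's settings.py. The value of 1 second mirrors the time.sleep(1) in the question. RANDOMIZE_DOWNLOAD_DELAY is an optional Scrapy setting that is not mentioned in this thread; it is an assumption here that a fixed delay is wanted:

```python
# settings.py (fragment)

# Wait this many seconds between consecutive requests to the same site,
# without blocking the whole crawler the way time.sleep() does.
DOWNLOAD_DELAY = 1

# Scrapy randomizes the delay by default (0.5x to 1.5x of DOWNLOAD_DELAY).
# Assumption: disable that to get a fixed delay, like time.sleep(1).
RANDOMIZE_DOWNLOAD_DELAY = False
```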

playwolf719 commented 7 years ago

@kmike Thanks!!! It works. I have to use time.sleep because my proxy auth is time-dependent, but thanks anyway. I have another question: the Docker Splash container is not that reliable and sometimes crashes. Do you have any suggestions?

kmike commented 7 years ago

@playwolf719 glad to see it helped!

I think for production it makes sense to run multiple Splash containers and use a load balancer, so that if one container crashes it can be restarted without affecting clients. You can implement it yourself, use https://github.com/TeamHG-Memex/aquarium, or use a hosted Splash instance that takes care of that (like Scrapinghub's). See also: http://splash.readthedocs.io/en/stable/faq.html#how-to-run-splash-in-production

playwolf719 commented 7 years ago

@kmike Thx again!

playwolf719 commented 7 years ago

@kmike How do I get the content that the Lua script returns?

function main(splash)
                    splash:init_cookies(splash.args.my_cookie)
                    assert(splash:go{
                        splash.args.url,
                        http_method=splash.args.http_method,
                        headers=splash.args.headers,
                    })
                    splash:wait(1)
                    -- return splash:evaljs("document.title")
                    --return splash:evaljs([[
                        -- document.querySelector('#sf-item-list-data').innerText;
                    -- ]]);
                    --return {html=splash:html()}
                    local title = splash:evaljs("document.title")
                    return {title=title}
                    -- return {test=splash.args.my_cookie}

                end

kmike commented 7 years ago

@playwolf719 see https://github.com/scrapy-plugins/scrapy-splash#responses: response.data['title']
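
A minimal sketch of a callback that reads the value, assuming the script returns a table such as {title=title} so that scrapy-splash exposes the decoded JSON as response.data. It is written as a plain function here; inside a spider it would be a method taking self as the first argument:

```python
def parse_result(response):
    # response.data holds the decoded JSON result of the Lua script,
    # so the 'title' key of the returned table is available directly.
    title = response.data["title"]
    yield {"title": title, "url": response.url}
```

If the script returned a plain string instead of a table (as in the first script in this thread), there would be no JSON to decode, and the value would have to be read from the response body instead.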