scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy
BSD 3-Clause "New" or "Revised" License

Proxy issue #80

Closed: scrapenetwork closed this issue 2 years ago

scrapenetwork commented 2 years ago

Hello

Using any of the examples given in example.py, I am getting 407 errors on a batch of valid proxies.

I ran one example with just Scrapy and one with just requests; both work with the same proxy that Playwright gets a 407 on. Switching browsers or URLs gives the same issue.

I think it is something with SSL, maybe?

Anyway, the only way I know to reproduce this issue is by using the proxy, as all sites return 407.

Strange, because the proxy works everywhere else but with this module. Any help would be appreciated.

elacuesta commented 2 years ago

Are you setting the proxies according to https://github.com/scrapy-plugins/scrapy-playwright#proxy-support? I can't really do much without a (minimal) example and the resulting logs. See also https://github.com/scrapy-plugins/scrapy-playwright/issues/56#issuecomment-1033069738.

scrapenetwork commented 2 years ago

Correct, I am setting the proxies correctly, as other proxies work perfectly fine. I checked each IP to double-check as well. I tried to reproduce the issue, but I am unable to without the proxies that give this error. That set of proxies works perfectly with other modules (i.e. Scrapy, requests, Splash), but with Playwright I get 407 (on Firefox it's NS connection refused).

If you want, I can send you the proxies so you can check it out. I tried to reproduce and monkey-fix it but was unable to.

elacuesta commented 2 years ago

Have you tried the same proxies with plain playwright-python? It is very hard to debug an issue without a code sample and/or execution logs. If you think there is a bug, please supply a minimal, reproducible example (emphasis on minimal). Also, to make sure you're facing an issue within the confines of this package, the example should work correctly by disabling scrapy-playwright.

scrapenetwork commented 2 years ago

Looks like you just need my proxy for an example, as any example will produce the same error with the proxy. If you want, I can DM it to you privately and you can confirm it is a strange SSL issue; other modules work fine with the same proxy.

elacuesta commented 2 years ago

I do not take private inquiries or requests, this is a public issue tracker and I want conversations to remain public. If you want help from the community, you should provide steps to reproduce the issue, only excluding or redacting parts because of privacy concerns or financial restrictions (paid subscriptions, for instance). So far the only thing I've learned is that you're getting 407 responses.
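For context on the status code itself: 407 is "Proxy Authentication Required", meaning the proxy did not receive credentials it accepts. A minimal sketch of the header value a client would normally send, assuming the proxy uses the Basic scheme (an assumption, nothing in this thread confirms it); the function name and the credentials are placeholders for illustration only:

```python
import base64


def basic_proxy_authorization(username: str, password: str) -> str:
    """Build the value of a Basic Proxy-Authorization header."""
    # Basic auth is "username:password", base64-encoded
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


if __name__ == "__main__":
    # Placeholder credentials, not a real proxy account
    print("Proxy-Authorization:", basic_proxy_authorization("user", "pass"))
```

If a proxy answers 407 even though credentials were configured, comparing what the working clients send against this value could help narrow down where the credentials get lost.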

scrapenetwork commented 2 years ago

Sadly, with this error it does not matter which example I draw up; what is needed, it seems, is the proxies that produce this error. Any example I send will work, as it looks like the problem is limited to a certain set of proxies. I can't determine the difference between the proxies myself; on my end they all look the same, and other modules have no issues using the ones that give the 407 / network denied errors with this module.

Apologies, as it is something I cannot draw up an example for. I can send you just the proxy, and you can plug it into any example and it will produce the errors. I just don't want to post the private proxy publicly, is all. I will continue on my end, and if I get any results I will come back and share the feedback.
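One way to share a reproducible example and logs without exposing a private proxy is to mask the sensitive parts before posting. A rough sketch, assuming the proxy is written as a URL with a scheme (for example `http://user:secret@host:port`); `redact_proxy` is a hypothetical helper, not part of scrapy-playwright:

```python
from urllib.parse import urlsplit


def redact_proxy(proxy_url: str) -> str:
    """Mask credentials and host of a proxy URL, keeping scheme and port."""
    parts = urlsplit(proxy_url)
    # Only emit a credentials placeholder if the URL actually carried some
    creds = "***:***@" if parts.username or parts.password else ""
    port = f":{parts.port}" if parts.port else ""
    return f"{parts.scheme}://{creds}***{port}"


if __name__ == "__main__":
    # Made-up proxy address for illustration
    print(redact_proxy("http://user:secret@10.0.0.1:8080"))
```

The scheme, the port, and whether credentials were present stay visible, which is often enough for someone else to spot a configuration problem.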

elacuesta commented 2 years ago

Well, in that case I'm afraid I can't do much more. I'd suggest contacting the proxy provider; they might be able to assist you.

For completeness, I'm including the code I'm using to try proxies with authentication.

from scrapy import Spider, Request

class ProxySpider(Spider):
    name = "proxy"
    custom_settings = {
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "proxy": {
                "server": "***",
                "username": "***",
                "password": "***",
            },
        }
    }

    def start_requests(self):
        yield Request(url="http://httpbin.org/get", meta={"playwright": True})

    def parse(self, response):
        print(response.text)

# Same proxy check with plain playwright-python (no Scrapy involved)
import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        for browser_type in [p.firefox, p.chromium]:
            browser = await browser_type.launch(
                proxy={
                    "server": "***",
                    "username": "***",
                    "password": "***",
                },
            )
            context = await browser.new_context(ignore_https_errors=True)
            page = await context.new_page()
            await page.goto("https://httpbin.org/ip")
            print(await page.content())
            await browser.close()

if __name__ == "__main__":
    asyncio.run(main())
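Both snippets above pass the proxy as separate `server`, `username` and `password` fields. If a proxy list is stored as URLs with embedded credentials, a helper along these lines could split them into that shape. This is only a sketch: `to_playwright_proxy` is a hypothetical name, the URL is made up, and whether the URL form is related to the 407 responses in this thread is not known.

```python
from urllib.parse import unquote, urlsplit


def to_playwright_proxy(proxy_url: str) -> dict:
    """Split scheme://user:pass@host:port into server/username/password fields."""
    parts = urlsplit(proxy_url)
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port:
        server += f":{parts.port}"
    proxy = {"server": server}
    # Credentials may be percent-encoded inside a URL; decode them for the dict
    if parts.username:
        proxy["username"] = unquote(parts.username)
    if parts.password:
        proxy["password"] = unquote(parts.password)
    return proxy


if __name__ == "__main__":
    # Made-up proxy URL for illustration
    print(to_playwright_proxy("http://user:p%40ss@proxy.example.com:8080"))
```

The resulting dict can be used as the `proxy` value in `PLAYWRIGHT_LAUNCH_OPTIONS` or passed to `browser_type.launch(proxy=...)`, as in the examples above.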