tasos-py / Search-Engines-Scraper

Search Google, Bing, Yahoo, and other search engines with Python
MIT License

404 SERP links #53

Closed · MarcosFP97 closed this issue 1 year ago

MarcosFP97 commented 1 year ago

Hi,

I am not sure if this is the right place to post this question, so my apologies in advance. I am scraping Google and getting the top-n link results for a collection of queries. Now I am trying to request those links to scrape the resulting pages. However, some of the links are broken and return a 404 error. Isn't the search engine supposed to filter out broken links?

On the other hand, sometimes my request returns a 404 Client Error even though the webpage actually exists. Could anyone give me some guidance on this?

Thank you very much. Best, Marcos

tasos-py commented 1 year ago

Sometimes search engines return 404 links, it's not that uncommon. However, if you're getting 404 links frequently, maybe there is a bug in the code - URL encoding issue or something similar. Do you get 404 links very often? If you repeat the query in a browser, do you still get those links? Are they exactly the same? Could you give me a couple of such queries, to help me reproduce the issue?
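If it helps, here's a quick way to check: run the query and probe the status code of every returned link. This is just a rough sketch, assuming the `Google` engine from this library and the `requests` package (the query string is a placeholder):

    # Rough sketch: search, then probe each result link's HTTP status.
    import requests
    from search_engines import Google

    engine = Google()
    results = engine.search("your query here")  # placeholder query

    for link in results.links():
        try:
            # Some servers reject HEAD; a GET would be more reliable but slower.
            status = requests.head(link, allow_redirects=True, timeout=10).status_code
        except requests.exceptions.RequestException as e:
            status = repr(e)
        print(status, link)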

MarcosFP97 commented 1 year ago

Hi @tasos-py,

Thanks for your quick answer. I am using queries from the TREC 2021 Health Misinformation track; for instance, this is an example of a query that I am using: "Will taking antioxidant supplements treat fertility problems?". I don't know if this level of specificity is more likely to return 404 pages.

The code I am using is the following:

    import requests
    from bs4 import BeautifulSoup

    # url comes from the scraped SERP links
    try:
        r = requests.get(url)
        soup = BeautifulSoup(r.content, "html.parser")

        for script in soup(["script", "style"]):  # remove script/style tags
            script.decompose()

        doc = soup.get_text()
    except requests.exceptions.RequestException as e:
        doc = ""

Do I need to provide a custom header?

tasos-py commented 1 year ago

Ok, I think I see what the problem is now. You're right, you have to set custom headers; it will help in most cases. I ran your query and I didn't get any 404 links, but I got many 403 and some 503 links. All the 403 links turned to 200 when I changed the default User-Agent,

    r = requests.get(url, headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0'})
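If you're requesting many links, you could also set the header once on a `requests.Session` and reuse it. A sketch (the timeout and `raise_for_status` call are just suggestions on top of your code, not something it requires):

    # Sketch: one Session carries the User-Agent for every request.
    import requests

    session = requests.Session()
    session.headers['User-Agent'] = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64; '
                                     'rv:107.0) Gecko/20100101 Firefox/107.0')

    r = session.get(url, timeout=10)
    r.raise_for_status()  # 4xx/5xx become exceptions your except block already catches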

But that didn't fix the 503 links. The reason is that those sites require Js to work properly. Here is the text of one of the sites that returned 503 with requests:

> journals.sagepub.com
>
> Checking if the site connection is secure
>
> Enable JavaScript and cookies to continue
>
> journals.sagepub.com needs to review the security of your connection before proceeding.
>
> Ray ID: 76de35be5bbe0c5f
>
> Performance & security by [Cloudflare](https://www.cloudflare.com/?utm_source=challenge&utm_campaign=j)

Unfortunately, you won't be able to use requests for those sites, unless you're willing to reverse engineer a bunch of Js. However, you could use Selenium or similar clients that run Js.
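For example, something like this with Selenium's Firefox driver (a rough sketch; the headless flag and the fixed wait are just guesses, and Cloudflare may still challenge automated browsers):

    # Sketch: fetch a Js-protected page with a real browser engine.
    import time
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument('-headless')  # run without opening a window

    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        time.sleep(5)  # crude wait for the Js challenge to finish
        html = driver.page_source
    finally:
        driver.quit()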

MarcosFP97 commented 1 year ago

Ok, thank you very much for all your help @tasos-py. Deeply appreciated, best!