kevinzg / facebook-scraper

Scrape Facebook public pages without an API key
MIT License
2.27k stars 609 forks

requests.exceptions.TooManyRedirects: Exceeded 30 redirects. #879

Open Drzhivago264 opened 1 year ago

Drzhivago264 commented 1 year ago

It seems that Facebook manages to move some content into a redirect loop.

requests.exceptions.TooManyRedirects: Exceeded 30 redirects.

I tried turning redirects off in the session (setting it to False), but facebook-scraper then throws this error: lxml.etree.ParserError: Document is empty
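A possible explanation for the empty-document error, offered as an assumption rather than something confirmed in this thread: with redirects disabled, the response you get back is the 3xx redirect itself, which usually carries little or no body, so the HTML parser has nothing to work with. A minimal sketch of a guard that checks for this before parsing follows. `Response` and `has_parseable_body` are hypothetical names used only for illustration; they are not part of facebook-scraper or requests.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """Hypothetical stand-in for an HTTP response (status code plus body text)."""
    status_code: int
    text: str


def has_parseable_body(response: Response) -> bool:
    """Return True only if the response is worth handing to an HTML parser."""
    # A 3xx response is a redirect; when redirects are not followed,
    # its body is typically empty or a short stub.
    if 300 <= response.status_code < 400:
        return False
    # An empty or whitespace-only body cannot be parsed into a document.
    return bool(response.text.strip())
```

A caller could check `has_parseable_body(response)` and skip the post when it returns False instead of letting the parser raise.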

How can I catch this error to skip the corrupted content? You can test with post ID 2397025053807303 in the group UkrainianAvstralia.

Update: Dumb fixes for this. Lines 866-948: `
        try:
            if kwargs.get("post"):
                kwargs.pop("post")
                response = self.session.post(url=url, **kwargs)
            else:
                response = self.session.get(url=url, **self.requests_kwargs, **kwargs)
            DEBUG = False
            if DEBUG:
                for filename in os.listdir("."):
                    if filename.endswith(".html") and filename.replace(".html", "") in url:
                        logger.debug(f"Replacing {url} content with {filename}")
                        with open(filename) as f:
                            response.html.html = f.read()
            response.html.html = response.html.html.replace('', '')
            response.raise_for_status()
            self.check_locale(response)

            # Special handling for video posts that redirect to /watch/
            if response.url == "https://m.facebook.com/watch/?ref=watch_permalink":
                post_url = re.search(r"\d+", url).group()
                if post_url:
                    url = utils.urljoin(
                        FB_MOBILE_BASE_URL,
                        f"story.php?story_fbid={post_url}&id=1&m_entstream_source=timeline",
                    )
                    post = {"original_request_url": post_url, "post_url": url}
                    logger.debug(f"Requesting page from: {url}")
                    response = self.get(url)
            if "/watch/" in response.url:
                video_id = parse_qs(urlparse(response.url).query).get("v")[0]
                url = f"story.php?story_fbid={video_id}&id={video_id}&m_entstream_source=video_home&player_suborigin=entry_point&player_format=permalink"
                logger.debug(f"Fetching {url}")
                response = self.get(url)

            if "cookie/consent-page" in response.url:
                response = self.submit_form(response)
            if (
                response.url.startswith(FB_MOBILE_BASE_URL)
                and not response.html.find("script", first=True)
                and "script" not in response.html.html
                and self.session.cookies.get("noscript") != "1"
            ):
                warnings.warn(
                    f"Facebook served mbasic/noscript content unexpectedly on {response.url}"
                )
            if response.html.find("h1,h2", containing="Unsupported Browser"):
                warnings.warn(f"Facebook says 'Unsupported Browser'")
            title = response.html.find("title", first=True)
            not_found_titles = ["page not found", "content not found"]
            temp_ban_titles = [
                "you can't use this feature at the moment",
                "you can't use this feature right now",
                "you’re temporarily blocked",
            ]
            if "checkpoint" in response.url:
                if response.html.find("h1", containing="We suspended your account"):
                    raise exceptions.AccountDisabled("Your Account Has Been Disabled")
            if title:
                if title.text.lower() in not_found_titles:
                    raise exceptions.NotFound(title.text)
                elif title.text.lower() == "error":
                    raise exceptions.UnexpectedResponse("Your request couldn't be processed")
                elif title.text.lower() in temp_ban_titles:
                    raise exceptions.TemporarilyBanned(title.text)
                elif ">your account has been disabled<" in response.html.html.lower():
                    raise exceptions.AccountDisabled("Your Account Has Been Disabled")
                elif (
                    ">We saw unusual activity on your account. This may mean that someone has used your account without your knowledge.<"
                    in response.html.html
                ):
                    raise exceptions.AccountDisabled("Your Account Has Been Locked")
                elif (
                    title.text == "Log in to Facebook | Facebook"
                    or response.url.startswith(utils.urljoin(FB_MOBILE_BASE_URL, "login"))
                    or response.url.startswith(utils.urljoin(FB_W3_BASE_URL, "login"))
                ):
                    raise exceptions.LoginRequired(
                        "A login (cookies) is required to see this page"
                    )
            return response
        except:
            pass`

Lines 1120-1126 of facebook_scraper.py:

    try:
        post = extract_post_fn(post_element, options=options, request_fn=self.get)
        if remove_source:
            post.pop('source', None)
        yield post
    except:
        pass

Ahmedmagdy31 commented 1 year ago

I have exactly the same issue. Can you please share the solution if you found one?

bipsen commented 1 year ago

Also experiencing this.

Drzhivago264 commented 1 year ago

You can add two try/except blocks (with `pass` in the except branch) at lines 866-948 and 1120-1126 in facebook_scraper.py. The corrupted content is then skipped over, and from that content you only get the data without reactors and reactions. I think Facebook changed something about reactors and reactions.
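A narrower alternative to a bare try/except with `pass` is to skip only the exception types you expect and log what was skipped, without editing the library. The sketch below is one way to do that under some assumptions: `fetch_skipping_errors` is a hypothetical helper, and `fetch` stands for whatever per-post call you use. It wraps each call separately because a Python generator that raises is finished and cannot be resumed, so wrapping a single loop over one generator would stop at the first bad post.

```python
import logging
from typing import Callable, Iterable, Iterator, Tuple, Type, TypeVar

logger = logging.getLogger(__name__)

T = TypeVar("T")
R = TypeVar("R")


def fetch_skipping_errors(
    items: Iterable[T],
    fetch: Callable[[T], R],
    skip_on: Tuple[Type[BaseException], ...],
) -> Iterator[R]:
    """Yield fetch(item) for each item, skipping items that raise one of skip_on."""
    for item in items:
        try:
            result = fetch(item)
        except skip_on as exc:
            # Only the listed exception types are swallowed; anything else
            # still propagates so real bugs are not hidden.
            logger.warning("Skipping %r: %r", item, exc)
            continue
        yield result
```

For the errors in this thread, `skip_on` would presumably be `(requests.exceptions.TooManyRedirects, lxml.etree.ParserError)`, with `items` being the post IDs or URLs you want to scrape one at a time.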

rodigu commented 1 year ago

Are you running the script on Windows?