moda20 / facebook_page_scraper

Scrapes Facebook's pages front end with no limitations and provides a feature to turn data into structured JSON or CSV
https://pypi.org/project/facebook-page-scraper/
MIT License

scrap_to_json() returns error #1

Open dinonovak opened 5 months ago

dinonovak commented 5 months ago

Hi, unfortunately I am getting the following error:

    [WDM] - Driver [/Users/dino/.wdm/drivers/geckodriver/macos/v0.34.0/geckodriver] found in cache
    new layout loaded
    2024-06-14 15:55:50,841 - facebook_page_scraper.driver_utilities - ERROR - Error at close_modern_layout_signup_modal: Message: Element is not clickable at point (892,121) because another element obscures it
    Stacktrace:
    RemoteError@chrome://remote/content/shared/RemoteError.jsm:12:1
    WebDriverError@chrome://remote/content/shared/webdriver/Errors.jsm:192:5
    ElementClickInterceptedError@chrome://remote/content/shared/webdriver/Errors.jsm:291:5
    webdriverClickElement@chrome://remote/content/marionette/interaction.js:166:11
    interaction.clickElement@chrome://remote/content/marionette/interaction.js:125:11
    clickElement@chrome://remote/content/marionette/actors/MarionetteCommandsChild.jsm:204:29
    receiveMessage@chrome://remote/content/marionette/actors/MarionetteCommandsChild.jsm:92:31

    Traceback (most recent call last):
      File "/Users/dino/Codings/python/FacebookRSSInformer/.venv/lib/python3.11/site-packages/facebook_page_scraper/driver_utilities.py", line 74, in __close_modern_layout_signup_modal
        close_button.click()
      File "/Users/dino/Codings/python/FacebookRSSInformer/.venv/lib/python3.11/site-packages/selenium/webdriver/remote/webelement.py", line 81, in click
        self._execute(Command.CLICK_ELEMENT)
      File "/Users/dino/Codings/python/FacebookRSSInformer/.venv/lib/python3.11/site-packages/selenium/webdriver/remote/webelement.py", line 710, in _execute
        return self._parent.execute(command, params)
      File "/Users/dino/Codings/python/FacebookRSSInformer/.venv/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 424, in execute
        self.error_handler.check_response(response)
      File "/Users/dino/Codings/python/FacebookRSSInformer/.venv/lib/python3.11/site-packages/selenium/webdriver/remote/errorhandler.py", line 247, in check_response
        raise exception_class(message, screen, stacktrace)
    selenium.common.exceptions.ElementClickInterceptedException: Message: Element is not clickable at point (892,121) because another element obscures it
    Stacktrace:
    RemoteError@chrome://remote/content/shared/RemoteError.jsm:12:1
    WebDriverError@chrome://remote/content/shared/webdriver/Errors.jsm:192:5
    ElementClickInterceptedError@chrome://remote/content/shared/webdriver/Errors.jsm:291:5
    webdriverClickElement@chrome://remote/content/marionette/interaction.js:166:11
    interaction.clickElement@chrome://remote/content/marionette/interaction.js:125:11
    clickElement@chrome://remote/content/marionette/actors/MarionetteCommandsChild.jsm:204:29
    receiveMessage@chrome://remote/content/marionette/actors/MarionetteCommandsChild.jsm:92:31

    all_posts length: 3
    no post_url, skipping
    no post_url, skipping
    no post_url, skipping
    all_posts length: 3
    all_posts length: 7
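For context, `ElementClickInterceptedException` means the target element exists but another element (here the sign-up modal overlay) covers it at the click point. A common workaround, independent of this library, is to retry the click a few times and fall back to an alternative click strategy. A minimal stdlib-only sketch of that pattern (the helper name `click_with_retry` is hypothetical, not part of facebook-page-scraper or Selenium):

```python
import time

def click_with_retry(do_click, attempts=3, delay=0.5, fallback=None):
    """Retry a click that an overlay may intercept.

    do_click is any zero-argument callable (e.g. element.click); fallback
    is tried once if every normal attempt fails (e.g. a JavaScript click
    via driver.execute_script, which is not blocked by overlapping elements).
    """
    last_exc = None
    for _ in range(attempts):
        try:
            return do_click()
        except Exception as exc:  # ElementClickInterceptedException in Selenium
            last_exc = exc
            time.sleep(delay)
    if fallback is not None:
        return fallback()
    raise last_exc
```

In real Selenium code you would pass `close_button.click` as `do_click` and something like `lambda: driver.execute_script("arguments[0].click()", close_button)` as the fallback.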

alexgower commented 5 months ago

Same here

lullu57 commented 5 months ago

I think Facebook has changed their layout. I have managed to fix this error, but I am running into many more. Below are the fixes I have applied:

Accept cookies before trying to log in:

def scrap_to_json(self, minimum_timestamp = None):
        # call the __start_driver and override class member __driver to webdriver's instance
        self.__start_driver()
        starting_time = time.time()
        # navigate to URL
        self.__driver.get(self.URL)
        # only login if username is provided
        Finder._Finder__accept_cookies(self.__driver)
        if self.username is not None:
            Finder._Finder__login(self.__driver, self.username, self.password)

        self.__layout = Finder._Finder__detect_ui(self.__driver)
        # sometimes we get popup that says "your request couldn't be processed", however
        # posts are loading in background if popup is closed, so call this method in case if it pops up.
        Utilities._Utilities__close_error_popup(self.__driver)
        # wait for post to load
        elements_have_loaded = Utilities._Utilities__wait_for_element_to_appear(
            self.__driver, self.__layout, self.timeout)
        # scroll down to bottom most
        Utilities._Utilities__scroll_down(self.__driver, self.__layout)
        self.__handle_popup(self.__layout)
        # timestamp limitation for scraping posts
        timestamp_edge_hit = False
        while (not timestamp_edge_hit) and (len(self.__data_dict) < self.posts_count) and elements_have_loaded:
            self.__handle_popup(self.__layout)
            # self.__find_elements(name)
            timestamp_edge_hit = self.__find_elements(minimum_timestamp)
            current_time = time.time()
            if self.__check_timeout(starting_time, current_time) is True:
                logger.setLevel(logging.INFO)
                logger.info('Timeout...')
                break
            Utilities._Utilities__scroll_down(
                self.__driver, self.__layout)  # scroll down
        # close the browser window after job is done.
        Utilities._Utilities__close_driver(self.__driver)
        # dict trimming, might happen that we find more posts than it was asked, so just trim it
        self.__data_dict = dict(list(self.__data_dict.items())[
                                0:int(self.posts_count)])

        return json.dumps(self.__data_dict, ensure_ascii=False)
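The awkward `Finder._Finder__accept_cookies` spelling in the patch above is forced by Python's name mangling: inside a class body, attributes with two leading underscores are rewritten to `_ClassName__attr`, so external callers must use the mangled name. A minimal, self-contained illustration (this toy `Finder` is not the library's class):

```python
class Finder:
    # The double leading underscore marks this as "private" and
    # triggers Python's name mangling inside the class body.
    @staticmethod
    def __accept_cookies(driver):
        return "accepted"

# Outside the class, the method is only reachable under its mangled
# name _Finder__accept_cookies, the exact spelling the patch uses.
result = Finder._Finder__accept_cookies(None)
```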

Change cookie selector:

def __accept_cookies(driver):
        try:
            # Use JavaScript to find the button containing the text "Allow all cookies"
            buttons = driver.execute_script("""
                return Array.from(document.querySelectorAll('div[role="none"] span'))
                            .filter(span => span.textContent.includes('Allow all cookies'));
            """)

            # Check if any elements were found
            if buttons:
                ActionChains(driver).move_to_element(buttons[-1]).click().perform()  # Click the last one if multiple are found
            else:
                logger.info("No 'Allow all cookies' button found.")
        except NoSuchElementException:
            logger.info("No such element exception occurred.")
        except IndexError:
            logger.info("Index error occurred.")
        except Exception as ex:
            logger.exception("Error at accept_cookies: {}".format(ex))
            sys.exit(1)
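The `execute_script` call above returns every `span` whose `textContent` contains "Allow all cookies"; the predicate itself is just a substring match. A pure-Python analog with made-up data (no browser involved), showing why the patch then clicks `buttons[-1]`:

```python
# Hypothetical stand-ins for the spans the JavaScript query would return.
spans = [
    {"text": "Decline optional cookies"},
    {"text": "Allow all cookies"},
    {"text": "Allow all cookies"},
]

def matching_spans(spans, label="Allow all cookies"):
    # Same predicate as the JS filter: keep spans whose text contains the label.
    return [s for s in spans if label in s["text"]]

# The patch clicks the last match (buttons[-1]) when several are found,
# since the visible consent button tends to be the last one in the DOM.
target = matching_spans(spans)[-1]
```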

Change Login selector:

def __login(driver, username, password):
        try:

            wait = WebDriverWait(driver, 4)  # considering that the elements might load a bit slow

            # NOTE this closes the login modal pop-up if you choose to not login above
            try:
                element = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '[aria-label="Close"]')))
                element.click()  # Click the element
            except Exception as ex:
                logger.debug("no pop-up")

            time.sleep(1)
            #target username
            username_element = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='email']")))
            password_element = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='pass']")))

            #enter username and password
            username_element.clear()
            username_element.send_keys(str(username))
            password_element.clear()
            password_element.send_keys(str(password))

            # target the login button and click it
            try:
                # Try the accessible login button first
                WebDriverWait(driver, 2).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "div[role='button'][aria-label='Accessible login button']"))).click()
            except TimeoutException:
                # If it is not found within 2 seconds, click the first <button> on the page
                WebDriverWait(driver, 2).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button"))).click()
        except (NoSuchElementException, IndexError):
            pass
        except Exception as ex:
            logger.exception("Error at login: {}".format(ex))
            # sys.exit(1)
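The two-step click in `__login` (a specific aria-label selector first, then any `button`) is a selector-fallback pattern that generalizes: try candidate actions in order and keep the first one that works. A stdlib-only sketch of the pattern (the helper name `first_that_works` is made up, not part of the library):

```python
def first_that_works(actions):
    """Run zero-argument callables in order and return the first result
    that does not raise; re-raise the last error if all of them fail.
    In the Selenium patch the actions would be the two WebDriverWait
    click attempts and the exception a TimeoutException.
    """
    last_exc = None
    for act in actions:
        try:
            return act()
        except Exception as exc:
            last_exc = exc
    raise last_exc
```

Keeping the candidates ordered from most to least specific means the fragile generic selector (`button`) is only ever used when the stable one disappears after a layout change.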

@moda20 if the issue is reproducible, let me know so that I can create a PR

lullu57 commented 4 months ago

I have made it work by providing a URL, and I have also fixed some other fields such as name and image, and made it not wait for the timeout when there are no posts (because of my needs). Feel free to have a look here and see what can be implemented in the original:

https://github.com/lullu57/facebook_page_scraper

@moda20 @shaikhsajid1111

edit: my version more or less requires the URL and can maintain persistence between sessions (for my needs), but a lot of selectors and functionality have been improved. It is not a direct one-to-one replacement, though.