shaikhsajid1111 / facebook_page_scraper

Scrapes the front end of Facebook pages with no limitations & provides a feature to turn data into structured JSON or CSV
https://pypi.org/project/facebook-page-scraper/
MIT License

CRITICAL - No posts were found! #91

Open talatoncu opened 8 months ago

talatoncu commented 8 months ago

I used the example you have given.

```python
# import Facebook_scraper class from facebook_page_scraper
from facebook_page_scraper import Facebook_scraper

# instantiate the Facebook_scraper class
page_name = "##MYNAME##"
posts_count = 10
browser = "firefox"
proxy = ""  # if proxy requires authentication then user:password@IP:PORT
timeout = 600  # 600 seconds
headless = True
meta_ai = Facebook_scraper(page_name, posts_count, browser, proxy=proxy, timeout=timeout, headless=headless)

json_data = meta_ai.scrap_to_json()
print(json_data)
```

The following messages appear, and I get no posts:

```
2024-01-04 09:53:29,565 - facebook_page_scraper.driver_initialization - INFO - Using:
[WDM] - There is no [win64] geckodriver for browser in cache
[WDM] - Getting latest mozilla release info for v0.34.0
[WDM] - Trying to download new driver from https://github.com/mozilla/geckodriver/releases/download/v0.34.0/geckodriver-v0.34.0-win64.zip
[WDM] - Driver has been saved in cache [C:\Users\Talat Oncu\.wdm\drivers\geckodriver\win64\v0.34.0]
2024-01-04 09:54:31,409 - facebook_page_scraper.driver_utilities - CRITICAL - No posts were found!
Exit code: 1
```

Then I tried it for NintendoAmerica:

```python
# import Facebook_scraper class from facebook_page_scraper
from facebook_page_scraper import Facebook_scraper

# instantiate the Facebook_scraper class
page_name = "NintendoAmerica"
posts_count = 10
browser = "firefox"
proxy = ""  # if proxy requires authentication then user:password@IP:PORT
timeout = 600  # 600 seconds
headless = True
meta_ai = Facebook_scraper(page_name, posts_count, browser, proxy=proxy, timeout=timeout, headless=headless)

json_data = meta_ai.scrap_to_json()
print(json_data)
```

The program gives the message

```
2024-01-04 10:11:18,586 - facebook_page_scraper.driver_initialization - INFO - Using:
[WDM] - Driver [C:\Users\Talat Oncu\.wdm\drivers\geckodriver\win64\v0.34.0\geckodriver.exe] found in cache
```

and waits indefinitely.

gayathriravipati commented 8 months ago

I have the same issue. I checked what's happening by setting headless to False.

I can see that the browser doesn't log in, and the following appears in the terminal:

```
2024-01-04 16:02:11,918 - facebook_page_scraper.driver_utilities - CRITICAL - No posts were found!
```
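Concretely, that check is just the first example above with headless set to False; the other variables are unchanged:

```python
# Same call as in the first example, but with a visible browser window for debugging
meta_ai = Facebook_scraper(page_name, posts_count, browser, proxy=proxy, timeout=timeout, headless=False)
json_data = meta_ai.scrap_to_json()
```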

Can anyone help figure out what can be done about this? Thank you!

ExpiredMeteor6 commented 8 months ago

Hi all, I have the same issue when running on Ubuntu, but not on Windows 11! Instead of the usual login widget with the X in the top-right corner, we get a separate page which requires a login before redirecting to the desired page.

If there were a way to log in, the webdriver would remember it and we would not get this issue; unfortunately, everything I tried for this doesn't work. I have managed to solve the issue by coding my own Facebook scraper using a Chrome driver that loads a specific user data profile, but I would prefer to use this project if we can get a patch, as it's less for me to maintain :D
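A rough sketch of that workaround, assuming plain Selenium with Chrome: the user-data-dir path is a placeholder, the page name is taken from the examples above, and the profile must already be logged in to Facebook:

```python
# Hypothetical sketch: drive Chrome with an existing, already-logged-in user profile.
# The paths below are placeholders; this is not part of facebook_page_scraper itself.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--user-data-dir=/home/me/.config/google-chrome")  # existing Chrome user data dir
options.add_argument("--profile-directory=Default")                     # profile inside that dir

driver = webdriver.Chrome(options=options)
driver.get("https://www.facebook.com/NintendoAmerica")  # page from the examples above
print(driver.title)  # if the profile is logged in, no separate login page should appear
driver.quit()
```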

Thanks

ExpiredMeteor6 commented 8 months ago

Following on from this, I tried using a UK proxy, which worked and produced the desired outcome.

GazTrab commented 8 months ago

> Following on from this, I tried using a UK proxy, which worked and produced the desired outcome.

Could you tell a noob like me how to set the proxy to the UK?

ExpiredMeteor6 commented 8 months ago

> Following on from this, I tried using a UK proxy, which worked and produced the desired outcome.
>
> Could you tell a noob like me how to set the proxy to the UK?

```python
proxy = 'exampleproxy:exampleport'
Facebook_scraper(page_name, posts_count, browser, proxy=proxy, timeout=timeout, headless=headless)
```
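If the proxy requires authentication, the comments in the examples above suggest the user:password@IP:PORT form. With made-up values (the address below is from a documentation range, not a real endpoint), that would look like:

```python
# Hypothetical UK proxy endpoint: substitute a real host, port, and credentials
proxy = "user:password@203.0.113.45:8080"  # user:password@IP:PORT form from the example comments
meta_ai = Facebook_scraper(page_name, posts_count, browser, proxy=proxy, timeout=timeout, headless=headless)
```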

shaikhsajid1111 commented 8 months ago

@ExpiredMeteor6 Yes, using a Chrome profile that is already logged in will unblock you. Unfortunately, I cannot make that feature part of this project, since the project claims to scrape only publicly available data.

testproto commented 8 months ago

@shaikhsajid1111 Is there any exception I can import into my code to handle this error?

```
[WDM] - Driver [C:\Users\manrkaur\.wdm\drivers\geckodriver\win64\v0.34.0\geckodriver.exe] found in cache
2024-02-02 16:10:25,737 - facebook_page_scraper.driver_utilities - CRITICAL - No posts were found!
```

```python
import json

from facebook_page_scraper import Facebook_scraper


def scrape_facebook_data(page_names, posts_count=10, browser="firefox", proxy=None, timeout=600, headless=True):
    """
    Scrapes Facebook data for the given page names.

    Parameters:
    - page_names: List of Facebook page names
    - posts_count: Number of posts to scrape per page
    - browser: Browser to use (e.g., "firefox")
    - proxy: Proxy information (e.g., "IP:PORT" or None)
    - timeout: Timeout in seconds
    - headless: Whether to run the browser in headless mode

    Returns:
    - A dictionary containing the scraped data for each page
    """
    scraped_data = {}

    for page_name in page_names:
        # Instantiate the Facebook_scraper class
        meta_ai = Facebook_scraper(page_name, posts_count, browser, proxy=proxy, timeout=timeout, headless=headless)

        # Scrape data and convert it to JSON
        json_data_str = meta_ai.scrap_to_json()

        # Parse the JSON string into a dictionary
        json_data = json.loads(json_data_str)

        # Collect post information
        posts_array = []

        # Iterate through each post and append it to the array
        for post_id, post_data in json_data.items():
            time = post_data.get('posted_on', "")
            content = post_data.get("content", "")
            reaction_count = post_data.get('reaction_count', "")
            comments = post_data.get('comments', "")

            # Only keep posts with non-empty content
            if content:
                posts_array.append({
                    # "Post ID": post_id,
                    "Content": content,
                    "Posted on": time,
                    "reaction_count": reaction_count,
                    "comments": comments
                })

        # Store the array for the current page in the result dictionary
        scraped_data[page_name] = posts_array

    return scraped_data
```
shaikhsajid1111 commented 7 months ago

@testproto There isn't any custom exception that it throws when no posts are found. You could write a wrapper function over this with try/except, if I'm understanding your requirement correctly.
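A minimal sketch of such a wrapper, assuming the library terminates via sys.exit() when no posts are found (the first log above ends with "Exit code: 1", which hints at that). The name try_scrape is hypothetical; note that except Exception does NOT catch SystemExit, so it has to be handled explicitly:

```python
import json

from facebook_page_scraper import Facebook_scraper


def try_scrape(page_name, posts_count=10, browser="firefox", timeout=600, headless=True):
    """Hypothetical wrapper: return parsed posts for page_name, or None instead of exiting."""
    try:
        scraper = Facebook_scraper(page_name, posts_count, browser, timeout=timeout, headless=headless)
        return json.loads(scraper.scrap_to_json())
    except SystemExit:
        # If the library calls sys.exit(1) after logging "No posts were found!",
        # that raises SystemExit, which "except Exception" does not catch.
        print(f"No posts found for '{page_name}'; skipping.")
        return None
    except Exception as e:
        print(f"Error while scraping '{page_name}': {e}")
        return None
```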

testproto commented 7 months ago

> @testproto There isn't any custom exception that it throws when no posts are found. You could write a wrapper function over this with try/except, if I'm understanding your requirement correctly.

It throws an error when a page is private, so how can I handle that scenario? Could you please help me with that, @shaikhsajid1111?

testproto commented 7 months ago

> @testproto There isn't any custom exception that it throws when no posts are found. You could write a wrapper function over this with try/except, if I'm understanding your requirement correctly.

```python
from facebook_page_scraper import Facebook_scraper
from facebook_page_scraper.driver_utilities import Utilities  # Importing the Utilities class from your module
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import json
import re
import requests

import logging

logging.basicConfig(level=logging.INFO)  # Set the logging level to INFO or higher


def extract_facebook_page_name(url):
    """
    Extracts the Facebook page name from a given URL.

    Parameters:
    - url: URL of the website

    Returns:
    - Facebook page name if found, otherwise None
    """
    try:
        # Make a direct request and check the response status
        response = requests.get(url)
        if response.status_code == 200:
            page_source = response.text
        else:
            # Use Selenium to get the page source if the direct request fails
            chrome_options = Options()
            chrome_options.add_argument('--headless')
            driver = webdriver.Chrome(options=chrome_options)
            driver.get(url)
            page_source = driver.page_source
            driver.quit()
    except Exception as e:
        print(f"Error: {e}")
        return None

    # Use BeautifulSoup to parse the HTML and find a Facebook page link
    soup = BeautifulSoup(page_source, 'html.parser')
    facebook_link = soup.find('a', href=re.compile(r'facebook\.com', re.IGNORECASE))

    if facebook_link:
        # Extract the page name from the Facebook link
        match = re.search(r'facebook\.com/([^/?]+)', facebook_link['href'])
        if match:
            # Check if the page is private
            if "page doesn't exist" in page_source or "The link you followed may be broken, or the page may have been removed" in page_source:
                print(f"The Facebook page at {url} is either private or does not exist.")
                return None
            else:
                return match.group(1)

    return None


def scrape_facebook_data(page_names, posts_count=10, browser="firefox", proxy=None, timeout=600, headless=True):
    """
    Scrapes Facebook data for the given page names.

    Parameters:
    - page_names: List of Facebook page names
    - posts_count: Number of posts to scrape per page
    - browser: Browser to use (e.g., "firefox")
    - proxy: Proxy information (e.g., "IP:PORT" or None)
    - timeout: Timeout in seconds
    - headless: Whether to run the browser in headless mode

    Returns:
    - A dictionary containing the scraped data for each page, or None if no posts are found
    """
    scraped_data = {}

    for page_name in page_names:
        # Instantiate the Facebook_scraper class
        meta_ai = Facebook_scraper(page_name, posts_count, browser, proxy=proxy, timeout=timeout, headless=headless)

        try:
            # Scrape data and convert it to JSON
            json_data_str = meta_ai.scrap_to_json()

            # Parse the JSON string into a dictionary
            json_data = json.loads(json_data_str)

            # Collect post information
            posts_array = []

            # Iterate through each post and append it to the array
            for post_id, post_data in json_data.items():
                time = post_data.get('posted_on', "")
                content = post_data.get("content", "")
                reaction_count = post_data.get('reaction_count', "")
                comments = post_data.get('comments', "")

                # Only keep posts with non-empty content
                if content:
                    posts_array.append({
                        # "Post ID": post_id,
                        "Content": content,
                        "Posted on": time,
                        "reaction_count": reaction_count,
                        "comments": comments
                    })

            # Store the array for the current page in the result dictionary
            scraped_data[page_name] = posts_array

        except Exception as e:
            # Log the error as critical
            print(f"Error scraping data for page '{page_name}': {e}")
            continue  # Continue to the next page if an error occurs

    # Check if any data was scraped
    if not scraped_data:
        print("No posts were found for any of the provided pages.")
        return None

    return scraped_data


def getSocialMedia(urls, posts_count=10, browser="firefox", proxy=None, timeout=600, headless=True):
    """
    Scrapes Facebook data for the given URLs.

    Parameters:
    - urls: List of website URLs
    - posts_count: Number of posts to scrape per page
    - browser: Browser to use (e.g., "firefox")
    - proxy: Proxy information (e.g., "IP:PORT" or None)
    - timeout: Timeout in seconds
    - headless: Whether to run the browser in headless mode

    Returns:
    - A dictionary containing the scraped data for each page
    """
    page_names = []

    for url in urls:
        # Extract the Facebook page name from the URL
        page_name = extract_facebook_page_name(url)

        if page_name:
            page_names.append(page_name)

    # Scrape Facebook data using the extracted page names
    result = scrape_facebook_data(page_names, posts_count, browser, proxy, timeout, headless)

    return result


# Set up logging configuration

# Example usage:
if __name__ == "__main__":
    # List of website URLs
    urls = ['https://testmatick.com/', 'https://www.a1qa.com/']

    # Common configuration for scraping
    posts_count = 10
    browser = "firefox"
    proxy = "IP:PORT"  # if proxy requires authentication then user:password@IP:PORT
    timeout = 600  # 600 seconds
    headless = True

    # Dictionary to store scraped data
    result = {}

    for url in urls:
        # Extract the Facebook page name from the URL
        page_name = extract_facebook_page_name(url)

        if page_name:
            try:
                # Scrape Facebook data for the current URL
                page_data = scrape_facebook_data([page_name], posts_count, browser, proxy, timeout, headless)

                if page_data:
                    # Add the scraped data to the result dictionary
                    result.update(page_data)
                else:
                    print(f"No posts found for URL: {url}")
                    continue  # Continue to the next URL if no posts are found

            except Exception as e:
                print(f"Error scraping data for URL '{url}': {e}")
                continue  # Continue to the next URL if an error occurs

        else:
            print(f"No Facebook page found for URL: {url}")
            continue  # Continue to the next URL if no Facebook page is found

    # Check if result is empty and return None if it is
    if not result:
        print("No Facebook data found for the provided URLs.")
        result = None

    # Print the result
    print(json.dumps(result, indent=2))
```

**See, I am using a try/except block, but the code exits after checking testmatick instead of going on to the next URL; it just stops with "CRITICAL - No posts were found!"**
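If the library really does end the run with sys.exit(1) after the CRITICAL log (the "Exit code: 1" in the first comment suggests so), then SystemExit is what's escaping: it derives from BaseException, so except Exception does not catch it. A sketch of the fix, as a drop-in for the try/except inside the "for url in urls:" loop above:

```python
# Drop-in replacement for the try/except inside the "for url in urls:" loop above.
# Assumes the library exits via sys.exit() when it logs "No posts were found!".
try:
    page_data = scrape_facebook_data([page_name], posts_count, browser, proxy, timeout, headless)
except SystemExit:
    # sys.exit() raises SystemExit, which "except Exception" does not catch
    print(f"Scraper exited for URL '{url}' (no posts found); continuing.")
    continue
except Exception as e:
    print(f"Error scraping data for URL '{url}': {e}")
    continue
```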