kevinzg / facebook-scraper

Scrape Facebook public pages without an API key
MIT License

incorrect operation of comment extraction #264

Closed AntonGisin closed 3 years ago

AntonGisin commented 3 years ago

Hi guys, thank you for the web scraper! It looks really nice, but unfortunately it doesn't show comments. I ran this code:

from facebook_scraper import get_posts

for post in get_posts('FoxNews', pages=3, options={"comments": True}):
    print(post['text'])
    # comments_full is None when no comments were extracted
    if post['comments_full'] is not None:
        print("Comments:")
        for comment in post['comments_full']:
            print(comment['comment_text'])

but all posts look like this:

{'post_id': '10160674691536336', 'text': 'Israeli Prime Minister Benjamin Netanyahu said Sunday that his country is aiming to "degrade" Hamas to prevent future attacks.\n\nFOXNEWS.COM\nNetanyahu says Israel wants to \'degrade\' Hamas\' will, warns campaign will continue', 'post_text': 'Israeli Prime Minister Benjamin Netanyahu said Sunday that his country is aiming to "degrade" Hamas to prevent future attacks.', 'shared_text': "FOXNEWS.COM\nNetanyahu says Israel wants to 'degrade' Hamas' will, warns campaign will continue", 'time': datetime.datetime(2021, 5, 16, 21, 21, 34), 'image': 'https://static.foxnews.com/foxnews.com/content/uploads/2021/05/Netanyahu-Israel-Palestinian-Conflict-AP.jpg', 'image_lowquality': 'https://external.fhel6-1.fna.fbcdn.net/safe_image.php?d=AQGZSq_SQm-5-olG&w=476&h=249&url=https%3A%2F%2Fstatic.foxnews.com%2Ffoxnews.com%2Fcontent%2Fuploads%2F2021%2F05%2FNetanyahu-Israel-Palestinian-Conflict-AP.jpg&cfs=1&jq=75&ext=jpg&ccb=3-5&_nc_hash=AQGFWWTkKlZtr6HJ', 'images': ['https://static.foxnews.com/foxnews.com/content/uploads/2021/05/Netanyahu-Israel-Palestinian-Conflict-AP.jpg'], 'images_description': [], 'images_lowquality': ['https://external.fhel6-1.fna.fbcdn.net/safe_image.php?d=AQGZSq_SQm-5-olG&w=476&h=249&url=https%3A%2F%2Fstatic.foxnews.com%2Ffoxnews.com%2Fcontent%2Fuploads%2F2021%2F05%2FNetanyahu-Israel-Palestinian-Conflict-AP.jpg&cfs=1&jq=75&ext=jpg&ccb=3-5&_nc_hash=AQGFWWTkKlZtr6HJ'], 'images_lowquality_description': [None], 'video': None, 'video_duration_seconds': None, 'video_height': None, 'video_id': None, 'video_quality': None, 'video_size_MB': None, 'video_thumbnail': None, 'video_watches': None, 'video_width': None, 'likes': 890, 'comments': 2917, 'shares': 656, 'post_url': 'https://facebook.com/FoxNews/posts/10160674691536336', 'link': 'https://www.foxnews.com/world/netanyahu-israel-degrade-hamas?cmpid=fb_fnc', 'user_id': '15704546335', 'username': 'Fox News', 'user_url': 'https://facebook.com/FoxNews/?__tn__=C-R', 'is_live': 
False, 'factcheck': None, 'shared_post_id': None, 'shared_time': None, 'shared_user_id': None, 'shared_username': None, 'shared_post_url': None, 'available': True, 'comments_full': None, 'reactors': None, 'w3_fb_url': None}

The post reports 2917 comments, but comments_full is None. Could you please help me with this issue? Thank you!

neon-ninja commented 3 years ago

Hi, I think at this sort of scale, you might need to pass in cookies as per the readme. Additionally, when visiting https://m.facebook.com/FoxNews/posts/10160674691536336, I note that the page shows the link "View previous comments" instead of "View more comments" like the scraper expected - https://github.com/kevinzg/facebook-scraper/commit/3ce3db342e75102f761eba8119bf26c71234f4ed should fix this form of pagination.
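For reference, a minimal sketch of passing a cookies file while requesting comments. The helper name `iter_comments` and the file name `cookies.txt` are assumptions made for illustration; the `cookies` and `options` arguments are the ones used elsewhere in this thread.

```python
def iter_comments(account, cookies_file="cookies.txt", pages=3):
    """Yield (post_id, comment_text) pairs, scraping with a cookies file."""
    # Imported lazily so this helper can be defined even where
    # facebook_scraper is not installed.
    from facebook_scraper import get_posts

    posts = get_posts(account, pages=pages, cookies=cookies_file,
                      options={"comments": True})
    for post in posts:
        # comments_full is None when comment extraction failed or was blocked,
        # so fall back to an empty list instead of iterating over None.
        for comment in post.get("comments_full") or []:
            yield post["post_id"], comment["comment_text"]
```

Calling `iter_comments('FoxNews')` would then loop over whatever comments were actually extracted and silently skip posts where none came back.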

AntonGisin commented 3 years ago

Thank you very much @neon-ninja! I exported my Facebook cookies to a txt file and passed them to the scraper. That helped for a short while, but after an hour it stopped showing comments again :( I tried refreshing the cookies file, but the result is the same. Could Facebook be blocking this for my IP?

neon-ninja commented 3 years ago

Yes - the scraper should throw an exception in that case. The comment extraction would handle the exception and log it. Try adding these lines:

from facebook_scraper import *
import logging
enable_logging(logging.DEBUG)

and report back any log messages.

AntonGisin commented 3 years ago

The log is here:

Parsing page response
Got 4 raw posts from page
Extracting posts from page 512
[10158356852134071] Extract method extract_video didn't return anything
[10158356852134071] Extract method extract_video_thumbnail didn't return anything
[10158356852134071] Extract method extract_video_id didn't return anything
Fetching https://m.facebook.com/businessinsider/posts/10158356852134071
[10158356852134071] Extract method extract_video_meta didn't return anything
[10158356852134071] Extract method extract_factcheck didn't return anything
[10158356852134071] Extract method extract_share_information didn't return anything
No comments found on page
[10158356852134071] Exception while extracting comments: TypeError("'NoneType' object is not iterable")
[10158356819979071] Extract method extract_link didn't return anything

Up to page 512 it collects comments properly, then it stops.

Thank you!
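As a rough sketch, the IDs of posts whose comment extraction failed can be pulled out of a debug log like the one above. The function name `failed_post_ids` is hypothetical, and the pattern assumes the log lines keep the exact "[post_id] Exception while extracting comments" shape shown here.

```python
import re

# Matches lines such as:
# [10158356852134071] Exception while extracting comments: TypeError(...)
FAILED = re.compile(r"^\[(\d+)\] Exception while extracting comments")


def failed_post_ids(log_lines):
    """Return post IDs whose comment extraction raised, in first-seen order."""
    seen = []
    for line in log_lines:
        match = FAILED.match(line.strip())
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen
```

Feeding it the lines of a saved log file gives a list of post IDs that can be retried separately.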

neon-ninja commented 3 years ago

from facebook_scraper import get_posts

posts = list(get_posts(
    post_urls=[10158356852134071],
    options={"comments": True},
    timeout=60,
    # cookies="cookies.txt",
))

This works fine, so the problem isn't the post itself but the volume of scraping you did before reaching it. Perhaps at this sort of scale you should keep a record of which posts failed to return comments, and come back to backfill them later, after whatever temporary block has worn off?
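A minimal sketch of that record-keeping, assuming each post is a dict with the `post_id`, `comments` and `comments_full` keys shown earlier in this thread. The helper names and the JSON file are illustrative only and not part of facebook-scraper.

```python
import json
from pathlib import Path


def needs_backfill(post):
    """True when the post reports comments but none were extracted."""
    return bool(post.get("comments")) and not post.get("comments_full")


def record_failures(posts, path):
    """Add the IDs of posts needing a comment backfill to a JSON file."""
    path = Path(path)
    # Start from IDs recorded on earlier runs, if the file already exists.
    failed = set(json.loads(path.read_text())) if path.exists() else set()
    failed.update(p["post_id"] for p in posts if needs_backfill(p))
    path.write_text(json.dumps(sorted(failed)))
    return sorted(failed)
```

Later, once the block has worn off, the recorded IDs could be passed back through `get_posts(post_urls=..., options={"comments": True})` as in the snippet above.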