kevinzg / facebook-scraper

Scrape Facebook public pages without an API key

Not all comment text scraped on a fairly new post on a Facebook group #1111

Closed jkapali closed 2 months ago

jkapali commented 2 months ago

Hi, thank you for this amazing library. I'm running into issues with the scraper on a group post that was made yesterday.

When reading post['comments'] I get the correct comment count; however, when iterating over post['comments_full'], I only get the text of the first 40 comments on each run. This only happens on the new post, and the progress bar does not increment for it either.

However, when I run this on an older post in the same Facebook group, I get the full set of comment text, and the progress bar increments properly. Is this a rate-limiting issue, or an error in how I'm calling get_posts? It's odd that the old post works perfectly fine while the new post doesn't return all the comments. See below for the log on the new post (I couldn't capture the log for the old post because I got temporarily banned).

Group ID: hawaiitracker
New post: 2849538805211919
Old post: 2561008114064991
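
For reference, this is roughly how I compare the reported count with what comments_full actually yields. It's just a minimal sketch: cookies.txt stands in for my exported cookie file, and the post URL is the one that appears in the log below.

import facebook_scraper as fs

post = next(fs.get_posts(
    post_urls=["https://m.facebook.com/groups/hawaiitracker/posts/2849538805211919/"],
    cookies="cookies.txt",
    options={"comments": True, "progress": True, "allow-extra-requests": False},
))

# 'comments' is the count reported on the post; 'comments_full' holds the scraped comment dicts
print("reported:", post['comments'])
print("scraped:", len(post['comments_full']))

On the new post this reports 229 comments but only around 40 scraped entries; on the old post the two numbers match.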

This is the log for the new post (note the progress bar):

Requesting page from: https://m.facebook.com/groups/hawaiitracker/posts/2849538805211919/
Fetching https://m.facebook.com/photo.php?fbid=1545241029678366&id=100025774523185&set=gm.2849538805211919&eav=AfbkS2KQHc_hvoQ7MStix0Nrz1uG6seXIT95WjeObJtFDcz_1P8jD1HU5rjgyGgoYDo&paipv=0&source=57&refid=18&__tn__=EH-R
Fetching https://m.facebook.com/photo/view_full_size/?fbid=1545241029678366&ref_component=mbasic_photo_permalink&ref_page=%2Fwap%2Fphoto.php&refid=13&__tn__=%2Cg&paipv=0&eav=AfZ4kt9N71DWYlUz4HQuU3nRH6SV458--aW-1D9AvBhhm_70z4MO-KhY3u3k6alAno0
[2849538805211919] Exception while running extract_user_id: KeyError('content_owner_id_new')
[2849538805211919] Extract method extract_video didn't return anything
[2849538805211919] Extract method extract_video_thumbnail didn't return anything
[2849538805211919] Extract method extract_video_id didn't return anything
[2849538805211919] Extract method extract_video_meta didn't return anything
[2849538805211919] Extract method extract_factcheck didn't return anything
[2849538805211919] Extract method extract_share_information didn't return anything
[2849538805211919] Extract method extract_listing didn't return anything
Fetching up to 1000000000.0 comments
  0%|                                                                                               | 0/1000000000.0 [00:00<?, ?it/s]
229 total comments
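
For anyone reproducing this, the messages above show up once logging is enabled for the library's loggers. A minimal sketch using the standard logging module (the logger name facebook_scraper is an assumption based on the package name):

import logging

# route the scraper's internal debug/warning messages to the console
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("facebook_scraper").setLevel(logging.DEBUG)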

This is the code where I call get_posts:

import csv
import facebook_scraper as fs

def scraper(post_url_id, cookie_file):
    fullFile = open("full_comments.txt", 'w')  # .txt file for LLM reference
    rawFile = open(raw_data_file(), 'w', newline='')  # .csv file for our reference; raw_data_file() returns the csv path
    csv_writer = csv.writer(rawFile)
    csv_writer.writerow(['commenter_name', 'comment_text', 'comment_url', 'comment_timestamp'])  # csv header

    # get the post (get_posts returns a generator)
    gen = fs.get_posts(
        post_urls=[post_url_id],
        cookies=cookie_file,
        options={"comments": True, "progress": True, "allow-extra-requests": False}
    )

    # take the first element of the generator, which is the post we requested
    post = next(gen)

    # extract the scraped comments
    comments = post['comments_full']

    # total comment count reported by the post
    print(post['comments'])

    for comment in comments:
        # build the comment string for the txt file
        text = "'" + comment['comment_text'] + "' at " + str(comment['comment_time'])
        text = text.replace('\n', ' ')  # remove all hard returns from the string
        fullFile.write(text + "\n")

        # write the raw data to the csv file
        csv_writer.writerow([
            comment['commenter_name'],
            comment['comment_text'],
            comment['comment_url'],
            comment['comment_time']
        ])

    fullFile.close()
    rawFile.close()
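
And this is how I invoke it, passing the same post URL that appears in the log (cookies.txt is my exported browser cookie file):

scraper(
    "https://m.facebook.com/groups/hawaiitracker/posts/2849538805211919/",
    "cookies.txt",
)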