kevinzg / facebook-scraper

Scrape Facebook public pages without an API key
MIT License
2.47k stars 635 forks

INFO:facebook_scraper.page_iterators:Page parser did not find next page URL #366

Open maorkavod opened 3 years ago

maorkavod commented 3 years ago

Hi, the facebook-scraper package has been updated to version 0.23.43. Since then, I have not been able to scrape any group. The script just stops, and in debug mode I see this message:

INFO:facebook_scraper.page_iterators:Page parser did not find next page URL

What could be the cause of this? I use 100 proxy servers and Facebook cookies. Everything went well yesterday.

I appreciate your help.

neon-ninja commented 3 years ago

What group are you having this problem with?

print(len(list(get_posts(group=117507531664134, cookies="cookies.txt", options={"allow_extra_requests": False}))))

returns 357, so this group is working fine for me.

maorkavod commented 3 years ago

@neon-ninja

I am trying to crawl this Facebook group:

https://www.facebook.com/groups/287564448778602/

maorkavod commented 3 years ago

@neon-ninja This is happening to me in every group now :-/ I will try to provide a full debug log here.

neon-ninja commented 3 years ago

Try without cookies? Or try joining the group with your scraping account.

maorkavod commented 3 years ago

It works fine on my localhost, but there is a problem when I run it from my server. Error message: INFO:facebook_scraper.page_iterators:Page parser did not find next page URL

I use 100 paid proxy servers that work well and are fast, and I switch the proxy address on every iteration. When the debug output shows the message above, the code just exits the function and returns zero posts. Next, I will investigate what is happening on the server.

maorkavod commented 3 years ago

@neon-ninja It worked today on several groups. I've changed my IP address and user agent, and tried both with and without cookies. I'm still getting the same error, and it seems the code just exits after the "warning" in the debug output.

Would it be possible for you to try crawling this page? https://www.facebook.com/groups/287564448778602/

maorkavod commented 3 years ago

@neon-ninja

Hi, now I can reproduce it on my localhost as well. In the debug output, I can see all posts crawled as follows:

[Screenshot: 2021-06-24 at 21:57:06]

Unfortunately, the function simply returns an object of this weird type, with no posts.

[Screenshot: 2021-06-24 at 21:54:42]

neon-ninja commented 3 years ago

I can see the error in your screenshot: you're temporarily banned. get_posts is supposed to return a generator, so that part is normal.
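To illustrate why a generator can look like a "weird" object with no posts: a generator does no work until it is iterated, and if its source has nothing to yield, converting it to a list gives an empty list. The sketch below uses a hypothetical stand-in, fake_get_posts, instead of the real get_posts, so that it runs without network access; the real function's internals are not shown here.

```python
import types
from typing import Dict, Iterator, List


def fake_get_posts(pages: List[List[Dict[str, str]]]) -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for get_posts: lazily yields posts page by page."""
    for page in pages:
        for post in page:
            yield post


# Calling the function only creates a generator object; nothing is produced yet.
posts = fake_get_posts([[{"post_id": "1"}, {"post_id": "2"}], [{"post_id": "3"}]])
is_generator = isinstance(posts, types.GeneratorType)

# Iterating (here via list()) is what actually produces the posts.
collected = list(posts)

# With no pages available, a generator is still returned, but it yields nothing.
empty = list(fake_get_posts([]))

print(is_generator, len(collected), len(empty))
```

So seeing a generator object is expected; an empty result after iterating it is the actual symptom to investigate.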

maorkavod commented 3 years ago

@neon-ninja

I found that the URL the scraper package requests to get image hrefs keeps getting blocked. I will try to find an alternative method to get the image href.

neon-ninja commented 3 years ago

The temporary ban is caused by excessive scraping. Either scrape less or scrape more slowly. Alternatively, just disable extra requests, and then the scraper won't even try to get the high-quality image.
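One way to "scrape slower" can be sketched as a small wrapper that pauses between items. This is only an illustration: throttled and its parameters are hypothetical names, not part of facebook_scraper, and the delay value that avoids a ban is not known.

```python
import time
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def throttled(
    items: Iterable[T],
    delay_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Yield items one at a time, pausing after each one is consumed.

    The pause happens before the next item is requested from the source,
    so a lazy source is asked for more data less often.
    """
    for item in items:
        yield item
        sleep(delay_seconds)


# Demonstration with a recording stand-in for time.sleep, so nothing really waits.
recorded_delays: List[float] = []
result = list(throttled([1, 2, 3], 2.0, sleep=recorded_delays.append))
print(result, recorded_delays)
```

With the real library, one would presumably wrap the generator returned by get_posts in the same way, on the assumption that it fetches pages lazily as it is consumed.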

maorkavod commented 3 years ago

@neon-ninja

I need only the text and image, as well as the post id. If I pass these options:

options = { "comments": False, "reactors": False, "reactions": False, "allow_extra_requests": False, }

Will I still get the properties I need?

neon-ninja commented 3 years ago

You would get low-quality images, and longer text posts might be cut off.

maorkavod commented 3 years ago

In the get_posts function, I passed the following options:

options = { "comments": False, "reactors": False, "reactions": False, "allow_extra_requests": False, }

I can't seem to get the image; everything else works fine. Here's the group URL: https://m.facebook.com/groups/Pishpeshuk.Nadlan

Could you test it and see whether you receive any images? Might this be caused by Facebook's temporary block rather than by a different HTML issue?

neon-ninja commented 3 years ago

Yes, it works fine for me. This code:

pprint(next(get_posts(group="Pishpeshuk.Nadlan", options={"allow_extra_requests": False})))

returns

{'available': True,
 'comments': 0,
 'comments_full': None,
 'factcheck': None,
 'image': None,
 'image_lowquality': 'https://scontent.fakl1-2.fna.fbcdn.net/v/t1.6435-9/fr/cp0/e15/q65/105610023_3682686048416784_7502958421092527351_n.jpg?_nc_cat=109&ccb=1-3&_nc_sid=ca434c&_nc_ohc=GcadebOLb5UAX9IjgIs&_nc_ht=scontent.fakl1-2.fna&tp=14&oh=b2e5736d4f5d51be28780a439874160c&oe=60DDE0F1',
 'images': None,
 'images_description': None,
 'images_lowquality': ['https://scontent.fakl1-2.fna.fbcdn.net/v/t1.6435-9/fr/cp0/e15/q65/105610023_3682686048416784_7502958421092527351_n.jpg?_nc_cat=109&ccb=1-3&_nc_sid=ca434c&_nc_ohc=GcadebOLb5UAX9IjgIs&_nc_ht=scontent.fakl1-2.fna&tp=14&oh=b2e5736d4f5d51be28780a439874160c&oe=60DDE0F1'],
 'images_lowquality_description': ['No photo description available.'],
 'is_live': False,
 'likes': 25,
 'link': None,
 'post_id': '3009648419116144',
 'post_text': 'ברוכים הבאים לפשפשוק דירות-להשכרה/מכירה , בקבוצה ניתן לפרסם '
              'דירות למכירה/להשכרה, סאבלט ושותפים בכל הארץ, דירות גן, '
              "פנטהאוזים, דירות יוקרה, מחסנים, בתים, וילות , קוטג'ים ונכסים "
              'עסקיים.\n'
              '\n'
              '*מתווכים שימו לב - מתווכים מורשים לפרסם מודעה אחת ביום בלבד, '
              'במידה ומתווך יפרסם מעל מודעה אחת - הוא יוסר מהלוח.\n'
              '\n'
              '*אל תחסכו בפרטים - כתבו האם מדובר במכירה/השכרה, סוג הנכס, אזור '
              'הנכס, מצב הנכס ועוד פרטים שיוכלו לעזור לרוכשים.\n'
              '\n'
              '*אנחנו נגד ספאם - פרסומים שלא קשורים לרוח הקבוצה, יוסרו והגולש '
              'יורחק לצמיתות.\n'
              '\n'
              '*יש להקפיד על פרסום תמונות + מחיר למען נוחיות הגולשים .\n'
              '\n'
              'גלישה נעימה ופשפוש מהנה לכולם,\n'
              '\n'
              'פשפשוק דירות להשכרה/מכירה',
 'post_url': 'https://m.facebook.com/groups/Pishpeshuk.Nadlan/permalink/3009648419116144/',
 'reaction_count': None,
 'reactions': None,
 'reactors': None,
 'shared_post_id': None,
 'shared_post_url': None,
 'shared_text': '',
 'shared_time': None,
 'shared_user_id': None,
 'shared_username': None,
 'shares': 1,
 'text': 'ברוכים הבאים לפשפשוק דירות-להשכרה/מכירה , בקבוצה ניתן לפרסם דירות '
         'למכירה/להשכרה, סאבלט ושותפים בכל הארץ, דירות גן, פנטהאוזים, דירות '
         "יוקרה, מחסנים, בתים, וילות , קוטג'ים ונכסים עסקיים.\n"
         '\n'
         '*מתווכים שימו לב - מתווכים מורשים לפרסם מודעה אחת ביום בלבד, במידה '
         'ומתווך יפרסם מעל מודעה אחת - הוא יוסר מהלוח.\n'
         '\n'
         '*אל תחסכו בפרטים - כתבו האם מדובר במכירה/השכרה, סוג הנכס, אזור הנכס, '
         'מצב הנכס ועוד פרטים שיוכלו לעזור לרוכשים.\n'
         '\n'
         '*אנחנו נגד ספאם - פרסומים שלא קשורים לרוח הקבוצה, יוסרו והגולש יורחק '
         'לצמיתות.\n'
         '\n'
         '*יש להקפיד על פרסום תמונות + מחיר למען נוחיות הגולשים .\n'
         '\n'
         'גלישה נעימה ופשפוש מהנה לכולם,\n'
         '\n'
         'פשפשוק דירות להשכרה/מכירה',
 'time': datetime.datetime(2020, 6, 26, 6, 11, 47),
 'user_id': '407003609318394',
 'user_url': 'https://facebook.com/pishpeshuk.co.il/?refid=18&_ft_=top_level_post_id.3009648419116144%3Acontent_owner_id_new.407003609318394%3Apage_id.407003609318394%3Astory_location.6%3Astory_attachment_style.photo%3Atds_flgs.3%3Aott.AX9SGtYoZmErvghw%3Apage_insights.%7B%22407003609318394%22%3A%7B%22page_id%22%3A407003609318394%2C%22page_id_type%22%3A%22page%22%2C%22actor_id%22%3A407003609318394%2C%22dm%22%3A%7B%22isShare%22%3A0%2C%22originalPostOwnerID%22%3A0%7D%2C%22psn%22%3A%22EntGroupMallPostCreationStory%22%2C%22post_context%22%3A%7B%22object_fbtype%22%3A657%2C%22publish_time%22%3A1593108707%2C%22story_name%22%3A%22EntGroupMallPostCreationStory%22%2C%22story_fbid%22%3A%5B3009648419116144%5D%7D%2C%22role%22%3A1%2C%22sl%22%3A6%2C%22targets%22%3A%5B%7B%22actor_id%22%3A407003609318394%2C%22page_id%22%3A407003609318394%2C%22post_id%22%3A3009648419116144%2C%22role%22%3A1%2C%22share_id%22%3A0%7D%5D%7D%2C%22303803599700653%22%3A%7B%22page_id%22%3A303803599700653%2C%22page_id_type%22%3A%22group%22%2C%22actor_id%22%3A407003609318394%2C%22dm%22%3A%7B%22isShare%22%3A0%2C%22originalPostOwnerID%22%3A0%7D%2C%22psn%22%3A%22EntGroupMallPostCreationStory%22%2C%22post_context%22%3A%7B%22object_fbtype%22%3A657%2C%22publish_time%22%3A1593108707%2C%22story_name%22%3A%22EntGroupMallPostCreationStory%22%2C%22story_fbid%22%3A%5B3009648419116144%5D%7D%2C%22role%22%3A1%2C%22sl%22%3A6%7D%7D&__tn__=C-R',
 'username': 'פשפשוק',
 'video': None,
 'video_duration_seconds': None,
 'video_height': None,
 'video_id': None,
 'video_quality': None,
 'video_size_MB': None,
 'video_thumbnail': None,
 'video_watches': None,
 'video_width': None,
 'w3_fb_url': None}
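
In the output above, 'image' is None while 'image_lowquality' holds a URL, which is what happens with extra requests disabled. A minimal sketch of reading the image with a fallback follows; pick_image and summarize are hypothetical helper names, and the sample dictionary is a shortened, made-up stand-in for a real post that uses only keys shown in the output.

```python
from typing import Any, Dict, Optional


def pick_image(post: Dict[str, Any]) -> Optional[str]:
    """Return the full-quality image URL if present, else the low-quality one."""
    return post.get("image") or post.get("image_lowquality")


def summarize(post: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the post id, text, and best available image URL."""
    return {
        "post_id": post.get("post_id"),
        "text": post.get("text"),
        "image": pick_image(post),
    }


# Made-up sample with the same key names as the real output.
sample = {
    "post_id": "3009648419116144",
    "text": "example text",
    "image": None,
    "image_lowquality": "https://example.com/low.jpg",
}
summary = summarize(sample)
print(summary)
```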

maorkavod commented 3 years ago

The property works great! I appreciate your help. Compared to paid services such as proxycrawel, your Facebook scraper works better :)