moda20 / facebook-scraper

Scrape Facebook public pages without an API key
MIT License
77 stars, 28 forks

Suddenly unable to retrieve posts but not being blocked #27

Closed: AndyHsieh1020 closed this issue 6 months ago

AndyHsieh1020 commented 9 months ago

import random
import time

from facebook_scraper import get_posts

# col_posts, col_comments and MAX_COMMENTS are assumed to be defined elsewhere.

for post in get_posts('TWCDC', base_url="https://mbasic.facebook.com", start_url="https://mbasic.facebook.com/profile/timeline/stream/?cursor=AQHROdC7Jn1-OApvu8SQfCQpFFiRRjrjsQT3SlUWDXpuy5vvCHkneN1L9k74Dva6jFdo-IDhnhcc2xLOYKQ9aRQcsWLPbMcJC7wEclsUks5oos16Hiw8SwXqdrc-eIC5YEUaNTssisRsD6SdAZ6DB5aaIsv1nxc7bXiPF8wM30zRndYSz0xO6S8OTXD93inS40Px2bU4-L1U0ocHVLZjBNb7oBLvSIdmjdBM0SFzeQQjGPBcEWUJv2LkWy4RO1C0TsYCpw4ddMPM0s8q7yneb9bkk6SLyji-zRRWcyYJ5haFFFOF5uMM34Fj5Fh7b8QNE7SViy4bIVIC8JvwhDVQdelhis7wSJL0z_hAbZvRVV8x6yf3fxhpbYlPt88Axbe2eB1TDlwWCypAPQqrchbIvMk32I5pZ8ZR5vmMhZXjMeOe4nayYSjoB4wmFTIc6akN30P8HOmxnkqXDsAKLCaxkDsPHxyecc3b93DJaQJ_Vx0dqoIs-nt1C9ITx6oE0JNiBxXdbOi-C5iCd7vGTd3knTaqvZpnnLcZVDWoPFUg-ylPsUuQuXsRC6589947_CNcCZ-_c0uWncuydI65SyVRkCOCD_wtDWBjmX8-rCJRCMkEGC0X5fKukUjilAqyvZ1Cvd3gLDc4Yo-Hdswz3is1XGPw-URfm6mSsu328Xm9ekVNQonlMlwtBpBNWGMs1cYx2eBqQK7VJAl17UoNCH07KW7mSZksk2IAjOVzStys6vizdhk&profile_id=100064778138289&replace_id=u_0_0_Hq&paipv=0&eav=AfYMi2pHMfREEwXfCtoHIm6LuTlsytfbXaKlJq3ZwCrBK7sGPqo4MeB4VwxhoRTjXfk", pages=50, cookies = "www.facebook.com_cookies2.txt", options={"reactors": True, "posts_per_page": 200}):

    # Prefer the full text when the scraper provides it.
    if 'full_text' in post:
        post_txt = post['full_text']
    else:
        post_txt = post['text']

    if 'reactions' in post:
        reactions = post['reactions']
    else:
        reactions = ""

    post_time = post['time']

    # Store the post.
    my_post = {"time": post_time, "post": post_txt, "reactions": reactions}
    col_posts.insert_one(my_post)

    # Fetch this post again by ID to get its comments.
    POST_ID = str(post['post_id'])

    gen = get_posts(
        post_urls=[POST_ID],
        options={"comments": MAX_COMMENTS, "progress": True}
    )

    post_cmm = next(gen)

    comments = post_cmm['comments_full']

    for comment in comments:
        # Store each comment, pausing a random 10 to 60 seconds in between.
        comment_txt = comment['comment_text']
        comment_time = comment['comment_time']
        my_comment = {"post_id": post['post_id'], "time": comment_time, "post": comment_txt}
        col_comments.insert_one(my_comment)
        time.sleep(random.randint(10, 60))

    # Pause a random 10 to 180 seconds before the next post.
    time.sleep(random.randint(10, 180))

My program (shown above) ran successfully several times, fetching the specified number of pages each time. In later runs it suddenly started having problems: it either completes only one page or, sometimes, fails to retrieve any posts from the page. Another strange thing is that when posts cannot be obtained, the 'start_url' link itself does not display any posts (see the picture below), and the log shows messages like these:

[image]

DEBUG:facebook_scraper.page_iterators:Got 0 raw posts from page
DEBUG:facebook_scraper.facebook_scraper:Extracting posts from page 0
DEBUG:facebook_scraper.page_iterators:Looking for next page URL
INFO:facebook_scraper.page_iterators:Page parser did not find next page URL
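
One way to notice this condition from code, rather than by reading the log, might be to attach a handler to the logger named in the messages above. This is only a minimal sketch: the logger name `facebook_scraper.page_iterators` and the message text are taken from the log output quoted above, while `EmptyPageDetector` is a hypothetical helper and not part of facebook-scraper.

```python
import logging


class EmptyPageDetector(logging.Handler):
    """Hypothetical helper: counts log records that report an empty page."""

    def __init__(self):
        super().__init__(level=logging.DEBUG)
        self.empty_pages = 0

    def emit(self, record):
        # Message text copied from the DEBUG output quoted above.
        if "Got 0 raw posts from page" in record.getMessage():
            self.empty_pages += 1


# Logger name taken from the log lines quoted above.
logger = logging.getLogger("facebook_scraper.page_iterators")
logger.setLevel(logging.DEBUG)
detector = EmptyPageDetector()
logger.addHandler(detector)
```

After a scraping run, a non-zero `detector.empty_pages` would indicate that at least one requested page came back without posts, which could be used to stop early or to retry later.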

526319491 commented 9 months ago

What is going on with your start_url? Shouldn't it be "https://mbasic.facebook.com/xxx?v=timeline"? Why is it so long?

AndyHsieh1020 commented 9 months ago

Because my start URL begins somewhere else (not at the beginning of the fan page). I'm not quite sure whether it can be used that way.
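
Since it is unclear whether a saved cursor URL stays usable, one defensive option is to try the saved start URL first and fall back to the beginning of the page when it yields nothing. The following is a minimal sketch under that assumption: `fetch_with_fallback` and its `fetch` parameter are hypothetical names (not part of facebook-scraper), and `fetch` is assumed to be a thin wrapper around `get_posts`.

```python
from typing import Callable, Iterable, List, Optional


def fetch_with_fallback(
    fetch: Callable[[Optional[str]], Iterable[dict]],
    start_url: Optional[str],
) -> List[dict]:
    """Try the saved start_url; if it yields no posts, retry from the start.

    `fetch` is a hypothetical callable that takes a start URL (or None)
    and returns an iterable of post dicts.
    """
    posts = list(fetch(start_url))
    if not posts and start_url is not None:
        # The saved cursor returned nothing, so start over from the first page.
        posts = list(fetch(None))
    return posts
```

With facebook-scraper, `fetch` could be something like `lambda url: get_posts('TWCDC', start_url=url, pages=50)`. Restarting from the first page may return posts that were already stored, so duplicates would need to be handled separately.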

moda20 commented 6 months ago

@AndyHsieh1020 Do you still have this issue? Were you able to solve it? If not, please reopen this issue.