kevinzg / facebook-scraper

Scrape Facebook public pages without an API key
MIT License

Some post texts are not scraped #323

Open rezemika opened 3 years ago

rezemika commented 3 years ago

Hi! I'm trying to scrape posts from many pages for a research project, and some post texts are not scraped, especially the last ones on a page.

For example, here is the CSV line I get for this post: https://www.facebook.com/action.street.medics.rennes/posts/101587394720324

post_id,text,post_text,shared_text,time,image,image_lowquality,images,images_description,images_lowquality,images_lowquality_description,video,video_duration_seconds,video_height,video_id,video_quality,video_size_MB,video_thumbnail,video_watches,video_width,likes,comments,shares,post_url,link,user_id,username,user_url,is_live,factcheck,shared_post_id,shared_time,shared_user_id,shared_username,shared_post_url,available,comments_full,reactors,w3_fb_url,reactions,reaction_count
101587394720324,,,,2020-01-13 15:02:15,https://scontent-cdg2-1.xx.fbcdn.net/v/t1.6435-9/81869263_101273701418360_3350855878575128576_n.png?_nc_cat=100&ccb=1-3&_nc_sid=8024bb&_nc_ohc=44sJhcepYY4AX8Ddzzb&_nc_ht=scontent-cdg2-1.xx&oh=043b0431843294b8972c909eddacd6f7&oe=60DC2078,https://scontent-cdg2-1.xx.fbcdn.net/v/t1.6435-0/p320x320/81869263_101273701418360_3350855878575128576_n.png?_nc_cat=100&ccb=1-3&_nc_sid=8024bb&_nc_ohc=44sJhcepYY4AX8Ddzzb&_nc_ht=scontent-cdg2-1.xx&tp=30&oh=fae30daaf6e032bb36e19123b7633622&oe=60DD7A77,['https://scontent-cdg2-1.xx.fbcdn.net/v/t1.6435-9/81869263_101273701418360_3350855878575128576_n.png?_nc_cat=100&ccb=1-3&_nc_sid=8024bb&_nc_ohc=44sJhcepYY4AX8Ddzzb&_nc_ht=scontent-cdg2-1.xx&oh=043b0431843294b8972c909eddacd6f7&oe=60DC2078'],['Aucune description de photo disponible.'],['https://scontent-cdg2-1.xx.fbcdn.net/v/t1.6435-0/p320x320/81869263_101273701418360_3350855878575128576_n.png?_nc_cat=100&ccb=1-3&_nc_sid=8024bb&_nc_ohc=44sJhcepYY4AX8Ddzzb&_nc_ht=scontent-cdg2-1.xx&tp=30&oh=fae30daaf6e032bb36e19123b7633622&oe=60DD7A77'],['Aucune description de photo disponible.'],,,,,,,,,,0,2,0,https://facebook.com/action.street.medics.rennes/posts/101587394720324,,100391344839929,Street Medics Rennes,https://facebook.com/action.street.medics.rennes/?__tn__=C-R,False,,,,,,,True,,,,,

As another example, here is the result for this post: https://facebook.com/actionmedic44/posts/1339111329559911

post_id,text,post_text,shared_text,time,image,image_lowquality,images,images_description,images_lowquality,images_lowquality_description,video,video_duration_seconds,video_height,video_id,video_quality,video_size_MB,video_thumbnail,video_watches,video_width,likes,comments,shares,post_url,link,user_id,username,user_url,is_live,factcheck,shared_post_id,shared_time,shared_user_id,shared_username,shared_post_url,available,comments_full,reactors,w3_fb_url,reactions,reaction_count
1339111329559911,,,,2019-02-23 19:52:03,,,[],[],[],[],,,,,,,,,,0,2,0,https://facebook.com/actionmedic44/posts/1339111329559911,,1338471962957181,Action Medic 44 - Street-medics et secouristes,https://facebook.com/actionmedic44/?__tn__=C-R,False,,,,,,,True,,,,,

I'm trying to understand how the parsing is handled (I haven't done Python for a while), so don't hesitate to ask if I can help!
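
To see how many posts are affected, the rows with an empty text column can be listed from the exported CSV. This is a minimal sketch, assuming the file has the `post_id`, `text` and `post_url` columns shown in the header lines above; the function name `posts_missing_text` is hypothetical and not part of the scraper.

```python
import csv
from typing import Dict, List


def posts_missing_text(csv_path: str) -> List[Dict[str, str]]:
    """Return post_id and post_url for every row whose text column is empty."""
    missing = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # An empty or whitespace-only "text" cell means the text was not scraped.
            if not (row.get("text") or "").strip():
                missing.append(
                    {"post_id": row["post_id"], "post_url": row["post_url"]}
                )
    return missing
```

Calling `posts_missing_text("posts.csv")` (the file name is a placeholder) would return one entry per affected post, which makes it easier to check whether the missing texts cluster at the end of a page.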

neon-ninja commented 3 years ago

In order to get the full text for these longer posts, the scraper needs to "click" on the post, which in some cases requires a login, even for public posts. Are you passing cookies or credentials?

rezemika commented 3 years ago

Oh, I didn't know that, I'm sorry! So I think it's okay, but strangely I get a LoginError when I try with credentials, so I can't confirm it. I think I can close the issue now...

neon-ninja commented 3 years ago

Try passing cookies instead of credentials.

mujeebcpy commented 3 years ago

I have the same issue. I'm passing cookies and I don't get a login error. I'm trying to access content from the Msonepage Facebook page, but the result is msonefb.

neon-ninja commented 3 years ago

This is working fine for me. The code

from facebook_scraper import get_posts

for post in get_posts("Msonepage", pages=4):
    print(post["post_id"], len(post["text"]), post["time"])

outputs

4291168727602719 1621 2021-06-19 18:08:23
4289245764461682 2076 2021-06-19 00:56:06
4288623407857251 1287 2021-06-18 19:33:07
4286718564714402 1445 2021-06-18 01:57:01
4286573484728910 1612 2021-06-18 00:57:27
4286358808083711 1089 2021-06-17 23:19:30
4286150838104508 1406 2021-06-17 21:31:05
4285963538123238 1704 2021-06-17 19:53:26
4285766488142943 1702 2021-06-17 17:50:41
4282950005091258 1083 2021-06-16 16:51:45
4281111508608441 1260 2021-06-16 00:47:32
4280365558683036 1184 2021-06-15 17:36:44
4278535115532747 1324 2021-06-15 00:54:21
4277809098938682 947 2021-06-14 18:03:47

Do you get any locale warnings when you run the scraper? Please try enabling logging (if you're using the CLI, with the -v / --verbose argument) and post the logs.
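
When the scraper is used from Python rather than the CLI, debug output can be turned on with the standard `logging` module. This is a minimal sketch, assuming the library's loggers are named after its package (`facebook_scraper`); the helper `enable_scraper_debug_logging` is hypothetical and not part of the library.

```python
import logging
import sys


def enable_scraper_debug_logging(
    logger_name: str = "facebook_scraper",  # assumed logger name (package name)
) -> logging.Logger:
    """Attach a stderr handler at DEBUG level to the given logger and return it."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    # Avoid stacking duplicate handlers if this is called more than once.
    if not any(getattr(h, "_scraper_debug", False) for h in logger.handlers):
        handler = logging.StreamHandler(sys.stderr)
        handler.setLevel(logging.DEBUG)
        handler.setFormatter(
            logging.Formatter("%(name)s %(levelname)s %(message)s")
        )
        handler._scraper_debug = True  # marker used by the duplicate check above
        logger.addHandler(handler)
    return logger
```

Calling it once before `get_posts(...)` sends debug messages to stderr, so they can be copied into the issue.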

mujeebcpy commented 3 years ago

It is weird: it now works without any problems, but I only get one response, whereas before I was getting 4 posts. How can I increase the number of posts?

facebook-scraper --verbose --filename mu_page_posts.csv --pages 20 Msonepage -c cookies.txt


  warnings.warn(f"Locale detected as {locale} - for best results, set to en_US")
[4291918264194432] Exception while running extract_text: AttributeError("'NoneType' object has no attribute 'find'")
[4291918264194432] Extract method extract_link didn't return anything
[4291918264194432] Extract method extract_video didn't return anything
[4291918264194432] Extract method extract_video_thumbnail didn't return anything
[4291918264194432] Extract method extract_video_id didn't return anything
[4291918264194432] Extract method extract_video_meta didn't return anything
[4291918264194432] Extract method extract_factcheck didn't return anything
[4291918264194432] Extract method extract_share_information didn't return anything
[4291918264194432] Extract method extract_listing didn't return anything
[4291742574212001] Extract method extract_video didn't return anything
[4291742574212001] Extract method extract_video_thumbnail didn't return anything
[4291742574212001] Extract method extract_video_id didn't return anything
[4291742574212001] Extract method extract_video_meta didn't return anything
[4291742574212001] Extract method extract_factcheck didn't return anything
[4291742574212001] Extract method extract_share_information didn't return anything
[4291742574212001] Extract method extract_listing didn't return anything
[4291474120905513] Extract method extract_video didn't return anything
[4291474120905513] Extract method extract_video_thumbnail didn't return anything
[4291474120905513] Extract method extract_video_id didn't return anything
[4291474120905513] Extract method extract_video_meta didn't return anything
[4291474120905513] Extract method extract_factcheck didn't return anything
[4291474120905513] Extract method extract_share_information didn't return anything
[4291474120905513] Extract method extract_listing didn't return anything

The verbose output is above.

mujeebcpy commented 3 years ago

Sometimes I get this error:

Exception: ReadTimeout(ReadTimeoutError("HTTPSConnectionPool(host='m.facebook.com', port=443): Read timed out. (read timeout=5)"))

Also, I set en_US in the cookie settings using a Firefox plugin, but the locale warning is still showing.
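
Occasional read timeouts like the one above can often be worked around by retrying the request after a short pause. This is a generic sketch, not something the scraper provides: the `retry` helper and its parameters are hypothetical, and which exception type to catch is an assumption, since the log only shows the message.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    action: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 2.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Call action() until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Wait a little longer after each failure (linear backoff).
            time.sleep(delay_seconds * attempt)
    raise ValueError("attempts must be at least 1")
```

With the scraper, the action could be something like `lambda: list(get_posts("Msonepage", pages=4))`, so that the whole fetch is repeated if it times out partway through.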

neon-ninja commented 3 years ago

This commit (https://github.com/kevinzg/facebook-scraper/commit/320d81189e4c6c5023397c93bd02543ea36f1d05) should make it possible to pass a timeout via the CLI. The language is set on your account at https://www.facebook.com/settings?tab=language&section=account&view; the cookie named locale is now ignored by Facebook. You would need to re-export your cookies after changing the language. Do you have a cookie called noscript? This might be causing the problem.
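
To check whether an exported cookie file contains a cookie called noscript, the file can be inspected with Python's standard library. This is a minimal sketch, assuming the export is in Netscape cookies.txt format and starts with the usual `# Netscape HTTP Cookie File` header line; the function names are hypothetical.

```python
from http.cookiejar import MozillaCookieJar
from typing import List


def cookie_names(cookie_file: str) -> List[str]:
    """Return the names of all cookies in a Netscape-format cookies.txt file."""
    jar = MozillaCookieJar()
    # Keep session and expired cookies too, so nothing in the file is hidden.
    jar.load(cookie_file, ignore_discard=True, ignore_expires=True)
    return sorted(cookie.name for cookie in jar)


def has_noscript_cookie(cookie_file: str) -> bool:
    """True if the export contains a cookie named "noscript"."""
    return "noscript" in cookie_names(cookie_file)
```

Running `has_noscript_cookie("cookies.txt")` on the exported file shows whether that cookie is present; if it is, removing it and re-exporting is one thing to try.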