kevinzg / facebook-scraper

Scrape Facebook public pages without an API key
MIT License
2.44k stars, 633 forks

Facebook Scraping [doubt] #276

Closed fashan7 closed 3 years ago

fashan7 commented 3 years ago

Hi @neon-ninja, I passed Netscape-format cookies and ran facebook-scraper from the console. It returned good JSON for this post: http://facebook.com/1610364829101773. When I ran it again about an hour later, it didn't return reactions:

`"reactors":"None",
      "w3_fb_url":"None",
      "reactions":"None",
      "reaction_count":"None"`

Even the commenter URL is None.

I think something is going wrong during scraping.

neon-ninja commented 3 years ago

What steps did you take to cause this problem? Were you scraping heavily during that hour? Perhaps at this sort of scale, you should keep records of which posts failed to extract reactions, and come back to backfill them later with get_posts(post_urls=[..]), after whatever temporary block has worn off?
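
A minimal sketch of that record-and-backfill idea. The helper name `split_by_reactions` and the injected `fetch_post` callable are hypothetical, not part of facebook-scraper; only `get_posts(post_urls=[..])` and the `reactions` option come from this thread.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def split_by_reactions(
    post_ids: Iterable[str],
    fetch_post: Callable[[str], Optional[Dict]],
) -> Tuple[List[Dict], List[str]]:
    """Return (posts that came back with reactions, ids to retry later)."""
    complete: List[Dict] = []
    retry_later: List[str] = []
    for post_id in post_ids:
        post = fetch_post(post_id)
        # Treat both a real None and the string "None" (as seen in the
        # console JSON above) as "reactions were not extracted".
        if post and post.get("reactions") not in (None, "None"):
            complete.append(post)
        else:
            retry_later.append(post_id)
    return complete, retry_later


# Assumed wiring to the real library (not executed here):
#   from facebook_scraper import get_posts
#   def fetch_post(post_id):
#       posts = get_posts(post_urls=[post_id], cookies="cookies.txt",
#                         options={"reactions": True})
#       return next(iter(posts), None)
```

Persist `retry_later` somewhere (a text file is enough) and feed it back in once the temporary block has worn off.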

fashan7 commented 3 years ago

Basically, when we pass a cookie, the scraper logs in to FB and scrapes the post. So when the proxy is slow, the page takes a while to render completely. I think the scraper reads the page before it has finished rendering, and so returns a bad result. @neon-ninja, am I correct?

Couldn't we wait until the page/post has loaded 100% and then extract all the data for that post?

neon-ninja commented 3 years ago

That's not how it works - requests.get is a blocking operation - it doesn't return until the entire request is complete. It's all or nothing. If it's slow you might hit a timeout error though - requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='m.facebook.com', port=443): Read timed out. (read timeout=0.1)
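
To illustrate the "all or nothing" point, here is a small self-contained sketch. It is an assumption-laden stand-in: it uses the standard library's `urllib` instead of `requests`, and a throwaway local server (`SlowHandler`, a hypothetical name) instead of Facebook. A blocking fetch either returns the complete body or fails with a timeout; it never hands back a half-loaded page.

```python
import http.server
import threading
import time
import urllib.request


class SlowHandler(http.server.BaseHTTPRequestHandler):
    """Waits before answering, imitating a slow proxy or server."""

    def do_GET(self):
        time.sleep(0.5)
        body = b"<html>complete page</html>"
        try:
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        except OSError:
            pass  # the client already gave up

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch(url, timeout):
    """Return the full body, or None if the request failed or timed out."""
    # An empty ProxyHandler ignores any proxy set in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.read()
    except OSError:  # covers socket timeouts and urllib.error.URLError
        return None


server = http.server.HTTPServer(("127.0.0.1", 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

patient = fetch(url, timeout=5)      # waits for the whole response
impatient = fetch(url, timeout=0.1)  # times out, gets nothing at all
server.shutdown()
```

With `requests`, the timed-out case would surface as the `ReadTimeout` exception quoted above rather than `None`.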

fashan7 commented 3 years ago

If that's the case, why is it not returning the commenter URL, and why are these all None? "reactors":"None", "w3_fb_url":"None", "reactions":"None", "reaction_count":"None"

http://facebook.com/1610364829101773

Note: btw, I'm using a private proxy.

fashan7 commented 3 years ago

@neon-ninja if you need a private proxy, I can share mine with you, along with the cookies.

neon-ninja commented 3 years ago

Sure, post your proxy details. This works fine for me with my private proxy btw:

from facebook_scraper import *
import logging
enable_logging(logging.DEBUG)
set_proxy("squid.auckland.ac.nz:3128")
for post in get_posts(post_urls=[1610364829101773], cookies="cookies.txt", options={"reactions": True}):
    print(post.get("reactions"))

output:

Proxy details: {'ip': '130.216.156.173', 'ip_decimal': 2195233965, 'country': 'New Zealand', 'country_iso': 'NZ', 'country_eu': False, 'latitude': -41, 'longitude': 174, 'time_zone': 'Pacific/Auckland', 'asn': 'AS9431', 'asn_org': 'The University of Auckland', 'hostname': 'squidproxy-f5vip.auckland.ac.nz', 'user_agent': {'product': 'Mozilla', 'version': '5.0', 'comment': '(Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36', 'raw_value': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36'}}
Requesting page from: https://m.facebook.com/1610364829101773
Fetching https://m.facebook.com/pgo.gov.ua/photos/pcb.137544304427389/137543791094107/?type=3&source=48&refid=52&__tn__=EHH-R
Fetching https://m.facebook.com/pgo.gov.ua/photos/pcb.137544304427389/137543827760770/?type=3&source=48&refid=52&__tn__=EHH-R
Fetching https://m.facebook.com/pgo.gov.ua/photos/pcb.137544304427389/137543917760761/?type=3&source=48&refid=52&__tn__=EHH-R
Fetching https://m.facebook.com/pgo.gov.ua/photos/pcb.137544304427389/137544181094068/?type=3&source=48&refid=52&__tn__=EHH-R
Fetching https://m.facebook.com/story.php?story_fbid=1610364829101773&id=365331280271807
[1610364829101773] Extract method extract_video_meta didn't return anything
[1610364829101773] Extract method extract_factcheck didn't return anything
1610364829101773 is a share of 137544304427389
data-ft attribute not found
{'like': 115, 'love': 4, 'haha': 13, 'wow': 7, 'care': 1, 'angry': 1}

fashan7 commented 3 years ago

@neon-ninja please check your email.

neon-ninja commented 3 years ago

The proxy is fine, the problem is your cookies:

from facebook_scraper import _scraper
from facebook_scraper import *
for file in ["top.txt", "produc_cookies.json", "newjson.json", "cookies.txt", "cookies.json"]:
    set_cookies(file)
    print(file, _scraper.is_logged_in())

returns

top.txt False
produc_cookies.json True
newjson.json False
cookies.txt True
cookies.json True

fashan7 commented 3 years ago

I see, @neon-ninja. But I exported the Netscape cookies directly using this add-on: https://addons.mozilla.org/en-US/firefox/addon/cookie-quick-manager/

neon-ninja commented 3 years ago

Did you log out of Facebook in that session before or after exporting cookies?

fashan7 commented 3 years ago

Is there any way to export working cookies? Here is what I did: before logging in to Facebook, I cleared my history, then I logged in and exported the cookies.

neon-ninja commented 3 years ago

The original cookies you sent me (with filename produc_cookies.json) are still valid, why not just use those?

fashan7 commented 3 years ago

from facebook_scraper import _scraper
from facebook_scraper import *
for file in ["top.txt", "produc_cookies.json", "newjson.json", "cookies.txt", "cookies.json"]:
    set_cookies(file)
    print(file, _scraper.is_logged_in())

This code is really helpful.

neon-ninja commented 3 years ago

This commit (https://github.com/kevinzg/facebook-scraper/commit/9af15d86c26b76357e2f72198a955ac59631a558) will make it so that the scraper throws an exception if you pass invalid cookies

fashan7 commented 3 years ago

@neon-ninja how can I keep the cookies from expiring so quickly?

fashan7 commented 3 years ago

@neon-ninja can you share the HTML you get when logging in with my cookies? Could you send it to me by email, please?

neon-ninja commented 3 years ago

They don't expire quickly, the expiry is like 1 year away. produc_cookies.json is still valid, what did you do differently with those compared to say, top.txt?
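
A rough way to check the expiry dates yourself, using only the standard library. This sketch handles Netscape-format files only (not the JSON exports), the helper name `session_cookie_report` is hypothetical, and `c_user` / `xs` are used as example cookie names. It only reads what the file claims: a cookie that looks valid here can still be rejected by the server.

```python
import http.cookiejar
import time


def session_cookie_report(path, names=("c_user", "xs"), now=None):
    """Map each named cookie to 'missing', 'expired' or 'present'."""
    now = time.time() if now is None else now
    jar = http.cookiejar.MozillaCookieJar(path)
    # Load everything, including session and already-expired cookies,
    # so that expired entries can be reported instead of silently dropped.
    jar.load(ignore_discard=True, ignore_expires=True)
    found = {cookie.name: cookie for cookie in jar}

    report = {}
    for name in names:
        cookie = found.get(name)
        if cookie is None:
            report[name] = "missing"
        elif cookie.is_expired(now):
            report[name] = "expired"
        else:
            report[name] = "present"
    return report
```

For example, `session_cookie_report("top.txt")` would show whether the file even contains unexpired session cookies before blaming the scraper.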

fashan7 commented 3 years ago

@neon-ninja top.txt is from an account that has 2FA enabled.

neon-ninja commented 3 years ago

Maybe that's the problem?

fashan7 commented 3 years ago

If we figure this out, we're good. We could do that by looking at the HTML response that is preventing the login. Could you share that file with me, @neon-ninja?

neon-ninja commented 3 years ago

Why do you need me to extract html for you when you can just as easily do it yourself?

fashan7 commented 3 years ago

OK, where should I put a debug print, @neon-ninja?

neon-ninja commented 3 years ago

I'm not sure I understand - what do you want the HTML for? Of the 3 cookie files you've sent me, which are you referring to? Assuming you're referring to top.txt, it's just the standard facebook login page. Basically, you send the cookie to the facebook server, the server replies to tell your browser (or in this case, Python) to trash those cookies, as they're not valid, and sends you the login page HTML

fashan7 commented 3 years ago

I have removed 2FA from this account, and I will send you the details by email. Please check, @neon-ninja.

neon-ninja commented 3 years ago

What do you get when you run those cookies yourself?

fashan7 commented 3 years ago

`>>> for file in ["cookies3.txt"]:
...     set_cookies(file)
...     print(file, _scraper.is_logged_in())
... 
cookies3.txt False`

@neon-ninja

neon-ninja commented 3 years ago

Then that doesn't seem to have helped. Maybe Facebook have somehow flagged your account, such that any time you try to connect from a new IP, you're forced to log in again.

fashan7 commented 3 years ago

@neon-ninja Hi, is it possible to pass the username and password of a 2FA-enabled account? When logging in, Facebook will ask for the 2FA code, and we could pass in the code obtained from the authentication API/app when it is required.

neon-ninja commented 3 years ago

Also note that you don't technically need facebook-scraper to observe this behaviour, you can just use curl like so:

curl --silent --head --cookie cookies3.txt https://facebook.com/settings|grep cookie
set-cookie: c_user=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=-1621542934; path=/; domain=.facebook.com
set-cookie: spin=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=-1621542934; path=/; domain=.facebook.com; httponly
set-cookie: xs=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=-1621542934; path=/; domain=.facebook.com; httponly

See how facebook says to delete those invalid cookies?
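
The same check can be scripted. A minimal sketch, assuming you already have the response's `set-cookie` header lines as strings (for example from the curl output above); the function name `deleted_cookies` is hypothetical.

```python
def deleted_cookies(set_cookie_headers):
    """Return the names of cookies the server asked the client to delete."""
    names = []
    for header in set_cookie_headers:
        # Accept lines with or without the leading "set-cookie:" field name.
        if header.lower().startswith("set-cookie:"):
            header = header.split(":", 1)[1]
        # The first "name=value" pair precedes attributes such as expires/path.
        first_pair = header.split(";", 1)[0]
        name, _, value = first_pair.partition("=")
        if value.strip() == "deleted":
            names.append(name.strip())
    return names
```

If the result includes session cookies such as `c_user` or `xs`, the server has rejected them, and re-exporting from a fresh login is the only fix.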

neon-ninja commented 3 years ago

https://github.com/kevinzg/facebook-scraper/commit/18d9d539cb8fc95ac527027f81289071a0423b31 this commit should make it possible to enter your 2FA token on the command line