kevinzg / facebook-scraper

Scrape Facebook public pages without an API key
MIT License
2.37k stars 626 forks

How to implement a proxy API with facebook-scraper #337

Closed fashan7 closed 3 years ago

fashan7 commented 3 years ago

Hi @neon-ninja, how can I use a proxy from a third-party provider? Please see https://www.scraperapi.com/documentation/. Can you guide me on how it works?

I am trying it like this, but it gives an error: set_proxy("http://scraperapi:APIKEY@proxy-server.scraperapi.com:8001")

es/urllib3/connection.py", line 182, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fc3fd6017d0>: Failed to establish a new connection: [Errno 60] Operation timed out
fashan7 commented 3 years ago

@neon-ninja Why does this warning appear when I set a proxy?

`/django_facebook/lib/python3.7/site-packages/facebook_scraper/facebook_scraper.py:270: UserWarning: Facebook served mbasic/noscript content unexpectedly on http://ipinfo.io/
  warnings.warn(f"Facebook served mbasic/noscript content unexpectedly on {response.url}")`

neon-ninja commented 3 years ago

Disregard that. The warning is intended to flag a possible error on Facebook, so it doesn't make sense for a request that goes to a non-Facebook URL such as ipinfo.io.

fashan7 commented 3 years ago

@neon-ninja any idea regarding this

fashan7 commented 3 years ago

@neon-ninja Can we set proxies in facebook-scraper like this? session = HTMLSession(browser_args=["--no-sandbox","--proxy-server=http:%s"%self.proxy_ip])

neon-ninja commented 3 years ago
set_proxy("http://lum-customer-hl_f6af7b92-zone-data_center:REDACTED_PASSWORD@zproxy.lum-superproxy.io:22225")

seems to work fine, outputs

Proxy details: {'ip': '178.171.41.23', 'city': 'Bucharest', 'region': 'Bucureşti', 'country': 'RO', 'loc': '44.4323,26.1063', 'org': 'AS9009 M247 Ltd', 'postal': '020011', 'timezone': 'Europe/Bucharest', 'readme': 'https://ipinfo.io/missingauth'}
neon-ninja commented 3 years ago

No worries, I've edited my comment & yours to redact the password
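
Since a proxy password ended up pasted into this thread, one option is to keep credentials out of the script and redact them before sharing any output. A minimal sketch, assuming the credentials live in environment variables: PROXY_USER and PROXY_PASS are hypothetical names, and build_proxy_url and redact are illustrative helpers, not part of facebook-scraper. The host and port are the ones from the first comment.

```python
import os
from urllib.parse import quote, urlsplit


def build_proxy_url(host, port, username, password, scheme="http"):
    """Return scheme://user:pass@host:port with the credentials percent-encoded."""
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"{scheme}://{user}:{pwd}@{host}:{port}"


def redact(proxy_url):
    """Replace the password in a proxy URL with *** so it can be logged or pasted."""
    parts = urlsplit(proxy_url)
    if parts.password is None:
        return proxy_url
    port = f":{parts.port}" if parts.port is not None else ""
    return f"{parts.scheme}://{parts.username}:***@{parts.hostname}{port}"


# PROXY_USER / PROXY_PASS are hypothetical variable names; the fallbacks are
# the placeholders used in the first comment of this thread.
proxy_url = build_proxy_url(
    "proxy-server.scraperapi.com",
    8001,
    os.environ.get("PROXY_USER", "scraperapi"),
    os.environ.get("PROXY_PASS", "APIKEY"),
)
print(redact(proxy_url))
# The unredacted value is what would be passed on: set_proxy(proxy_url)
```

Percent-encoding matters when the password contains characters such as @ or :, which would otherwise be misread as URL delimiters.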

fashan7 commented 3 years ago

@neon-ninja thanks

neon-ninja commented 3 years ago

> @neon-ninja can we set proxies in facebook-scraper session = HTMLSession(browser_args=["--no-sandbox","--proxy-server=http:%s"%self.proxy_ip])

This one doesn't make any sense: facebook-scraper isn't a browser, and it doesn't have a sandbox mode. These would make more sense as arguments to a browser such as Google Chrome.

fashan7 commented 3 years ago

@neon-ninja The thing is this: I want to visit as an anonymous user to collect the header, but even when I set a proxy it redirects to the login page. I used the above credentials in Proxy Bonanza, and there the page loads perfectly, but when I visit via the script it redirects to the login page. I was wondering what's wrong with set_proxy.

neon-ninja commented 3 years ago

There's nothing wrong with set_proxy. Facebook can tell you're using a proxy, probably because hundreds of thousands or possibly even millions of other people/bots/scripts are using those exact IPs to scrape Facebook, so Facebook insists you log in. I can tell the proxy is working because I get a non-English page title ("Facebooka daxil ol.") served from Facebook:

Proxy details: {'ip': '178.171.44.58', 'city': 'Bucharest', 'region': 'Bucureşti', 'country': 'RO', 'loc': '44.4323,26.1063', 'org': 'AS9009 M247 Ltd', 'postal': '020011', 'timezone': 'Europe/Bucharest', 'readme': 'https://ipinfo.io/missingauth'}
Starting to iterate pages
<!DOCTYPE html><html lang="az"><head><title>Facebooka daxil ol. | Facebook</title>

"Facebooka daxil ol" means "Sign in to Facebook" in Azerbaijani (the page is served with lang="az").

I also get locale warnings like this one: UserWarning: Locale detected as km_KH - for best results, set to en_US
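
To see which locale and title a response was served with, the lang attribute and the title tag can be pulled out of the HTML. A rough sketch using only the standard library; page_locale_and_title is a hypothetical helper name, and the sample string is the HTML line from the output above.

```python
import re


def page_locale_and_title(html):
    """Return (lang attribute of <html>, text of <title>), or None where missing."""
    lang = re.search(r'<html[^>]*\blang="([^"]+)"', html)
    title = re.search(r"<title>(.*?)</title>", html, re.S)
    return (
        lang.group(1) if lang else None,
        title.group(1).strip() if title else None,
    )


sample = '<!DOCTYPE html><html lang="az"><head><title>Facebooka daxil ol. | Facebook</title>'
print(page_locale_and_title(sample))
# → ('az', 'Facebooka daxil ol. | Facebook')
```

A login-page title combined with an unexpected locale suggests the request did go through the proxy but hit Facebook's login wall.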

neon-ninja commented 3 years ago

I would recommend scraping through tor

fashan7 commented 3 years ago

> I would recommend scraping through tor

Can you please elaborate on this?

fashan7 commented 3 years ago

@neon-ninja have any idea regarding this

/lib/python3.7/site-packages/requests/adapters.py", line 510, in send
    raise ProxyError(e, request=request)
requests.exceptions.ProxyError: HTTPSConnectionPool(host='m.facebook.com', port=443): Max retries exceeded with url: /182243201819458/posts/ (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 502 Bad Gateway')))
neon-ninja commented 3 years ago

Your proxy doesn't work
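
The "Tunnel connection failed: 502 Bad Gateway" in that traceback is the proxy's own reply to the tunnel request, so the failure happens before Facebook is reached. One way to check an HTTP proxy on its own, before involving facebook-scraper, is sketched below with the standard library. proxy_works is a hypothetical helper; https://api.ipify.org is the test URL used elsewhere in this thread; urllib's ProxyHandler does not handle socks5:// URLs, so this only applies to HTTP proxies.

```python
import http.client
import urllib.request


def proxy_works(proxy_url, test_url="https://api.ipify.org", timeout=10):
    """Return True if test_url can be fetched through the given HTTP proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException) as exc:
        # URLError, refused connections, timeouts and tunnel failures
        # (such as 502 Bad Gateway) all end up here.
        print(f"Proxy check failed: {exc}")
        return False
```

If this returns False for a given proxy URL, the problem lies with the proxy or its credentials rather than with set_proxy.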

neon-ninja commented 3 years ago

> Can you please elaborate on this?

It's relatively easy on Ubuntu:

sudo apt install tor
torsocks curl https://api.ipify.org
fashan7 commented 3 years ago

What is this for? What is its purpose?

neon-ninja commented 3 years ago

https://en.wikipedia.org/wiki/Tor_(network)

fashan7 commented 3 years ago

@neon-ninja So if we need to be anonymous, we need to use Tor, right? Do we still need to set a proxy for the script?

neon-ninja commented 3 years ago

Depends if you use torsocks or not

fashan7 commented 3 years ago

@neon-ninja I'm currently using Ubuntu for the scraping part.

fashan7 commented 3 years ago

> sudo apt install tor
> torsocks curl https://api.ipify.org

sudo apt install torsocks

fashan7 commented 3 years ago

@neon-ninja After installing

sudo apt install tor
sudo apt install torsocks

what do I need to do next?

neon-ninja commented 3 years ago

Make sure the tor service is running. Check it works with torsocks curl https://api.ipify.org. If your script is called script.py, try torsocks python script.py
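
A quick way to confirm the tor service is listening before running the script is to try connecting to its SOCKS port. A small sketch, assuming Tor is on its usual local port 9050 (the port used later in this thread); port_is_open is a hypothetical helper.

```python
import socket


def port_is_open(host="127.0.0.1", port=9050, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if port_is_open():
    print("Tor SOCKS port is accepting connections")
else:
    print("Nothing is listening on 127.0.0.1:9050; is the tor service running?")
```

This only shows that a listener exists on the port; the torsocks curl check above is still the better test that traffic actually leaves through Tor.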

fashan7 commented 3 years ago

@neon-ninja Do I need to set a proxy if I am running torsocks?

neon-ninja commented 3 years ago

No, torsocks intercepts all network traffic made by the wrapped command and routes it through tor. But you can set the proxy instead of using torsocks if you prefer, just google for instructions for configuring tor with Python requests

fashan7 commented 3 years ago

https://sylvaindurand.org/use-tor-with-python/ This may help, right?

fashan7 commented 3 years ago

@neon-ninja Can we pass a user agent in facebook-scraper, the way we do with set_proxy? headers = { 'User-Agent': UserAgent().random }

neon-ninja commented 3 years ago

Yep, looks good. Sure, you can use the set_user_agent function
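
A sketch of how a random user agent could be handed to set_user_agent. The list of user-agent strings and the helper names are illustrative assumptions; a library such as the one behind UserAgent().random could supply the string instead. The facebook_scraper import sits inside the function so the selection logic can be tried without the package installed.

```python
import random

# Example desktop user-agent strings; replace with whatever suits your setup.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0",
]


def pick_user_agent(rng=random):
    """Choose one user-agent string at random."""
    return rng.choice(USER_AGENTS)


def apply_random_user_agent():
    """Pick a user agent and register it with facebook-scraper."""
    from facebook_scraper import set_user_agent  # requires facebook-scraper

    user_agent = pick_user_agent()
    set_user_agent(user_agent)
    return user_agent
```

Calling apply_random_user_agent() once before get_posts keeps the same user agent for the whole run.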

fashan7 commented 3 years ago

@neon-ninja According to the instructions at https://sylvaindurand.org/use-tor-with-python/, I have implemented it like this:

from facebook_scraper import *
from stem import Signal
from stem.control import Controller

proxies = {
    'http': 'socks5://127.0.0.1:9050',
    'https': 'socks5://127.0.0.1:9050'
}
with Controller.from_port(port = 9051) as c:
    c.authenticate()
    c.signal(Signal.NEWNYM)
response = requests.get('https://api.ipify.org', proxies=proxies).text

my_proxy = "http://"+response+":9051"
set_proxy(my_proxy)

But the result is: HTTPSConnectionPool(host='m.facebook.com', port=443): Max retries exceeded with url: /182243201819458/posts/ (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5ca7e219a0>: Failed to establish a new connection: [Errno 111] Connection refused')))

Note: The proxy IP is generated randomly, and I have set it with port 9051: my_proxy = "http://"+response+":9051"

neon-ninja commented 3 years ago

That doesn't make sense. You should use set_proxy("socks5://127.0.0.1:9050") if you're not using torsocks
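
The script above fails because the address returned by api.ipify.org is the Tor exit node's public IP, which is not a proxy you can connect to, and 9051 is Tor's control port rather than its SOCKS port. A corrected sketch under the same assumptions as that script (Tor on 127.0.0.1, SOCKS port 9050, control port 9051, the stem package installed); the helper names are hypothetical, and the third-party imports sit inside the functions.

```python
TOR_SOCKS_PORT = 9050    # SOCKS listener: this is what carries the scraper's traffic
TOR_CONTROL_PORT = 9051  # control port: only accepts commands such as NEWNYM


def tor_proxy_url(host="127.0.0.1", port=TOR_SOCKS_PORT):
    """Build the proxy URL that points at the local Tor SOCKS listener."""
    return f"socks5://{host}:{port}"


def request_new_circuit(control_port=TOR_CONTROL_PORT):
    """Ask Tor for a fresh circuit (needs stem and an enabled ControlPort)."""
    from stem import Signal
    from stem.control import Controller

    with Controller.from_port(port=control_port) as controller:
        controller.authenticate()
        controller.signal(Signal.NEWNYM)


def configure_scraper():
    """Point facebook-scraper at the local Tor SOCKS port."""
    from facebook_scraper import set_proxy  # requires facebook-scraper

    set_proxy(tor_proxy_url())
```

The proxy URL stays the same across circuits; only the exit IP seen by the remote site changes after NEWNYM.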

fashan7 commented 3 years ago

@neon-ninja
What if I run it like this: torsocks python script.py

When I use torsocks as above, the database connection fails, since the database is on localhost:

WARNING torsocks[2047360]: [connect] Connection to a local address are denied since it might be a TCP DNS query to a local DNS server. Rejecting it for safety reasons. (in tsocks_connect() at connect.c:193)
Database login error
1623747510 WARNING torsocks[2047360]: [connect] Connection to a local address are denied since it might be a TCP DNS query to a local DNS server. Rejecting it for safety reasons. (in tsocks_connect() at connect.c:193)
fashan7 commented 3 years ago

> You should use set_proxy("socks5://127.0.0.1:9050") if you're not using torsocks

@neon-ninja With set_proxy("socks5://127.0.0.1:9050"), the script says: A login (cookies) is required to see this page

fashan7 commented 3 years ago

@neon-ninja any idea

`from facebook_scraper import *
>>> import requests
>>> 
>>> set_proxy("socks5://127.0.0.1:9050")
>>> for post in get_posts("182243201819458", pages=1, timeout=45, options={'reactors': True}):
...     print(post)
... 
sys:1: UserWarning: A low page limit (<=2) might return no results, try increasing the limit
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sanath/django_facebook/lib/python3.8/site-packages/facebook_scraper/facebook_scraper.py", line 447, in _generic_get_posts
    for i, page in zip(counter, iter_pages_fn()):
  File "/home/sanath/django_facebook/lib/python3.8/site-packages/facebook_scraper/page_iterators.py", line 25, in iter_pages
    request_fn(start_url)
  File "/home/sanath/django_facebook/lib/python3.8/site-packages/facebook_scraper/facebook_scraper.py", line 369, in get
    raise exceptions.LoginRequired(
facebook_scraper.exceptions.LoginRequired: A login (cookies) is required to see this page`
fashan7 commented 3 years ago

@neon-ninja For proxy checking, I recommend using http://lumtest.com/myip.json:

ip = self.get("http://lumtest.com/myip.json", headers={"Accept": "application/json"}).json()

fashan7 commented 3 years ago

@neon-ninja Unfortunately, Facebook still asks for a login even when we go anonymous.

The code is in this zip: mypython.py.zip. Run it from the terminal with python mypython.py

Could you run it? Can we get the data without cookies or proxies, using only Tor?

`lib/python3.8/site-packages/facebook_scraper/facebook_scraper.py:324: UserWarning: Locale detected as id_ID - for best results, set to en_US
  warnings.warn(f"Locale detected as {locale} - for best results, set to en_US")
Traceback (most recent call last):
  File "mypython.py", line 33, in <module>
    for post in get_posts("182243201819458", pages=1, timeout=45, options={'reactors': True}):
  File "/home/sanath/django_facebook/lib/python3.8/site-packages/facebook_scraper/facebook_scraper.py", line 449, in _generic_get_posts
    for i, page in zip(counter, iter_pages_fn()):
  File "/home/sanath/django_facebook/lib/python3.8/site-packages/facebook_scraper/page_iterators.py", line 25, in iter_pages
    request_fn(start_url)
  File "/home/sanath/django_facebook/lib/python3.8/site-packages/facebook_scraper/facebook_scraper.py", line 371, in get
    raise exceptions.LoginRequired(
facebook_scraper.exceptions.LoginRequired: A login (cookies) is required to see this page
`
neon-ninja commented 3 years ago

Well, so much for that plan. It seems Facebook can tell when you're connecting from a Tor exit node, and insists you log in. If they say you have to log in, you have to log in.
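
Because the login wall raises facebook_scraper.exceptions.LoginRequired partway through iteration (as in the tracebacks above), a script can keep whatever it collected so far instead of crashing. A sketch of that pattern; collect_until_login_wall is a hypothetical helper that takes the post generator and the exception class as arguments, so it does not depend on the package itself.

```python
def collect_until_login_wall(make_posts, login_error, limit=20):
    """Collect up to `limit` posts; stop cleanly if the login wall is hit.

    Returns (posts, error_message), where error_message is None on success.
    """
    posts = []
    try:
        for post in make_posts():
            posts.append(post)
            if len(posts) >= limit:
                break
    except login_error as exc:
        return posts, str(exc)
    return posts, None


# Intended use with facebook-scraper (not executed here):
#   from facebook_scraper import get_posts, exceptions
#   posts, error = collect_until_login_wall(
#       lambda: get_posts("Nintendo", pages=3), exceptions.LoginRequired
#   )
```

This does not get around the login requirement; it only makes the failure explicit so the caller can decide whether to retry with cookies or a different exit node.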

vitovt commented 3 years ago

If you run from CLI you can use: HTTPS_PROXY="super-proxy.com:3388" bash -c "/usr/local/bin/facebook-scraper --filename filename.csv --pages 10 pagename"
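
The same environment variable can be set from inside a Python script. The requests library, which appears in the tracebacks above, reads proxy settings from the environment by default, so this should have a similar effect to the CLI form, although that is an assumption about how facebook-scraper configures its session. The http:// scheme in front of the example host is also an assumption; the standard-library call below only shows what Python picks up from the environment.

```python
import os
import urllib.request

# Example proxy address taken from the comment above.
os.environ["HTTPS_PROXY"] = "http://super-proxy.com:3388"

# getproxies() returns the proxy settings visible to this process.
print(urllib.request.getproxies())
```

Set the variable before the first request is made; an explicit set_proxy call may take precedence over it.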

neon-ninja commented 3 years ago

I also noticed mbasic seems to enforce login less often than m.facebook.com

neon-ninja commented 3 years ago

Try this code:

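from facebook_scraper import get_posts, set_noscript, set_proxy

set_proxy("socks5://127.0.0.1:9050")  # local Tor SOCKS port
set_noscript(True)  # use the mbasic/noscript version of the site
for post in get_posts("Nintendo"):
    print(post["post_id"], post["time"])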

with the tor service running and the latest master branch.

fashan7 commented 3 years ago

@neon-ninja With the above method, I get only 4 posts, even after increasing the number of pages.

fashan7 commented 3 years ago

@neon-ninja When scraping via mbasic.facebook.com/zanupfparty, the post text is truncated; the full text is missing:

{
   "post_id":"2887921374808137",
   "text":"COVID-19 UPDATE: As at 20 JUNE 2021, ZIMBABWE🇿🇼 has 41 628 confirmed cases, including 37 167 recoveries, 2 795 Active Cases, 293 New Cases and 1 666 deaths (Recording 24 New Recoveries and 10 Deaths in the last 24hrs ) People Vaccinated so far (1st Dose)701 348 and (2nd Dose) 432 572 ...",
   "post_text":"COVID-19 UPDATE: As at 20 JUNE 2021, ZIMBABWE🇿🇼 has 41 628 confirmed cases, including 37 167 recoveries, 2 795 Active Cases, 293 New Cases and 1 666 deaths (Recording 24 New Recoveries and 10 Deaths in the last 24hrs ) People Vaccinated so far (1st Dose)701 348 and (2nd Dose) 432 572 ...",
   "shared_text":"None",
   "time":datetime.datetime(2021, 6, 20, 9, 24, 57),
   "image":"None",
   "image_lowquality":"None",
   "images":[],
   "images_description":[],
   "images_lowquality":[],
   "images_lowquality_description":[],
   "video":"None",
   "video_duration_seconds":"None",
   "video_height":"None",
   "video_id":"None",
   "video_quality":"None",
   "video_size_MB":"None",
   "video_thumbnail":"None",
   "video_watches":"None",
   "video_width":"None",
   "likes":"None",
   "comments":"None",
   "shares":0,
   "post_url":"https://facebook.com/zanupfparty/posts/2887921374808137",
   "link":"None",
   "user_id":"1732816076985345",
   "username":"ZANU PF Party",
   "user_url":"https://facebook.com/zanupfparty/?refid=17&_ft_=mf_story_key.2887921374808137%3Atop_level_post_id.2887921374808137%3Atl_objid.2887921374808137%3Acontent_owner_id_new.1732816076985345%3Athrowback_story_fbid.2887921374808137%3Apage_id.1732816076985345%3Astory_location.4%3Astory_attachment_style.photo%3Atds_flgs.3%3Aott.AX95RoqFVAssAwXl%3Apage_insights.%7B%221732816076985345%22%3A%7B%22page_id%22%3A1732816076985345%2C%22page_id_type%22%3A%22page%22%2C%22actor_id%22%3A1732816076985345%2C%22dm%22%3A%7B%22isShare%22%3A0%2C%22originalPostOwnerID%22%3A0%7D%2C%22psn%22%3A%22EntStatusCreationStory%22%2C%22post_context%22%3A%7B%22object_fbtype%22%3A266%2C%22publish_time%22%3A1624177497%2C%22story_name%22%3A%22EntStatusCreationStory%22%2C%22story_fbid%22%3A%5B2887921374808137%5D%7D%2C%22role%22%3A1%2C%22sl%22%3A4%2C%22targets%22%3A%5B%7B%22actor_id%22%3A1732816076985345%2C%22page_id%22%3A1732816076985345%2C%22post_id%22%3A2887921374808137%2C%22role%22%3A1%2C%22share_id%22%3A0%7D%5D%7D%7D%3Athid.1732816076985345%3A306061129499414%3A2%3A0%3A1625122799%3A1179863955266017432&__tn__=C-R",
   "is_live":false,
   "factcheck":"None",
   "shared_post_id":"None",
   "shared_time":"None",
   "shared_user_id":"None",
   "shared_username":"None",
   "shared_post_url":"None",
   "available":true,
   "comments_full":"None",
   "reactors":"None",
   "w3_fb_url":"None",
   "reactions":"None",
   "reaction_count":"None",
   "image_id":"None",
   "image_ids":[

   ]
}
neon-ninja commented 3 years ago

I agree unauthenticated mbasic seems quite limited pagination-wise, but perhaps it might be useful for filling in information about posts given a list of post IDs. https://github.com/kevinzg/facebook-scraper/commit/4ad8d9a7b8f36c6d9d10e87bfa2b1a8e32a1faa2 should fix the issue with images_description, if you set a non-Android user agent. Like so:

from pprint import pprint

from facebook_scraper import get_posts, set_noscript, set_proxy, set_user_agent

set_proxy("socks5://127.0.0.1:9050")
set_noscript(True)
set_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36")
for post in get_posts(post_urls=[4230186287065792]):
    pprint(post)
fashan7 commented 3 years ago

@neon-ninja with the latest update

For the post below, post_text has limited content. images is an empty list even though the post has an image. comments is None; I see that we cannot get the comment count from mbasic, so can we set it to 0 rather than None?

`{
   "post_id":"2887921374808137",
   "text":"ZANU PF Party\n\nCOVID-19 UPDATE: As at 20 JUNE 2021, ZIMBABWE🇿🇼 has 41 628 confirmed cases, including 37 167 recoveries, 2 795 Active Cases, 293 New Cases and 1 666 deaths (Recording 24 New Recoveries and 10 Deaths in the last 24hrs ) People Vaccinated so far (1st Dose)701 348 and (2nd Dose) 432 572 ...",
   "post_text":"",
   "shared_text":"ZANU PF Party\n\nCOVID-19 UPDATE: As at 20 JUNE 2021, ZIMBABWE🇿🇼 has 41 628 confirmed cases, including 37 167 recoveries, 2 795 Active Cases, 293 New Cases and 1 666 deaths (Recording 24 New Recoveries and 10 Deaths in the last 24hrs ) People Vaccinated so far (1st Dose)701 348 and (2nd Dose) 432 572 ...",
   "time":datetime.datetime(2021, 6, 20, 9, 24, 57),
   "image":"None",
   "image_lowquality":"None",
   "images":[],
   "images_description":[],
   "images_lowquality":[],
   "images_lowquality_description":[],
   "video":"None",
   "video_duration_seconds":"None",
   "video_height":"None",
   "video_id":"None",
   "video_quality":"None",
   "video_size_MB":"None",
   "video_thumbnail":"None",
   "video_watches":"None",
   "video_width":"None",
   "likes":0,
   "comments":"None",
   "shares":0,
   "post_url":"https://facebook.com/zanupfparty/posts/2887921374808137",
   "link":"None",
   "user_id":"1732816076985345",
   "username":"ZANU PF Party",
   "user_url":"https://facebook.com/zanupfparty/?refid=17&_ft_=mf_story_key.2887921374808137%3Atop_level_post_id.2887921374808137%3Atl_objid.2887921374808137%3Acontent_owner_id_new.1732816076985345%3Athrowback_story_fbid.2887921374808137%3Apage_id.1732816076985345%3Astory_location.4%3Astory_attachment_style.photo%3Atds_flgs.3%3Aott.AX95RoqFVAssAwXl%3Apage_insights.%7B%221732816076985345%22%3A%7B%22page_id%22%3A1732816076985345%2C%22page_id_type%22%3A%22page%22%2C%22actor_id%22%3A1732816076985345%2C%22dm%22%3A%7B%22isShare%22%3A0%2C%22originalPostOwnerID%22%3A0%7D%2C%22psn%22%3A%22EntStatusCreationStory%22%2C%22post_context%22%3A%7B%22object_fbtype%22%3A266%2C%22publish_time%22%3A1624177497%2C%22story_name%22%3A%22EntStatusCreationStory%22%2C%22story_fbid%22%3A%5B2887921374808137%5D%7D%2C%22role%22%3A1%2C%22sl%22%3A4%2C%22targets%22%3A%5B%7B%22actor_id%22%3A1732816076985345%2C%22page_id%22%3A1732816076985345%2C%22post_id%22%3A2887921374808137%2C%22role%22%3A1%2C%22share_id%22%3A0%7D%5D%7D%7D%3Athid.1732816076985345%3A306061129499414%3A2%3A0%3A1625122799%3A1179863955266017432&__tn__=C-R",
   "is_live":false,
   "factcheck":"None",
   "shared_post_id":"None",
   "shared_time":"None",
   "shared_user_id":"None",
   "shared_username":"None",
   "shared_post_url":"None",
   "available":true,
   "comments_full":"None",
   "reactors":"None",
   "w3_fb_url":"None",
   "reactions":"None",
   "reaction_count":"None",
   "image_id":"None",
   "image_ids":[

   ]
}`
neon-ninja commented 3 years ago

https://github.com/kevinzg/facebook-scraper/commit/831e4600da6e0f21fedf9fd9357db35e940deab1 should fix the comments=None issue, and https://github.com/kevinzg/facebook-scraper/commit/8709c35776c242fb728ec41d9838b34ae075a6e8 should make it possible to extract comments with mbasic. Depending on your user-agent, it might not be possible to click the post to get the full text without login. https://github.com/kevinzg/facebook-scraper/commit/e3cfe8b850238db0c7e724e8254020b2cfb2578b should fix the issue with missing low quality image links.
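
For scraper versions without that fix, missing counts can also be normalized on the caller's side. A minimal sketch; normalize_counts and the field list are illustrative choices, and the "None" string is handled because that is how the values appear in the dumps above.

```python
COUNT_FIELDS = ("likes", "comments", "shares")


def normalize_counts(post, fields=COUNT_FIELDS):
    """Return a copy of post with missing count fields replaced by 0."""
    cleaned = dict(post)
    for field in fields:
        # Treat both a real None and the string "None" as "no value".
        if cleaned.get(field) in (None, "None"):
            cleaned[field] = 0
    return cleaned


example = {"post_id": "2887921374808137", "likes": 0, "comments": "None", "shares": 0}
print(normalize_counts(example)["comments"])
# → 0
```

Note that replacing None with 0 hides the difference between "no comments" and "count unavailable", which may matter for later analysis.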