Closed fashan7 closed 3 years ago
@neon-ninja why does this warning appear when I set a proxy?
`/django_facebook/lib/python3.7/site-packages/facebook_scraper/facebook_scraper.py:270: UserWarning: Facebook served mbasic/noscript content unexpectedly on http://ipinfo.io/
warnings.warn(f"Facebook served mbasic/noscript content unexpectedly on {response.url}")`
Disregard that warning. It's intended to flag a possible error on Facebook, so it doesn't make sense for a URL that isn't on Facebook, such as ipinfo.io.
Hi @neon-ninja, how can I set up a proxy from a third-party proxy provider? Please take a look at https://www.scraperapi.com/documentation/. Can you guide me on how it works?
I am trying it like this, but it gives an error:
set_proxy("http://scraperapi:APIKEY@proxy-server.scraperapi.com:8001")
es/urllib3/connection.py", line 182, in _new_conn self, "Failed to establish a new connection: %s" % e urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fc3fd6017d0>: Failed to establish a new connection: [Errno 60] Operation timed out
@neon-ninja any idea about this?
@neon-ninja can we set proxies in facebook-scraper like this?
session = HTMLSession(browser_args=["--no-sandbox","--proxy-server=http:%s"%self.proxy_ip])
set_proxy("http://lum-customer-hl_f6af7b92-zone-data_center:REDACTED_PASSWORD@zproxy.lum-superproxy.io:22225")
seems to work fine. It outputs:
Proxy details: {'ip': '178.171.41.23', 'city': 'Bucharest', 'region': 'Bucureşti', 'country': 'RO', 'loc': '44.4323,26.1063', 'org': 'AS9009 M247 Ltd', 'postal': '020011', 'timezone': 'Europe/Bucharest', 'readme': 'https://ipinfo.io/missingauth'}
No worries, I've edited my comment & yours to redact the password
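Since a proxy password ended up in this thread, here is a minimal sketch of masking the password in a proxy URL before pasting it into an issue or a log. The helper name `redact_proxy_url` and the placeholder text are made up for illustration; it only uses the standard library and assumes a plain hostname (no bracketed IPv6 address).

```python
from urllib.parse import urlsplit, urlunsplit


def redact_proxy_url(proxy_url: str, placeholder: str = "REDACTED_PASSWORD") -> str:
    """Return proxy_url with the password (if any) replaced by a placeholder."""
    parts = urlsplit(proxy_url)
    if parts.password is None:
        # Nothing to hide, e.g. socks5://127.0.0.1:9050
        return proxy_url
    # Rebuild the host[:port] part without the credentials.
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    netloc = f"{parts.username}:{placeholder}@{host}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(redact_proxy_url("http://user:secret@proxy.example.com:22225"))
```

The real URL still goes to set_proxy unchanged; only the copy that gets printed or posted is passed through the helper.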
@neon-ninja thanks
The HTMLSession(browser_args=["--no-sandbox", "--proxy-server=..."]) snippet doesn't make any sense here: facebook-scraper isn't a browser, and it doesn't have a sandbox mode. Those would make more sense as arguments to the likes of Google Chrome.
@neon-ninja here is the situation. I want to visit as an anonymous user to collect the header, but even when I set a proxy, it redirects to the login page. So I took the credentials above and set them in Proxy Bonanza, and with that I can visit the page fine, but when I visit via the script it redirects to the login page. I was wondering what's wrong with set_proxy.
There's nothing wrong with set_proxy. Facebook can tell you're using a proxy, probably because hundreds of thousands or possibly even millions of other people/bots/scripts are using those exact IPs to scrape Facebook. So Facebook insists you log in.
I can tell the proxy is working because Facebook serves me a localized, non-English page title ("Facebooka daxil ol."):
Proxy details: {'ip': '178.171.44.58', 'city': 'Bucharest', 'region': 'Bucureşti', 'country': 'RO', 'loc': '44.4323,26.1063', 'org': 'AS9009 M247 Ltd', 'postal': '020011', 'timezone': 'Europe/Bucharest', 'readme': 'https://ipinfo.io/missingauth'}
Starting to iterate pages
<!DOCTYPE html><html lang="az"><head><title>Facebooka daxil ol. | Facebook</title>
"Facebooka daxil ol" means "Log in to Facebook" in Azerbaijani (note the lang="az" attribute on the page).
I also get locale warnings like this one: UserWarning: Locale detected as km_KH - for best results, set to en_US
I would recommend scraping through tor
Can you please elaborate on this?
@neon-ninja do you have any idea about this error?
/lib/python3.7/site-packages/requests/adapters.py", line 510, in send
raise ProxyError(e, request=request)
requests.exceptions.ProxyError: HTTPSConnectionPool(host='m.facebook.com', port=443): Max retries exceeded with url: /182243201819458/posts/ (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 502 Bad Gateway')))
Your proxy doesn't work
Scraping through tor is relatively easy in Ubuntu:
sudo apt install tor
torsocks curl https://api.ipify.org
What is this for? What is its purpose?
@neon-ninja so if we need to be anonymous, we need to use tor, right? Do we then need to set a proxy in the script?
Depends if you use torsocks or not
@neon-ninja I'm currently using Ubuntu for the scraping part.
sudo apt install torsocks
@neon-ninja after installing
sudo apt install tor
sudo apt install torsocks
what do I need to do next?
Make sure the tor service is running. Check it works with torsocks curl https://api.ipify.org. If your script is called script.py, try torsocks python script.py
@neon-ninja do I need to set a proxy if I am running the script with torsocks?
No, torsocks intercepts all network traffic made by the wrapped command and routes it through tor. But you can set the proxy instead of using torsocks if you prefer, just google for instructions for configuring tor with Python requests
Would https://sylvaindurand.org/use-tor-with-python/ help with this?
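For the non-torsocks route, a minimal sketch of pointing Python requests at a local tor SOCKS port might look like the following. The helper names `tor_proxies` and `current_exit_ip` are made up for illustration. It assumes tor is listening on its default SOCKS port 9050 and that requests is installed with SOCKS support (`pip install requests[socks]`). It uses the `socks5h` scheme, a variation on the `socks5` URLs used elsewhere in this thread, so that hostnames are resolved by the proxy as well.

```python
def tor_proxies(host: str = "127.0.0.1", socks_port: int = 9050) -> dict:
    """Build a requests-style proxies mapping for a local tor SOCKS port."""
    # socks5h (rather than socks5) asks the proxy to resolve hostnames,
    # so DNS lookups also go through tor.
    url = f"socks5h://{host}:{socks_port}"
    return {"http": url, "https": url}


def current_exit_ip(proxies: dict, timeout: float = 30.0) -> str:
    """Return the public IP that api.ipify.org sees when going through proxies."""
    # Imported here so the helper above works without requests installed.
    import requests

    response = requests.get("https://api.ipify.org", proxies=proxies, timeout=timeout)
    response.raise_for_status()
    return response.text.strip()


if __name__ == "__main__":
    # Needs a running tor service; should print an address that is not your own.
    print(current_exit_ip(tor_proxies()))
```

If the printed address differs from the one you get without the proxies argument, traffic is going through tor.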
@neon-ninja can we pass a user agent to facebook-scraper, the way we pass a proxy with set_proxy?
headers = { 'User-Agent': UserAgent().random }
Yep, looks good. Sure, you can use the set_user_agent function.
@neon-ninja following the instructions at https://sylvaindurand.org/use-tor-with-python/ I have implemented it like this:
import requests
from facebook_scraper import *
from stem import Signal
from stem.control import Controller
proxies = {
'http': 'socks5://127.0.0.1:9050',
'https': 'socks5://127.0.0.1:9050'
}
with Controller.from_port(port = 9051) as c:
    c.authenticate()
    c.signal(Signal.NEWNYM)
response = requests.get('https://api.ipify.org', proxies=proxies).text
my_proxy = "http://"+response+":9051"
set_proxy(my_proxy)
But the result is this
HTTPSConnectionPool(host='m.facebook.com', port=443): Max retries exceeded with url: /182243201819458/posts/ (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5ca7e219a0>: Failed to establish a new connection: [Errno 111] Connection refused')))
Note: the proxy IP is generated randomly each time, and I have set it with port 9051:
my_proxy = "http://"+response+":9051"
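That doesn't make sense. The address ipify returns is the public IP of the tor exit node, not a proxy you can connect to, and 9051 is tor's control port rather than its SOCKS port. You should use set_proxy("socks5://127.0.0.1:9050") if you're not using torsocks.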
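A "Connection refused" error like the one above usually means nothing is listening at the address the proxy URL points to. As a rough pre-flight check, assuming tor's default SOCKS port of 9050 on localhost, you could test whether the port accepts TCP connections before calling set_proxy. The helper name `port_is_open` is made up for illustration and only uses the standard library.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError,
        # timeout) when nothing is accepting connections there.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if port_is_open("127.0.0.1", 9050):
        print("tor SOCKS port is reachable")
    else:
        print("nothing is listening on 127.0.0.1:9050; is the tor service running?")
```

This only confirms that something accepts connections on the port, not that it is tor or that the circuit works, so the torsocks curl check mentioned earlier is still worth running.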
@neon-ninja
What if I run it like this?
torsocks python script.py
When I use torsocks as above, the database connection fails because the database is on localhost:
WARNING torsocks[2047360]: [connect] Connection to a local address are denied since it might be a TCP DNS query to a local DNS server. Rejecting it for safety reasons. (in tsocks_connect() at connect.c:193)
Database login error
1623747510 WARNING torsocks[2047360]: [connect] Connection to a local address are denied since it might be a TCP DNS query to a local DNS server. Rejecting it for safety reasons. (in tsocks_connect() at connect.c:193)
set_proxy("socks5://127.0.0.1:9050")
it seems
@neon-ninja
The script says: A login (cookies) is required to see this page
@neon-ninja any idea?
`from facebook_scraper import *
>>> import requests
>>>
>>> set_proxy("socks5://127.0.0.1:9050")
>>> for post in get_posts("182243201819458", pages=1, timeout=45, options={'reactors': True}):
... print(post)
...
sys:1: UserWarning: A low page limit (<=2) might return no results, try increasing the limit
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/sanath/django_facebook/lib/python3.8/site-packages/facebook_scraper/facebook_scraper.py", line 447, in _generic_get_posts
for i, page in zip(counter, iter_pages_fn()):
File "/home/sanath/django_facebook/lib/python3.8/site-packages/facebook_scraper/page_iterators.py", line 25, in iter_pages
request_fn(start_url)
File "/home/sanath/django_facebook/lib/python3.8/site-packages/facebook_scraper/facebook_scraper.py", line 369, in get
raise exceptions.LoginRequired(
facebook_scraper.exceptions.LoginRequired: A login (cookies) is required to see this page`
@neon-ninja for proxy checking, I recommend using http://lumtest.com/myip.json:
ip = self.get("http://lumtest.com/myip.json", headers={"Accept": "application/json"}).json()
@neon-ninja unfortunately, Facebook still asks for a login even though we went anonymous.
The code is in this zip. Run it in the terminal with python mypython.py
mypython.py.zip
Could you run it? Can we manage to get the data without cookies and proxies, using only tor?
`lib/python3.8/site-packages/facebook_scraper/facebook_scraper.py:324: UserWarning: Locale detected as id_ID - for best results, set to en_US
warnings.warn(f"Locale detected as {locale} - for best results, set to en_US")
Traceback (most recent call last):
File "mypython.py", line 33, in <module>
for post in get_posts("182243201819458", pages=1, timeout=45, options={'reactors': True}):
File "/home/sanath/django_facebook/lib/python3.8/site-packages/facebook_scraper/facebook_scraper.py", line 449, in _generic_get_posts
for i, page in zip(counter, iter_pages_fn()):
File "/home/sanath/django_facebook/lib/python3.8/site-packages/facebook_scraper/page_iterators.py", line 25, in iter_pages
request_fn(start_url)
File "/home/sanath/django_facebook/lib/python3.8/site-packages/facebook_scraper/facebook_scraper.py", line 371, in get
raise exceptions.LoginRequired(
facebook_scraper.exceptions.LoginRequired: A login (cookies) is required to see this page
`
Well, so much for that plan. It seems Facebook can tell when you're connecting from a tor exit node, and insists you log in. If they say you have to log in, you have to log in.
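Because the login wall depends on where the request comes from, one option is to try several configurations in order and move on when one is rejected. The sketch below is generic: `first_working` is a made-up helper, and each attempt is assumed to be a zero-argument function that configures the scraper and returns the collected posts. In a real script the exception to skip on would be facebook_scraper.exceptions.LoginRequired, as seen in the traceback above.

```python
from typing import Callable, Iterable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def first_working(
    attempts: Iterable[Tuple[str, Callable[[], T]]],
    skip_on: Tuple[Type[BaseException], ...],
) -> Optional[Tuple[str, T]]:
    """Run labelled attempts in order; return (label, result) of the first success."""
    for label, attempt in attempts:
        try:
            return label, attempt()
        except skip_on as exc:
            # This configuration hit the wall; report it and try the next one.
            print(f"{label}: skipped ({exc})")
    return None


if __name__ == "__main__":
    class LoginWall(Exception):
        """Stand-in for facebook_scraper.exceptions.LoginRequired."""

    def via_tor():
        raise LoginWall("A login (cookies) is required to see this page")

    def via_other_route():
        return ["post-1", "post-2"]

    print(first_working([("tor", via_tor), ("other", via_other_route)], (LoginWall,)))
```

Note that get_posts returns a generator and the exception is raised while iterating, so an attempt needs to consume it (for example with list(...)) inside the function for the error to be caught here.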
If you run from CLI you can use: HTTPS_PROXY="super-proxy.com:3388" bash -c "/usr/local/bin/facebook-scraper --filename filename.csv --pages 10 pagename"
I also noticed mbasic seems to enforce login less than m.facebook.com
Try this code:
set_proxy("socks5://127.0.0.1:9050")
set_noscript(True)
for post in get_posts("Nintendo"):
    print(post["post_id"], post["time"])
with the tor service running and the latest master branch.
@neon-ninja with the above method, it gives only 4 posts, even if we increase the number of pages.
@neon-ninja also, when visiting mbasic.facebook.com/zanupfparty, the description is not returned in full:
{
"post_id":"2887921374808137",
"text":"COVID-19 UPDATE: As at 20 JUNE 2021, ZIMBABWE🇿🇼 has 41 628 confirmed cases, including 37 167 recoveries, 2 795 Active Cases, 293 New Cases and 1 666 deaths (Recording 24 New Recoveries and 10 Deaths in the last 24hrs ) People Vaccinated so far (1st Dose)701 348 and (2nd Dose) 432 572 ...",
"post_text":"COVID-19 UPDATE: As at 20 JUNE 2021, ZIMBABWE🇿🇼 has 41 628 confirmed cases, including 37 167 recoveries, 2 795 Active Cases, 293 New Cases and 1 666 deaths (Recording 24 New Recoveries and 10 Deaths in the last 24hrs ) People Vaccinated so far (1st Dose)701 348 and (2nd Dose) 432 572 ...",
"shared_text":"None",
"time":datetime.datetime(2021,
6,
20,
9,
24,
57),
"image":"None",
"image_lowquality":"None",
"images":[
],
"images_description":[
],
"images_lowquality":[
],
"images_lowquality_description":[
],
"video":"None",
"video_duration_seconds":"None",
"video_height":"None",
"video_id":"None",
"video_quality":"None",
"video_size_MB":"None",
"video_thumbnail":"None",
"video_watches":"None",
"video_width":"None",
"likes":"None",
"comments":"None",
"shares":0,
"post_url":"https://facebook.com/zanupfparty/posts/2887921374808137",
"link":"None",
"user_id":"1732816076985345",
"username":"ZANU PF Party",
"user_url":"https://facebook.com/zanupfparty/?refid=17&_ft_=mf_story_key.2887921374808137%3Atop_level_post_id.2887921374808137%3Atl_objid.2887921374808137%3Acontent_owner_id_new.1732816076985345%3Athrowback_story_fbid.2887921374808137%3Apage_id.1732816076985345%3Astory_location.4%3Astory_attachment_style.photo%3Atds_flgs.3%3Aott.AX95RoqFVAssAwXl%3Apage_insights.%7B%221732816076985345%22%3A%7B%22page_id%22%3A1732816076985345%2C%22page_id_type%22%3A%22page%22%2C%22actor_id%22%3A1732816076985345%2C%22dm%22%3A%7B%22isShare%22%3A0%2C%22originalPostOwnerID%22%3A0%7D%2C%22psn%22%3A%22EntStatusCreationStory%22%2C%22post_context%22%3A%7B%22object_fbtype%22%3A266%2C%22publish_time%22%3A1624177497%2C%22story_name%22%3A%22EntStatusCreationStory%22%2C%22story_fbid%22%3A%5B2887921374808137%5D%7D%2C%22role%22%3A1%2C%22sl%22%3A4%2C%22targets%22%3A%5B%7B%22actor_id%22%3A1732816076985345%2C%22page_id%22%3A1732816076985345%2C%22post_id%22%3A2887921374808137%2C%22role%22%3A1%2C%22share_id%22%3A0%7D%5D%7D%7D%3Athid.1732816076985345%3A306061129499414%3A2%3A0%3A1625122799%3A1179863955266017432&__tn__=C-R",
"is_live":false,
"factcheck":"None",
"shared_post_id":"None",
"shared_time":"None",
"shared_user_id":"None",
"shared_username":"None",
"shared_post_url":"None",
"available":true,
"comments_full":"None",
"reactors":"None",
"w3_fb_url":"None",
"reactions":"None",
"reaction_count":"None",
"image_id":"None",
"image_ids":[
]
}
I agree unauthenticated mbasic seems quite limited pagination-wise, but perhaps it might be useful for filling in information about posts given a list of post IDs. https://github.com/kevinzg/facebook-scraper/commit/4ad8d9a7b8f36c6d9d10e87bfa2b1a8e32a1faa2 should fix the issue with images_description, if you set a non-Android user agent. Like so:
from pprint import pprint
from facebook_scraper import get_posts, set_noscript, set_proxy, set_user_agent

set_proxy("socks5://127.0.0.1:9050")
set_noscript(True)
set_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36")
for post in get_posts(post_urls=[4230186287065792]):
    pprint(post)
@neon-ninja with the latest update, for the post below:
post_text has limited content.
images is an empty list, even though the post has an image.
comments is None. I see that we cannot get the comment count from mbasic; can we set it to 0 rather than None?
`{
"post_id":"2887921374808137",
"text":"ZANU PF Party\n\nCOVID-19 UPDATE: As at 20 JUNE 2021, ZIMBABWE🇿🇼 has 41 628 confirmed cases, including 37 167 recoveries, 2 795 Active Cases, 293 New Cases and 1 666 deaths (Recording 24 New Recoveries and 10 Deaths in the last 24hrs ) People Vaccinated so far (1st Dose)701 348 and (2nd Dose) 432 572 ...",
"post_text":"",
"shared_text":"ZANU PF Party\n\nCOVID-19 UPDATE: As at 20 JUNE 2021, ZIMBABWE🇿🇼 has 41 628 confirmed cases, including 37 167 recoveries, 2 795 Active Cases, 293 New Cases and 1 666 deaths (Recording 24 New Recoveries and 10 Deaths in the last 24hrs ) People Vaccinated so far (1st Dose)701 348 and (2nd Dose) 432 572 ...",
"time":datetime.datetime(2021,
6,
20,
9,
24,
57),
"image":"None",
"image_lowquality":"None",
"images":[
],
"images_description":[
],
"images_lowquality":[
],
"images_lowquality_description":[
],
"video":"None",
"video_duration_seconds":"None",
"video_height":"None",
"video_id":"None",
"video_quality":"None",
"video_size_MB":"None",
"video_thumbnail":"None",
"video_watches":"None",
"video_width":"None",
"likes":0,
"comments":"None",
"shares":0,
"post_url":"https://facebook.com/zanupfparty/posts/2887921374808137",
"link":"None",
"user_id":"1732816076985345",
"username":"ZANU PF Party",
"user_url":"https://facebook.com/zanupfparty/?refid=17&_ft_=mf_story_key.2887921374808137%3Atop_level_post_id.2887921374808137%3Atl_objid.2887921374808137%3Acontent_owner_id_new.1732816076985345%3Athrowback_story_fbid.2887921374808137%3Apage_id.1732816076985345%3Astory_location.4%3Astory_attachment_style.photo%3Atds_flgs.3%3Aott.AX95RoqFVAssAwXl%3Apage_insights.%7B%221732816076985345%22%3A%7B%22page_id%22%3A1732816076985345%2C%22page_id_type%22%3A%22page%22%2C%22actor_id%22%3A1732816076985345%2C%22dm%22%3A%7B%22isShare%22%3A0%2C%22originalPostOwnerID%22%3A0%7D%2C%22psn%22%3A%22EntStatusCreationStory%22%2C%22post_context%22%3A%7B%22object_fbtype%22%3A266%2C%22publish_time%22%3A1624177497%2C%22story_name%22%3A%22EntStatusCreationStory%22%2C%22story_fbid%22%3A%5B2887921374808137%5D%7D%2C%22role%22%3A1%2C%22sl%22%3A4%2C%22targets%22%3A%5B%7B%22actor_id%22%3A1732816076985345%2C%22page_id%22%3A1732816076985345%2C%22post_id%22%3A2887921374808137%2C%22role%22%3A1%2C%22share_id%22%3A0%7D%5D%7D%7D%3Athid.1732816076985345%3A306061129499414%3A2%3A0%3A1625122799%3A1179863955266017432&__tn__=C-R",
"is_live":false,
"factcheck":"None",
"shared_post_id":"None",
"shared_time":"None",
"shared_user_id":"None",
"shared_username":"None",
"shared_post_url":"None",
"available":true,
"comments_full":"None",
"reactors":"None",
"w3_fb_url":"None",
"reactions":"None",
"reaction_count":"None",
"image_id":"None",
"image_ids":[
]
}`
https://github.com/kevinzg/facebook-scraper/commit/831e4600da6e0f21fedf9fd9357db35e940deab1 should fix the comments=None issue, and https://github.com/kevinzg/facebook-scraper/commit/8709c35776c242fb728ec41d9838b34ae075a6e8 should make it possible to extract comments with mbasic. Depending on your user-agent, it might not be possible to click the post to get the full text without login. https://github.com/kevinzg/facebook-scraper/commit/e3cfe8b850238db0c7e724e8254020b2cfb2578b should fix the issue with missing low quality image links.
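For anyone on a version without that fix, the counts could also be normalised on the client side after scraping. This is a minimal sketch, not part of facebook-scraper: `normalize_counts` is a made-up helper, and it assumes a post is a plain dict with the likes, comments and shares keys shown in the output above. It treats both None and the string "None" (as it appears in the pasted output) as missing.

```python
def normalize_counts(post: dict, keys=("likes", "comments", "shares")) -> dict:
    """Return a copy of post where missing counts are replaced by 0."""
    cleaned = dict(post)  # shallow copy, the original post is left untouched
    for key in keys:
        # Only touch keys that are present but carry no value.
        if key in cleaned and cleaned[key] in (None, "None"):
            cleaned[key] = 0
    return cleaned


if __name__ == "__main__":
    print(normalize_counts({"post_id": "2887921374808137", "likes": 0, "comments": None, "shares": 0}))
```

One trade-off to keep in mind: after this step, a 0 can mean either "no comments" or "count not available", so keep the raw post if that distinction matters downstream.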