kevinzg / facebook-scraper

Scrape Facebook public pages without an API key
MIT License
2.34k stars 624 forks

[Question] Do you plan on adding get_profile to get the About section but for pages as well? #332

Open 97morningstar opened 3 years ago

97morningstar commented 3 years ago

I am using this library for a research project and I want to thank you for all your hard work :)

I have a few URLs whose profile info (About section) I want to get. The problem is that I don't know what type of account each URL belongs to (a page or a regular user). Thankfully, using your get_posts function I can get the user_id, but if it's a page, I can't use your get_profile for it.

It works fine for User accounts.

The format I am passing to get_posts is:

"unique name of the page"/posts/"post id"

Other questions:

  1. I am also trying to get the user_id from a post in the following form:

https://www.facebook.com/photo.php?fbid=495386170902530&set=pb.100012934538675.-2207520000..&type=3

But it tells me:

HTTPError: 404 Client Error: Not Found for url: https://m.facebook.com/photo.php?fbid=495386170902530&set=pb.100012934538675.-2207520000..&type=3/

I am just using the code from the examples.

Example of my code:


from facebook_scraper import get_posts, get_profile

for post in get_posts("elisamartinezfuentes/posts/1190292224745251", cookies=cookie_path, options={"allow_extra_requests": True}):
    account = post['user_id']

print(get_profile(account, cookies=cookie_path))

TL;DR: I need to get the info on the public about section for Pages/Groups/Users

I apologize if this is not the correct format for an issue.

neon-ninja commented 3 years ago

Yes, I recently added a function called get_page_info in https://github.com/kevinzg/facebook-scraper/commit/adfd6da732077730d4935e22854d17aa9f2f667b / https://github.com/kevinzg/facebook-scraper/discussions/326. There's also no need to resolve the account name to a user ID; you can pass account names to these functions too.

If you don't know whether it's a page/group/user, you can just run each of the functions and check the output. Here's some sample code:

from facebook_scraper import *

# Profile, Page, Group
accounts = ["elisamartinezfuentes", "Nintendo", "lienminhnongnghieptute"]
functions = [get_profile, get_page_info, get_group_info]
for account in accounts:
    for function in functions:
        try:
            print(function, function(account, cookies="cookies.txt"))
        except Exception:
            # This function doesn't apply to this account type; try the next one.
            pass

output:

<function get_profile at 0x7fd8a54f2430> {'Name': 'Elisa Martínez Fuentes', 'Education': 'UMCC Universidad Matanzas Camilo ciefuengos\nIngeniería informática', 'Work': 'Allstate\nTest Engineer Intern\nOn May 18, 2019\nIrving, Texas\nJoyLabs\nWallbreakers Software Engineer Trainee\nIn 2019\nConroe, Texas', 'Places Lived': 'Houston, Texas\nCurrent City\nMoa, Cuba\nHometown', 'Contact Info': {'Facebook': '/elisamartinezfuentes', 'GitHub': '97morningstar', 'Instagram': '97morningstar'}, 'Basic Info': 'August 24, 1997\nBirthday\nFemale\nGender\nEspanol and English\nLanguages', 'Other Names': 'Elisa\nNickname', 'Relationship': {'to': 'Mario Bernal Jr.', 'type': 'Engaged', 'since': 'Since February 7, 2021'}, 'Family Members': 'Mi hermana\nSister\nMi papa\nFather\nMi mama\nMother\nIsmaray Martinez\nFamily member', 'Life Events': '2021\nGot Engaged to Mario Bernal Jr.\n2019\nIn a Relationship with Mario Bernal Jr.'}
<function get_page_info at 0x7fd8a54f24c0> {'name': 'Elisa Martínez Fuentes', 'identifier': 100012934538675, 'url': 'https://www.facebook.com/elisamartinezfuentes', 'image': 'https://scontent.fhlz2-1.fna.fbcdn.net/v/t1.6435-1/fr/cp0/e15/q65/74634432_755065824934562_3872097589268578304_n.jpg?_nc_cat=111&ccb=1-3&_nc_sid=8d0d6e&efg=eyJpIjoidCJ9&_nc_ohc=ctTSociQCz8AX9UVjqr&_nc_ht=scontent.fhlz2-1.fna&tp=14&oh=60191d22c4524cf806a9dae988c883ea&oe=60E96F7A', 'sameAs': None, 'dateCreated': None, 'type': 'Person'}
<function get_profile at 0x7fd8a54f2430> {'Name': 'Nintendo - About'}
<function get_page_info at 0x7fd8a54f24c0> {'name': 'Nintendo', 'identifier': 119240841493711, 'url': 'https://www.facebook.com/Nintendo/', 'image': 'https://scontent.fhlz2-1.fna.fbcdn.net/v/t1.18169-1/13138816_1025939437490509_1827627750936743506_n.png?_nc_cat=1&ccb=1-3&_nc_sid=8d0d6e&efg=eyJpIjoidCJ9&_nc_ohc=EVVea4o3SQ4AX_FxXMq&_nc_ht=scontent.fhlz2-1.fna&oh=4acd1086d0e2854ab8e31efff3ba4c22&oe=60EA12D2', 'sameAs': 'http://www.nintendo.com', 'dateCreated': '2011-06-01T15:23:50-0700', 'type': 'Organization', 'likes': 5412964, 'followers': 5471921}
<function get_group_info at 0x7fd8a54f2550> {'id': '627551287825967', 'name': 'LIÊN MINH NÔNG NGHIỆP TỬ TẾ', 'type': 'Public group', 'members': [{'name': 'Hoàng Nguyen', 'link': '/profile.php?id=100014268012641'}, {'name': 'Nguyễn Quốc Nam', 'link': '/profile.php?id=100015991331351'}, {'name': 'Nguyễn Thị Thu', 'link': '/thu.tqd'}, {'name': 'Thiên Vũ', 'link': '/profile.php?id=100027172146614'}, {'name': 'Trần Huế', 'link': '/profile.php?id=100025847746706'}, {'name': 'Ban Icat', 'link': '/ban.nguyenban.3'}, {'name': 'Dương Ngọc Phúc', 'link': '/profile.php?id=100007471802699'}, {'name': 'Dương Đình Tường', 'link': '/profile.php?id=100008623893421'}, {'name': 'Hà Trần Thị Thuý', 'link': '/ha.tranthithuy.1'}, {'name': 'Liên Giun Quế Ght', 'link': '/nguyenthi.lien.5'}, {'name': 'Lê Tuyền', 'link': '/lethingoctuyen1989'}, {'name': 'Nguyễn Luyến', 'link': '/profile.php?id=100008158538378'}, {'name': 'Nguyễn Phúc', 'link': '/profile.php?id=100026866214726'}, {'name': 'Phương Nguyễn Quang', 'link': '/phuong.nguyenquang.7'}, {'name': 'Quang Thai Tran', 'link': '/thaianh10001'}, {'name': 'Thu Hà', 'link': '/thuha2190'}, {'name': 'Tung Tran', 'link': '/tung.tran.33046'}, {'name': 'Tài MU', 'link': '/taimu1503'}, {'name': 'Đoàn Thanh Bình', 'link': '/profile.php?id=100003338407049'}], 'admins': [{'name': 'Hoàng Công', 'link': '/hoang.cong.79'}, {'name': 'Hoàng Nguyen', 'link': '/profile.php'}, {'name': 'Hoàng Thanh', 'link': '/hoangthanh588'}, {'name': 'Hưng Rivers East', 'link': '/HungRiversEast'}]}

Some functions do work with input that doesn't match their intended use, as you can see. But you can filter by expected keys.
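As a rough illustration of filtering by expected keys, the helper below guesses the account kind from whichever results came back. The name classify_account and the key heuristics are assumptions drawn only from the sample output above; they are not part of facebook_scraper.

```python
def classify_account(profile=None, page_info=None, group_info=None):
    """Guess the account kind from whichever scraper results are available.

    The heuristics are assumptions based on the sample output in this thread.
    """
    # Group results in the sample carry 'members' / 'admins' lists.
    if group_info and ("members" in group_info or "admins" in group_info):
        return "group"
    # get_page_info reported type 'Organization' for a page and
    # 'Person' when it was given a regular user account.
    if page_info:
        kind = page_info.get("type")
        if kind == "Organization":
            return "page"
        if kind == "Person":
            return "user"
    # A real profile has more than the bare 'Name' key
    # (a page passed to get_profile only returned 'Name').
    if profile and set(profile) - {"Name"}:
        return "user"
    return "unknown"


# Abbreviated versions of the dictionaries printed above.
print(classify_account(profile={"Name": "Nintendo - About"},
                       page_info={"name": "Nintendo", "type": "Organization"}))
print(classify_account(group_info={"id": "627551287825967", "members": []}))
```

Each argument would be the return value of the matching function (get_profile, get_page_info, get_group_info), or None if that call raised.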

97morningstar commented 3 years ago

Thank you @neon-ninja, python is amazing! :)

What about the "About" section on the Fb pages? I think with the get_profile and get_group_info I have almost everything I need. Do you plan on adding that in the future? If not I can try to add it and help this project.

(I will check my problem with "photo.php" later :), I will open a new one with more clarity/testing on that).

These are the types of URL I am working with, and trying to get their profile/about info from:

https://gist.github.com/97morningstar/9942fbae0e65c26c37d32d975b29bec3

With your little code snippet, I can start classifying users, groups, and pages from any Fb URL given (hopefully). Thanks!

neon-ninja commented 3 years ago

What information are you trying to extract from a page's about section that get_page_info doesn't extract? Do you have a sample page where it doesn't work? I've mostly just tested it with Nintendo's page.

When extracting a post directly, you should tell the scraper that the URL represents a single post so that it doesn't try to paginate. The argument for this is post_urls. This code:

from pprint import pprint
from facebook_scraper import get_posts

post_urls = ["https://www.facebook.com/photo.php?fbid=495386170902530&set=pb.100012934538675.-2207520000..&type=3"]
pprint(next(get_posts(post_urls=post_urls, cookies="cookies.txt")))

Outputs:

{'available': True,
 'comments': 7,
 'comments_full': None,
 'factcheck': None,
 'image': 'https://scontent.fakl1-2.fna.fbcdn.net/v/t1.6435-9/fr/cp0/e15/q65/38280444_495386177569196_4472689928857190400_n.jpg?_nc_cat=109&ccb=1-3&_nc_sid=05277f&efg=eyJpIjoidCJ9&_nc_ohc=3tIZ-TSPPtAAX9NwgPP&_nc_ht=scontent.fakl1-2.fna&tp=14&oh=a0acf6d6286a8451a2060da1901d7db5&oe=60C895A9&manual_redirect=1',
 'image_lowquality': None,
 'images': ['https://scontent.fakl1-2.fna.fbcdn.net/v/t1.6435-9/fr/cp0/e15/q65/38280444_495386177569196_4472689928857190400_n.jpg?_nc_cat=109&ccb=1-3&_nc_sid=05277f&efg=eyJpIjoidCJ9&_nc_ohc=3tIZ-TSPPtAAX9NwgPP&_nc_ht=scontent.fakl1-2.fna&tp=14&oh=a0acf6d6286a8451a2060da1901d7db5&oe=60C895A9&manual_redirect=1'],
 'images_description': None,
 'images_lowquality': [],
 'images_lowquality_description': [],
 'is_live': False,
 'likes': 6,
 'link': None,
 'original_request_url': 'https://www.facebook.com/photo.php?fbid=495386170902530&set=pb.100012934538675.-2207520000..&type=3',
 'post_id': '495386170902530',
 'post_text': 'Elisa Martínez Fuentes',
 'post_url': 'https://m.facebook.com/495386170902530',
 'reaction_count': None,
 'reactions': None,
 'reactors': None,
 'shared_post_id': None,
 'shared_post_url': None,
 'shared_text': None,
 'shared_time': None,
 'shared_user_id': None,
 'shared_username': None,
 'shares': 0,
 'text': 'Elisa Martínez Fuentes',
 'time': datetime.datetime(2018, 8, 3, 0, 0),
 'user_id': None,
 'user_url': 'https://facebook.com/elisamartinezfuentes?refid=13&__tn__=%2Cg',
 'username': 'Elisa Martínez Fuentes',
 'video': None,
 'video_duration_seconds': None,
 'video_height': None,
 'video_id': None,
 'video_quality': None,
 'video_size_MB': None,
 'video_thumbnail': None,
 'video_watches': None,
 'video_width': None,
 'w3_fb_url': None}

neon-ninja commented 3 years ago

https://github.com/kevinzg/facebook-scraper/commit/5c904e8761735003a93c2e3c9803aa4259ee088a might help
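In the post output above, user_id is None, but user_url still carries the account name, and account names can be passed to the lookup functions directly. One possible workaround is to pull the name out of user_url. The helper account_from_user_url is hypothetical (not part of facebook_scraper) and assumes URLs shaped like the ones shown in this thread.

```python
from urllib.parse import urlparse, parse_qs


def account_from_user_url(user_url):
    """Return the account name or numeric id embedded in a user_url, or None."""
    if not user_url:
        return None
    parsed = urlparse(user_url)
    path = parsed.path.strip("/")
    # Links such as /profile.php?id=100014268012641 keep the id in the query.
    if path == "profile.php":
        ids = parse_qs(parsed.query).get("id")
        return ids[0] if ids else None
    # Otherwise the first path segment is the account name.
    return path.split("/")[0] or None


print(account_from_user_url(
    "https://facebook.com/elisamartinezfuentes?refid=13&__tn__=%2Cg"))
```

The returned value could then be passed to get_profile or get_page_info in place of post['user_id'].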

97morningstar commented 3 years ago

> What information are you trying to extract from a page's about section that get_page_info doesn't extract?

I am trying to access the info on this page, for example:

https://www.facebook.com/cityofgalveston/about/?ref=page_internal

The General, Hours, Additional Contact Info, and More Info sections, if they exist. Also website, phone, type of organization, and address, of course, as long as they are all of public access.

Thank you again. I uninstalled and reinstalled the package using pip. Should I install it from this repo instead?

> Do you have a sample page where it doesn't work?

Yes, this is my code snippet and output:

from facebook_scraper import get_posts, get_page_info

for post in get_posts("cityofgalveston/posts/4199446900116960", cookies=cookie_path):
    account = post['user_id']

print(get_page_info(account, cookies=cookie_path))

output:

{'name': 'City of Galveston, Texas - Government', 'identifier': 109633465765011, 'url': 'https://www.facebook.com/cityofgalveston/', 'image': 'https://scontent-dfw5-1.xx.fbcdn.net/v/t1.18169-1/fr/cp0/e15/q65/10929550_830039863724364_7151476885763355021_n.jpg?_nc_cat=110&ccb=1-3&_nc_sid=8d0d6e&efg=eyJpIjoidCJ9&_nc_ohc=4judlR7PZosAX93bLqE&_nc_ht=scontent-dfw5-1.xx&tp=14&oh=aa08ac379760b6c34d803ce623057564&oe=60C801D7', 'sameAs': 'http://www.galvestontx.gov', 'dateCreated': '2010-09-17T17:11:41-0700', 'type': 'Organization', 'followers': 47581}

I can't express enough gratitude for this package. I spent almost a week trying the Facebook Graph API until I realized that they need to approve every permission, and you may not get approved.

Note: This project is part of a Research Experiences for Undergraduates program I am in.

neon-ninja commented 3 years ago

I haven't created a PyPI release containing this commit yet. You can use pip install git+https://github.com/kevinzg/facebook-scraper.git to install the latest master branch.

neon-ninja commented 3 years ago

from facebook_scraper import *
from pprint import pprint
pprint(get_page_info("cityofgalveston", cookies="cookies.txt"))

outputs

{'about': '823 Rosenberg, Galveston, TX 77550\n'
          'Get Directions\n'
          'Closed Now\n'
          '·\n'
          '8 AM - 5 PM\n'
          'Closed Now\n'
          '·\n'
          '8 AM - 5 PM\n'
          'Sunday\n'
          'Monday\n'
          'Tuesday\n'
          'Wednesday\n'
          'Thursday\n'
          'Friday\n'
          'Saturday\n'
          'CLOSED\n'
          '8 AM - 5 PM\n'
          '8 AM - 5 PM\n'
          '8 AM - 5 PM\n'
          '8 AM - 5 PM\n'
          '8 AM - 5 PM\n'
          'CLOSED\n'
          '+1 409-797-3500\n'
          'http://www.galvestontx.gov/\n'
          "Welcome to the City of Galveston's official Facebook Page!\n"
          'Government Organization · Public & Government Service\n'
          'Send Message\n'
          'www.galvestontx.gov',
 'dateCreated': '2010-09-17T17:11:41-0700',
 'followers': 47591,
 'identifier': 109633465765011,
 'image': 'https://scontent.fakl1-3.fna.fbcdn.net/v/t1.18169-1/fr/cp0/e15/q65/10929550_830039863724364_7151476885763355021_n.jpg?_nc_cat=110&ccb=1-3&_nc_sid=8d0d6e&efg=eyJpIjoidCJ9&_nc_ohc=GrtCkObn3ToAX-zqxaf&_nc_ht=scontent.fakl1-3.fna&tp=14&oh=14601e305ca41eeb450c6a3434e6eb22&oe=60CBF657',
 'likes': 44080,
 'name': 'City of Galveston, Texas - Government',
 'sameAs': 'http://www.galvestontx.gov',
 'type': 'Organization',
 'url': 'https://www.facebook.com/cityofgalveston/'}

97morningstar commented 3 years ago

Thank you so much @neon-ninja
I am sorry to bother you again. I reran my code with the updated package and got this error when calling get_page_info.

I don't understand why I sometimes see the info and other times get this error. For example, "cityofgalveston" works fine now, but "albertahealthservices" throws this error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-13-3c48354637ff> in <module>
     14 
     15 
---> 16 print(get_page_info("cityofgalveston", cookies=cookie_path))
     17 
     18 #post_urls = ["https://www.facebook.com/photo.php?fbid=495386170902530&set=pb.100012934538675.-2207520000..&type=3"]

~\anaconda3\anaconda\envs\findm\lib\site-packages\facebook_scraper\__init__.py in get_page_info(account, **kwargs)
     77     cookies = kwargs.pop('cookies', None)
     78     set_cookies(cookies)
---> 79     return _scraper.get_page_info(account, **kwargs)
     80 
     81 

~\anaconda3\anaconda\envs\findm\lib\site-packages\facebook_scraper\facebook_scraper.py in get_page_info(self, page, **kwargs)
    230             return None
    231         meta = json.loads(elem.text)
--> 232         result = meta["author"]
    233         result["type"] = result.pop("@type")
    234         desc = resp.html.find("meta[name='description']", first=True)

KeyError: 'author'

neon-ninja commented 3 years ago

This function works by finding a sample post and extracting the JSON-LD, which seems to vary depending on the post. Try this - https://github.com/kevinzg/facebook-scraper/commit/58629bc6f5474a3e59a2a90eff3aca439296e8ca
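To illustrate why the KeyError above can occur: a JSON-LD block that lacks an "author" entry breaks a direct meta["author"] lookup. The sketch below is not the code from the linked commit; author_from_json_ld is a hypothetical helper showing one tolerant way to read such a block.

```python
import json


def author_from_json_ld(raw):
    """Return the 'author' object of a JSON-LD string, or {} if it is absent."""
    meta = json.loads(raw)
    # Some posts may yield JSON-LD without an author, or not an object at all.
    if not isinstance(meta, dict):
        return {}
    author = meta.get("author")
    if not isinstance(author, dict):
        return {}
    result = dict(author)
    # Mirror the rename seen in the traceback: "@type" becomes "type".
    if "@type" in result:
        result["type"] = result.pop("@type")
    return result


print(author_from_json_ld('{"author": {"@type": "Organization", "name": "Example"}}'))
print(author_from_json_ld('{"headline": "a post without author data"}'))
```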

from facebook_scraper import *
from pprint import pprint
pprint(get_page_info("albertahealthservices", cookies="cookies.txt"))

outputs

{'about': 'http://www.albertahealthservices.ca/\n'
          'We are the provincial health authority responsible for planning and '
          'delivering health services to Albertans. Please read our Privacy '
          'Notice: http://bit.ly/102zhkz before posting on our page.\n'
          'Government Organization\n'
          'See what Alberta Health Services is doing in Messenger\n'
          'Get Started\n'
          'The AHS Facebook page was created to support Albertans as you make '
          'decisions about your own health and the health of your family, and '
          'help connect you to resources and services that are important to '
          'you.\n'
          '\n'
          'Privacy Notice:\n'
          '\n'
          'In accordance with health and privacy legislation, please do not '
          'post your personal health information or that of anyone else on '
          'this site. AHS does not provide personal medical advice on public '
          'social media sites. Anyone seeking medical advice should contact '
          'their physician or call HealthLink Alberta at 1-866-408-5465. In '
          'case of emergency, call 911 immediately.\n'
          '\n'
          'The following content is subject to removal from any AHS social '
          'media site:\n'
          '• Personal or Health Information or other confidential information\n'
          '• Abusive or vulgar language\n'
          '• Irrelevant to the subject matter or not related to AHS\n'
          '• Spam or another form of advertising; and/or violations of federal '
          'or provincial law\n'
          '\n'
          'AHS advises you that this site is public and any information posted '
          'on the site by you indicates your consent to share your '
          'information. Facebook retains ownership of the information '
          'regardless of whether the site is restricted and moderated. '
          'Individuals are encouraged to read Facebook’s privacy policy '
          'regarding the use of personal information posted on the site.\n'
          'Products\n'
          'For a complete list of programs, services, resources, news and '
          'initiatives at AHS, visit www.albertahealthservices.ca\n'
          '\n'
          '\n'
          'MEDICAL CARE\n'
          '\n'
          '\n'
          'Note: If you are concerned that you are seriously ill or injured, '
          'go to the nearest Emergency Department. Patients with potentially '
          'life-threatening conditions should immediately phone 911.\n'
          '\n'
          '\n'
          'Health Care Options\n'
          '\n'
          'Each community in Alberta offers a different range of health care '
          'services and programs. Use the information in this chart to help '
          'you choose and find the best health care and treatment option for '
          'your needs.\n'
          '\n'
          'http://www.albertahealthservices.ca/3381.asp\n'
          '\n'
          '\n'
          'Health Link Alberta\n'
          '\n'
          'Anyone in Alberta with a health question or concern can call to '
          'talk to a Registered Nurse and other health professionals 24 hours '
          'a day, seven days a week:\n'
          '\n'
          'Toll-free: 1-866-408-5465 (LINK)\n'
          'Edmonton: 780-408-5465 (LINK)\n'
          'Calgary: 403-943-5465 (LINK)\n'
          '\n'
          '\n'
          'My Health Alberta\n'
          '\n'
          'MyHealthAlberta is a single source of trusted online health '
          'information and health tools developed in partnership between the '
          'Government of Alberta and Alberta Health Services. The information '
          'and tools you will find on MyHealthAlberta were developed in '
          'consultation with health professionals, and Albertans.\n'
          '\n'
          'Check symptoms and find information on hundreds of health topics '
          'at:\n'
          '\n'
          'https://myhealth.alberta.ca/Pages/default.aspx\n'
          '\n'
          '\n'
          'Find Hospitals and Facilities\n'
          '\n'
          'Search for a hospital or health care facility near you:\n'
          '\n'
          'http://www.albertahealthservices.ca/facilities.asp?pid=facilities\n'
          '\n'
          '\n'
          'AHS Emergency Department Wait Times\n'
          '\n'
          'Before heading to the emergency department, check to see the '
          'estimated wait times at an emergency room near you. Please note: '
          'The estimated waiting time to see a physician in Emergency is '
          'approximate and is for informational purposes only. We provide care '
          'to the most critical cases first.\n'
          '\n'
          'http://www.albertahealthservices.ca/4770.asp\n'
          'Understanding Wait Times FAQ: '
          'http://www.albertahealthservices.ca/Data/ahs-data-edm-understanding-wait-times-patients.pdf\n'
          '\n'
          '\n'
          'FEEDBACK\n'
          '\n'
          '\n'
          'Patient Concerns and Feedback\n'
          '\n'
          'Your experience of care holds important information that helps us '
          'to continuously improve. We want to hear what you have to say so we '
          "can better understand what we're doing right and what we can do "
          'better.\n'
          '\n'
          'www.albertahealthservices.ca/patientfeedback.asp\n'
          '\n'
          '\n'
          'IN YOUR COMMUNITY\n'
          '\n'
          '\n'
          'Find Programs and Services\n'
          '\n'
          'Alberta Health Services has a wide range of programs and services '
          'available at locations across the province to serve the needs of '
          'community residents.\n'
          '\n'
          'http://www.albertahealthservices.ca/services.asp?pid=services\n'
          '\n'
          '\n'
          'Advisory Councils\n'
          '\n'
          'The Alberta Health Services team is committed to engaging the '
          'public in a respectful, open and accountable manner to support the '
          'strategic direction of the organization. Community input and '
          'feedback allows us to better address the health needs of '
          'communities. Learn more about our health and provincial advisory '
          'councils and how you can get involved:\n'
          '\n'
          'http://www.albertahealthservices.ca/communityrelations.asp\n'
          '\n'
          'Foundations and Trusts\n'
          '\n'
          'Alberta Health Services relies on its 64 Foundation & Health Trust '
          'partners to help drive innovation in health care. These '
          'organizations work diligently to gather community support, develop '
          'partnerships and raise critically-needed funds to further enhance '
          'the care delivered to patients and families in Alberta. They are '
          'committed to building excellence within our system.\n'
          '\n'
          'http://www.albertahealthservices.ca/255.asp\n'
          '\n'
          '\n'
          'CONNECT\n'
          '\n'
          'Social Media\n'
          '\n'
          'AHS uses a variety of social media channels to connect with '
          'Albertans in their communities and beyond, including Twitter, '
          'YouTube and Facebook. To find out more, visit:\n'
          '\n'
          'http://www.albertahealthservices.ca/socialmedia.asp\n'
          '\n'
          '\n'
          'Mobile App\n'
          '\n'
          'AHS is working to ensure that Albertans have innovative ways to '
          'access information about healthcare services by providing '
          'applications for mobile devices. Learn more and download the AHS '
          'Mobile App:\n'
          '\n'
          'http://www.albertahealthservices.ca/mobile.asp\n'
          '\n'
          '\n'
          'WORKING AT AHS\n'
          '\n'
          '\n'
          'AHS Careers\n'
          '\n'
          'For information about career opportunities at Alberta Health '
          'Services and steps to apply, please visit the AHS Careers website '
          'at:\n'
          '\n'
          'http://www.albertahealthservices.ca/careers/default.asp\n'
          'Privacy Policy',
 'followers': 83674,
 'image': 'https://scontent.fhlz2-1.fna.fbcdn.net/v/t1.6435-1/fr/cp0/e15/q65/119725810_3424921950879260_840471821714627235_n.jpg?_nc_cat=101&ccb=1-3&_nc_sid=c1fdac&efg=eyJpIjoidCJ9&_nc_ohc=gyZlAtpeMdUAX-40zMi&_nc_ht=scontent.fhlz2-1.fna&tp=14&oh=e41f04782a16416e03fcf70e4f87e808&oe=60CC7132',
 'likes': 74660,
 'name': 'Alberta Health Services',
 'type': 'Organization',
 'url': '/albertahealthservices/'}
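
Since 'about' comes back as one newline-separated string, the website and phone number asked about earlier have to be picked out of it by the caller. The following is a minimal sketch under that assumption; summarize_about and both regular expressions are illustrative and not part of facebook_scraper, and real About text may need looser patterns.

```python
import re

# Assumed patterns for illustration only.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def summarize_about(about):
    """Split the 'about' text into URLs, phone numbers, and remaining lines."""
    urls, phones, other = [], [], []
    for line in about.splitlines():
        line = line.strip()
        if not line:
            continue
        found = URL_RE.findall(line)
        if found:
            # Keep each URL once, in order of first appearance.
            for url in found:
                if url not in urls:
                    urls.append(url)
        elif PHONE_RE.fullmatch(line):
            # Only lines that consist entirely of a phone number count.
            phones.append(line)
        else:
            other.append(line)
    return {"urls": urls, "phones": phones, "other": other}


# Shortened from the cityofgalveston output above.
sample_about = (
    "823 Rosenberg, Galveston, TX 77550\n"
    "Get Directions\n"
    "+1 409-797-3500\n"
    "http://www.galvestontx.gov/\n"
    "Government Organization · Public & Government Service\n"
    "www.galvestontx.gov"
)
print(summarize_about(sample_about))
```

Lines that mix a URL with other text are counted under urls only, so the "other" list is a lower bound on the remaining content.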