moda20 / facebook-scraper

Scrape Facebook public pages without an API key
MIT License

Get_posts still not working #49

Open lullu57 opened 3 months ago

lullu57 commented 3 months ago

Hi, I'm opening this issue again: I updated to the most recent version and get_posts is still not working. Below are the code and the terminal output:

from facebook_scraper import get_posts, _scraper
import json
import logging

with open('./headers.json', 'r') as file:
    _scraper.mbasic_headers = json.load(file)

logging.basicConfig(level=logging.DEBUG)

for post in get_posts(
    'NintendoAmerica',
    base_url="https://mbasic.facebook.com",
    start_url="https://mbasic.facebook.com/NintendoAmerica?v=timeline",
    cookies="mbasic.facebook.com_cookies.json",
    pages=20,
):
    print(post['text'][:50])

DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): facebook.com:443
DEBUG:urllib3.connectionpool:https://facebook.com:443 "GET /settings HTTP/1.1" 301 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.facebook.com:443
DEBUG:urllib3.connectionpool:https://www.facebook.com:443 "GET /settings HTTP/1.1" 200 509
DEBUG:facebook_scraper.facebook_scraper:Starting to iterate pages
DEBUG:facebook_scraper.page_iterators:Requesting page from: https://mbasic.facebook.com/NintendoAmerica?v=timeline
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): mbasic.facebook.com:443
DEBUG:urllib3.connectionpool:https://mbasic.facebook.com:443 "GET /NintendoAmerica?v=timeline HTTP/1.1" 200 None
DEBUG:facebook_scraper.page_iterators:Parsing page response
WARNING:facebook_scraper.page_iterators:No raw posts (<article> elements) were found in this page.
DEBUG:facebook_scraper.page_iterators:The page url is: https://mbasic.facebook.com/NintendoAmerica?v=timeline
DEBUG:facebook_scraper.page_iterators:The page content is:
+------------------------------------------------------------
| Nintendo of America
| HomeEdit ProfileNotificationsFind FriendsPagesGroups(42)
| Menu
| This Page has been updated to the new Pages experience
| To continue managing this Page, you'll need to use a computer or the mobile app. Learn More
| Nintendo of America
| Welcome to the official Nintendo of America Facebook page, home of all things Nintendo! For customer
| Follow
| More
| About · Photos · Likes
| Nintendo of America
| Lights, curtains, action!
| Celebrate the Princess Peach: Showtime! launch with My Nintendo rewards: https://
| ninten.do/
| 6189ctrPx
| 27 March at 14:00 ·
| Public
| 591 · Like · React · 41 Comments · Share · Full Story · Save · Find support or report post
| Nintendo of America
| Save on select games featuring Mario and friends. Offer ends 3/16.
| https://
| www.nintendo.com
| /us/
| retail-offers/
| 10 March at 14:15 ·
| Public
| 890 · Like · React · 75 Comments · Share · Full Story · Save · Find support or report post
| See more stories
| 2024
| 2023
| 2022
| 2021
| 2020
| 2019
| 2018
| 2017
| 2016
| 2015
| 2014
| 2013
| 2012
| 2011
| Block this person
| Find support or report Page
| Create PageHelpSettings & privacy
| Report a problemTerms & PoliciesLog Out (Enrique)
| Back to Top+------------------------------------------------------------

DEBUG:facebook_scraper.page_iterators:Got 0 raw posts from page
DEBUG:facebook_scraper.facebook_scraper:Extracting posts from page 0
DEBUG:facebook_scraper.page_iterators:Looking for next page URL
INFO:facebook_scraper.page_iterators:Page parser did not find next page URL

headers.json was set up as described in issue #22, and the cookies were exported in JSON format with the "Get cookies.txt LOCALLY" extension while on the site.

moda20 commented 2 months ago

@lullu57 were you able to find a solution to this? I am not sure what the issue is here.

lullu57 commented 2 months ago

Unfortunately, I wasn't able to discover what's causing the issue. It could be something related to the cookies, but I have tried everything as described.

iTrooz commented 2 months ago

I don't know much about this, but I don't think the problem is the cookies, since the response does contain a post: "Save on select games featuring Mario and friends. Offer ends 3/16"

Maybe it's an issue with the HTML parser in the library?

lullu57 commented 2 months ago

I tried debugging into the repo, and it seems that even though the posts are there, the scraper does not recognise them and so does not iterate over them, so I think that could be the issue. I built a pyppeteer script that gets the page information I need (it only works with pages) and I'm working with that. For posts, I found this other repo, which is working for me: https://github.com/shaikhsajid1111/facebook_page_scraper

moda20 commented 2 months ago

@lullu57 I'm pretty sure Facebook is rolling out a new DOM update for mbasic to confuse scrapers like this repo. For me this is still not an issue, so it would be great if you (or @iTrooz) could check the HTML returned for posts and see whether the <article> tag or <div role="article"> represents the individual posts. That's how the scraper finds posts.
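
One way to run that check is to save the returned HTML and count the two markers. This is a minimal sketch that uses only the standard library and is not the scraper's own parsing code; `ArticleCounter`, `count_post_markers`, and the sample HTML are hypothetical names and data made up for illustration.

```python
from html.parser import HTMLParser


class ArticleCounter(HTMLParser):
    """Counts <article> tags and elements carrying role="article"."""

    def __init__(self):
        super().__init__()
        self.article_tags = 0
        self.role_articles = 0

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method.
        if tag == "article":
            self.article_tags += 1
        # attrs is a list of (name, value) pairs.
        if dict(attrs).get("role") == "article":
            self.role_articles += 1


def count_post_markers(html):
    """Return how many of each post marker appear in the given HTML."""
    counter = ArticleCounter()
    counter.feed(html)
    counter.close()
    return {
        "article_tags": counter.article_tags,
        "role_articles": counter.role_articles,
    }


# Hypothetical sample; replace it with the HTML you actually got back.
sample = (
    '<div id="root"><article>one</article>'
    '<div role="article">two</div><div>three</div></div>'
)
print(count_post_markers(sample))
```

If both counts are zero for a page that visibly contains posts, the markup has probably changed and the posts are wrapped in something else.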

I also have a fork of that repo, https://github.com/moda20/facebook_page_scraper, that gets all images in high resolution along with other useful unique IDs. I am using it now, though some smaller issues remain to be resolved. I still use this repo, however, for full text and sometimes high-res image extraction.

AssafGolani commented 1 week ago

I don't get posts either. My relevant code looks like this:

for group in GROUP_NAMES:
    for post in get_posts(group=group, pages=10, options={"posts_per_page": 50, "allow_extra_requests": False}):
        post_text = post.get('post_text')
        if not post_text:
            continue

        if any(word in post_text for word in INTERESTS) and not any(word in post_text for word in IGNORE):
            post_url = post['post_url'].replace("https://m.", "https://", 1)
            if post_url in prev_urls:
                print(f"Skipped URL {post_url}")
                continue
            data['date'].append(post['time'])
            data['link'].append(post_url)
            data['info'].append(post_text[:200])
            data['username'].append(post['username'])
            data['#comments'].append(post['comments'])
            data['#likes'].append(post['likes'])

print("Finished parsing Facebook results")

if not data['date']:
    print("No data for file")
    exit()

"No data for file" is printed every time, even though the groups I added are public.
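
To tell whether the empty result comes from the scraper returning nothing or from the keyword filter rejecting everything, the filter can be pulled out and counted separately. This is only a sketch: `matches_interests`, `summarize`, and the sample posts are hypothetical, and only the `post_text` key is taken from the snippet above.

```python
def matches_interests(post, interests, ignore):
    """True if the post text mentions an interest and no ignored word."""
    post_text = post.get("post_text")
    if not post_text:
        return False
    has_interest = any(word in post_text for word in interests)
    has_ignored = any(word in post_text for word in ignore)
    return has_interest and not has_ignored


def summarize(posts, interests, ignore):
    """Count how many posts survive each stage of the filtering."""
    posts = list(posts)
    with_text = [p for p in posts if p.get("post_text")]
    matched = [p for p in with_text if matches_interests(p, interests, ignore)]
    return {
        "fetched": len(posts),
        "with_text": len(with_text),
        "matched": len(matched),
    }


# Hypothetical sample data standing in for what get_posts would yield.
sample_posts = [
    {"post_text": "Selling a bike, good condition"},
    {"post_text": "Looking for a flat"},
    {"post_text": None},
]
print(summarize(sample_posts, interests=["bike"], ignore=["broken"]))
```

If "fetched" is 0 when this is run on the real generator, no posts are coming back from the scraper at all and the filter is not the cause.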

martincpt commented 1 week ago

I cannot understand why FB is such a jerk by closing all these routes. Why not let developers pay for their data? Then everyone would be happy.