Extracting posts from profile

BarBQ-code commented 3 years ago

Hey, I'm writing my own scraper and taking inspiration from this project, so first of all, thank you for maintaining it and putting effort in here.

I'm a bit stuck on how to extract data from profiles in general. I've used mbasic to scrape data about posts and it worked great, but I don't see a way to paginate through a profile's posts there: for some reason the "show more" button in mbasic doesn't seem to work. So I tried migrating to m.facebook, but there I ran into two problems. First, I can't work out how the pagination is implemented. I noticed that they send a cursor as a query param to get the next page, but I couldn't find the value in the page, or any consistency other than that it looks like a post ID. Second, it automatically loads more posts, which is annoying.

I've noticed in your codebase that you have a PageParser class that seems to be in charge of profiles (correct me if I'm wrong), and it looks like you get the next-page cursor using regex. I tried to do the same, but no matter what URL I request I get a 302 status code, and I can't seem to get past that. I'd love some help getting to a solution, either by "patching" mbasic or by bypassing the 302 problem on m.facebook.

Thanks in advance!

neon-ninja commented 3 years ago

For what account is pagination not working on mbasic for you? https://mbasic.facebook.com/profile/timeline/stream/?cursor=AQHRGS_37liwYy0FPeQJEThEcRlCCisKHnjrdiIMJta6ebtBpr-Gy--XGtvZRSME0mTm3QHhBMEmjMjVJTVD-HOFamDW6gd08GHIP5-O9vy7uOFpNq48em26dCle7OsFxk_a&start_time=-9223372036854775808&profile_id=4&replace_id=u_0_3_0a&refid=17 seems to work fine (if you're logged in; profiles require a login). Profile pagination URLs follow this /profile/timeline/stream pattern. The pagination link is the one with the text "See more stories". Here's my regex for this: https://github.com/kevinzg/facebook-scraper/blob/fb15eb5b745d09bbcfcbd45bf1425e8c349ab03c/facebook_scraper/page_iterators.py#L105

302 is a redirect status code, so you should follow the redirect. I suspect it points to the login page, though, because your scraper isn't logged in.
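One way to test that suspicion is to look at the Location header of the 302 before following it. The sketch below is only illustrative: the helper name is_login_redirect is made up, and the assumption that the login page lives under a path starting with /login is a guess about the site, not something confirmed in this thread.

```python
from urllib.parse import urlsplit

# Status codes that carry a Location header pointing somewhere else.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def is_login_redirect(status_code: int, location: str) -> bool:
    """Return True if a response looks like a redirect to a login page.

    Assumption: the login page path starts with "/login".
    """
    if status_code not in REDIRECT_CODES or not location:
        return False
    return urlsplit(location).path.startswith("/login")
```

With requests, the two arguments would come from a response fetched with allow_redirects=False, i.e. resp.status_code and resp.headers.get("Location", "").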

BarBQ-code commented 3 years ago

Just a dummy account I created so I wouldn't get banned. Do you think that might be the problem? Here is the link I'm working on: https://mbasic.facebook.com/joebiden/ When I hit "show more", it doesn't load any more posts. The redirect happens on m.facebook, not on mbasic, and I know I'm logged in just fine because I'm able to get reactions and other data. I've managed to bypass the redirect on m.facebook by using a mobile User-Agent (don't know why I didn't think of it before, lol), but I would still rather use mbasic because the HTML is simpler and not obfuscated. Do you use this regex on m.facebook or on mbasic? cursor_regex_3 = re.compile(r'href:"(/profile/timeline/stream/\?cursor[^"]+)"') I tried it on a posts page on m.facebook and it didn't seem to work.

neon-ninja commented 3 years ago

Yeah, looks like Biden's page doesn't work as well as Zuck's on mbasic. Actually, Biden is a Page, not a Profile. It would be this regex in that case: https://github.com/kevinzg/facebook-scraper/blob/fb15eb5b745d09bbcfcbd45bf1425e8c349ab03c/facebook_scraper/page_iterators.py#L103 All the regexes in this repo are intended for use on m.facebook.com. You could just import this package for the parts of m.facebook.com that you want to scrape.

BarBQ-code commented 3 years ago

Hmm, do you happen to know whether there is a difference between Pages and Profiles, and if so, what it is? And does your scraper support both? I tried to use the regex this way:

resp = self._session.get('https://m.facebook.com/joebiden/posts/')
soup = BeautifulSoup(resp.content, 'html.parser')
cursor_regex_3 = re.compile(r'href:"(/profile/timeline/stream/?cursor[^"]+)"')
content = soup.find_all(cursor_regex_3)

It doesn't seem to find anything. Is that because it's a Page and not a Profile, or am I doing something wrong? (I am authenticated and able to get the content.)

neon-ninja commented 3 years ago

Pages have a clearly visible "Page transparency" section. This scraper supports both. And yes, it's because it's a Page: you can't use the Profile regex on a Page.
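To make the Profile/Page split concrete, here is a minimal sketch that tries both patterns quoted in this thread against the raw HTML and reports which one matched. The function name find_cursor and the "profile"/"page" labels are made up for illustration and are not part of the package's API.

```python
import re

# Patterns as quoted in this thread: one for Profiles, one for Pages.
PROFILE_CURSOR = re.compile(r'href:"(/profile/timeline/stream/\?cursor[^"]+)"')
PAGE_CURSOR = re.compile(r'href[:=]"(/page_content[^"]+)"')


def find_cursor(html: str):
    """Return (kind, relative_url) for the first pattern that matches, else None."""
    for kind, pattern in (("profile", PROFILE_CURSOR), ("page", PAGE_CURSOR)):
        match = pattern.search(html)
        if match:
            return kind, match.group(1)
    return None
```

Searching the response text directly avoids having to know up front whether the account is a Profile or a Page.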

neon-ninja commented 3 years ago

Here's Joe's page transparency - https://m.facebook.com/nt/screen/?params=%7B%22page_id%22%3A7860876103%7D&path=%2Fpages%2Ftransparency&state

BarBQ-code commented 3 years ago

Thanks. Can you elaborate on what page transparency is? Is it just general data about a certain page? And if your scraper supports both, how do you paginate through a Page's posts?

neon-ninja commented 3 years ago

Yeah, just general data, as you can see. Each page of results has the cursor for the next page embedded within it, which the regex extracts and uses for the next page request.
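That loop can be sketched roughly as follows. This is not the scraper's actual implementation: iter_pages, the fetch callable, and max_pages are names invented here, and the sketch assumes the matched URL can be joined to the current URL as-is (real responses may need unescaping first).

```python
import re
from typing import Callable, Iterator
from urllib.parse import urljoin

# Page cursor pattern as quoted in this thread.
PAGE_CURSOR = re.compile(r'href[:=]"(/page_content[^"]+)"')


def iter_pages(start_url: str, fetch: Callable[[str], str],
               max_pages: int = 5) -> Iterator[str]:
    """Yield response bodies, following the embedded cursor from page to page.

    fetch is any function that takes a URL and returns the response text,
    for example lambda url: session.get(url).text with a logged-in session.
    """
    url = start_url
    for _ in range(max_pages):
        body = fetch(url)
        yield body
        match = PAGE_CURSOR.search(body)
        if match is None:
            break  # no cursor in this response, so there is no next page
        url = urljoin(url, match.group(1))
```

Passing fetch in as a parameter keeps the loop independent of how requests are made, and max_pages keeps a scraper from paginating indefinitely.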

BarBQ-code commented 3 years ago

Is there something similar for Pages rather than Profiles? You said that the regex doesn't work on Pages, and indeed it doesn't. Or is there some way I can find the cursor in the DOM?

neon-ninja commented 3 years ago

I already linked to it in my above comment - this one https://github.com/kevinzg/facebook-scraper/blob/fb15eb5b745d09bbcfcbd45bf1425e8c349ab03c/facebook_scraper/page_iterators.py#L103

BarBQ-code commented 3 years ago

I've tried this as well, on this posts page:

resp = self._session.get('https://m.facebook.com/joebiden/posts/')
soup = BeautifulSoup(resp.content, 'html.parser')
cursor_regex = re.compile(r'href[:=]"(/page_content[^"]+)"')
content = soup.find_all(cursor_regex_3)

It doesn't seem to work. Does it work on URLs like the one I provided?

neon-ninja commented 3 years ago

You defined cursor_regex but then used cursor_regex_3

neon-ninja commented 3 years ago

Sometimes the cursor is in some commented out HTML and bs4 might be unhelpfully stripping out comments for you

BarBQ-code commented 3 years ago

I did, but only because I copied the wrong piece of code here; the real code still didn't find it, and it doesn't seem that bs4 strips out comments by default. I'll try to find the cursor in the comments, but I haven't seen anything like that. How are you parsing the HTML? Are you using Selenium or something like that? Also, does it work for you on the page I sent you?

neon-ninja commented 3 years ago

Nope, no selenium here, all pure Python and requests

BarBQ-code commented 3 years ago

Are you able to get the next cursor for the posts on Biden's page using this regex?

neon-ninja commented 3 years ago

import re

import requests
from bs4 import BeautifulSoup

# Request the Page's posts with mobile-browser headers.
resp = requests.get('https://m.facebook.com/joebiden/posts/', headers={
    'Accept-Language': 'en-US,en;q=0.5',
    'Sec-Fetch-User': '?1',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Mobile Safari/537.36',
})
cursor_regex = re.compile(r'href[:=]"(/page_content[^"]+)"')
# Search the raw response text for the cursor.
print(cursor_regex.search(resp.text))
# Pass the same regex to find_all for comparison.
soup = BeautifulSoup(resp.content, 'html.parser')
print(soup.find_all(cursor_regex))

outputs

<re.Match object; span=(76974, 77317), match='href:"/page_content_list_view/more/?page_id=78608>
[]

So a basic regex search on the HTML works, but not through soup.find_all. See https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-regular-expression - when you pass a regex to find_all, it is matched against element names only.
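The difference can be reproduced without bs4. The sketch below uses only the standard library: it collects the tag names from a made-up HTML snippet (a stand-in, not real Facebook markup) and shows that the cursor regex matches none of them, while a search over the raw text does find the cursor. TagNameCollector is a hypothetical helper written for this illustration.

```python
import re
from html.parser import HTMLParser

cursor_regex = re.compile(r'href[:=]"(/page_content[^"]+)"')

# Made-up stand-in for a response body; the cursor sits inside script text.
SAMPLE_HTML = (
    '<html><body><div id="posts"></div>'
    '<script>var next = {href:"/page_content_list_view/more/?page_id=1"};</script>'
    '</body></html>'
)


class TagNameCollector(HTMLParser):
    """Record the name of every start tag, i.e. what a name filter would test."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)


collector = TagNameCollector()
collector.feed(SAMPLE_HTML)

# Matching against element names finds nothing: no tag is called 'href:"/page_content..."'.
name_hits = [name for name in collector.names if cursor_regex.search(name)]
# Matching against the raw text finds the cursor inside the script.
text_hit = cursor_regex.search(SAMPLE_HTML)

print(name_hits)
print(text_hit.group(1))
```

Here the tag names are html, body, div and script, so a filter applied to names has nothing to match, which is the same reason find_all returned an empty list.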

BarBQ-code commented 3 years ago

Ohhh, I see, that's cool, thanks! I just made a request to the "next page" and the response looks like flawed JSON. Can you make anything of it? https://m.facebook.com/page_content_list_view/more/?page_id=7860876103&start_cursor=%7B%22timeline_cursor%22%3A%22AQHRv9-mORpDQsIfIlBB7ho7iJWP7FvRDqTr0mLAMDK_4mqHN1eVU69VhwoGrDEAPgXuofZe0TDGp94pY5incQz69N1ZWx_7kZvIveNl7YjDCTHwSLegytQNppmu3jyNtCuS%22%2C%22timeline_section_cursor%22%3Anull%2C%22has_next_page%22%3Atrue%7D&num_to_fetch=4&surface_type=posts_tab

neon-ninja commented 3 years ago

It's only slightly flawed: if you just skip past the for (;;); prefix, it's valid JSON. Yes, I'm very familiar with responses like that; that's how this scraper does 99% of its pagination. This scraper defines that token like so: https://github.com/kevinzg/facebook-scraper/blob/2b3d0d7b75e0ed257f616d8addcd3154b0813670/facebook_scraper/page_iterators.py#L101 And then it's stripped here: https://github.com/kevinzg/facebook-scraper/blob/2b3d0d7b75e0ed257f616d8addcd3154b0813670/facebook_scraper/page_iterators.py#L162
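As a rough illustration of that stripping step (not the code at the links above), a response body can be parsed like this. The names GUARD_PREFIX and parse_guarded_json are made up here, and the payload shape in the usage line is invented for the example.

```python
import json

# The guard token placed in front of the JSON body.
GUARD_PREFIX = "for (;;);"


def parse_guarded_json(body: str):
    """Drop the leading guard token, if present, and parse the rest as JSON."""
    text = body.lstrip()
    if text.startswith(GUARD_PREFIX):
        text = text[len(GUARD_PREFIX):]
    return json.loads(text)


# Usage with an invented payload:
data = parse_guarded_json('for (;;);{"payload": {"actions": []}}')
```

Checking for the prefix rather than slicing a fixed number of characters means the same helper also accepts bodies that arrive without the token.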

BarBQ-code commented 3 years ago

I see, I'll take a deeper look into it a bit later, thanks for your help so far!

kevinzg / facebook-scraper

Extracting posts from profile #338