kevinzg / facebook-scraper

Scrape Facebook public pages without an API key
MIT License

Extract direct link for image and posts #213

Open Breizhux opened 3 years ago

Breizhux commented 3 years ago

Hello, I noticed that the URLs of the retrieved images are not always usable. Sometimes they are direct links on the domain "scontent-cdt1-1.xx.fbcdn.net". But sometimes the URL is not direct and requires authentication to resolve; in that case the domain is m.facebook.com.

Public page: https://fr-fr.facebook.com/groups/saintyves.rennes/
Post concerned: https://www.facebook.com/groups/saintyves.rennes/permalink/1360623547663812/
URL that is retrieved for the image: https://m.facebook.com/photo/view_full_size/?fbid=3861145620587869&ref_component=mbasic_photo_permalink&ref_page=%2Fwap%2Fphoto.php&refid=13&__tn__=%2Cg
The following URL would be much more useful: https://scontent-cdt1-1.xx.fbcdn.net/v/t1.6435-9/s960x960/176314059_3861145627254535_6708760356773320290_n.jpg?_nc_cat=110&ccb=1-3&_nc_sid=825194&_nc_ohc=nTqGUQ-o0h0AX_bnWV-&_nc_ht=scontent-cdt1-1.xx&tp=7&oh=c7ab3b2c862064ae5c12503f6707f434&oe=60A68072

While writing this ticket, I noticed that the URL of the post is not usable either.
URL retrieved for the post: https://facebook.com/439909623068547/posts/1360623547663812
URL that is usable without an account: https://www.facebook.com/groups/saintyves.rennes/permalink/1360623547663812/

I understand the idea of retrieving URLs even if you need an account to access them. But I think it would be very convenient to also provide the direct URL, usable without an account.

neon-ninja commented 3 years ago

This issue seems specific to groups, not pages.

The photos issue is tricky: if you need an account to resolve the URL to the full quality image, the scraper cannot resolve it unless you feed it cookies. It would still be possible to extract the low quality image. Perhaps we should always extract the low quality image, and also try to extract the full quality image if possible.
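To illustrate the fallback idea, here is a minimal sketch, not the scraper's actual implementation. It assumes that a direct image link is one hosted on an fbcdn.net subdomain (as in the URLs quoted above) and that anything else needs a login. The names `is_direct_image_url` and `pick_image` are hypothetical.

```python
from typing import Optional
from urllib.parse import urlparse


def is_direct_image_url(url: str) -> bool:
    """Return True if the URL points at the fbcdn.net CDN (assumed to be directly fetchable)."""
    host = urlparse(url).hostname or ""
    return host == "fbcdn.net" or host.endswith(".fbcdn.net")


def pick_image(high_quality: Optional[str], low_quality: Optional[str]) -> Optional[str]:
    """Prefer the full quality URL when it is a direct link, else fall back to the low quality one."""
    if high_quality and is_direct_image_url(high_quality):
        return high_quality
    return low_quality
```

With the two image URLs from the original report, the m.facebook.com link would be rejected and the scontent-cdt1-1.xx.fbcdn.net link accepted.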

With the URLs problem, the regex needs to be updated for group posts. This problem was also reported in https://github.com/kevinzg/facebook-scraper/issues/165.

Breizhux commented 3 years ago

In my case I need the images because they can contain information. (I'm creating a Facebook page to RSS feed converter, so the goal is to work without an account.)

In general, I think it would be good to return a working link based on what is available, even if the quality ends up being poor. But I understand the reason for offering the best quality link. Or maybe even offer the different qualities available...

If not, maybe there is a way to get the HTML code of the posts? I could not find out whether that is possible. From there I could extract the link myself, which could be enough for me.

As for the direct links of the groups, they all seem to be built in the same way: https://m.facebook.com/groups//permalink// I tested this on several Facebook groups and several posts, and I didn't get any errors. For Facebook pages, it seems much more complicated to me...
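A minimal sketch of building such a group permalink. It assumes the two empty path segments in the pattern above stand for the group name and the post ID, as in the permalink quoted in the first comment. The function name `group_permalink` and its parameters are hypothetical.

```python
def group_permalink(group: str, post_id: str, host: str = "m.facebook.com") -> str:
    """Build a group post permalink from a group name (or ID) and a post ID."""
    return f"https://{host}/groups/{group}/permalink/{post_id}/"


# Rebuilds the account-free URL from the first comment of this thread
print(group_permalink("saintyves.rennes", "1360623547663812", host="www.facebook.com"))
```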

neon-ninja commented 3 years ago

@Breizhux I've raised a pull request to always return the low quality image (possibly in addition to the high quality one), see https://github.com/kevinzg/facebook-scraper/pull/217

It is possible to get the HTML: the parameter is remove_source, e.g. get_posts(account, remove_source=False)
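Once the post HTML is available, the direct image links could be pulled out along these lines. This is only a sketch using the standard library. How the raw HTML is exposed on each post is not stated in this thread, so the function takes a plain HTML string. The names `ImageSrcCollector` and `direct_image_urls` are hypothetical, and treating fbcdn.net hosts as direct links is an assumption based on the URLs quoted above.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urlparse


class ImageSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def direct_image_urls(html: str) -> List[str]:
    """Return the <img> sources hosted on fbcdn.net, in document order."""
    collector = ImageSrcCollector()
    collector.feed(html)
    result = []
    for url in collector.sources:
        host = urlparse(url).hostname or ""
        if host == "fbcdn.net" or host.endswith(".fbcdn.net"):
            result.append(url)
    return result
```

The parser unescapes attribute values, so an `&amp;` in the HTML comes back as a plain `&` in the returned URL.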

I've also raised a separate pull request to fix the regexes for group posts - https://github.com/kevinzg/facebook-scraper/pull/216

pmdscully commented 3 years ago

Thanks for the change @neon-ninja. Note that since the merge, the image and images fields come back empty (i.e. None) instead of being populated.

VariabileAleatoria commented 3 years ago

I'm currently facing the same problem, not on a group but on a shared post on a page:

>>> list(get_posts('realgoblinhours', pages=2, cookies='cookies.txt'))[0]['image']
'https://m.facebook.com/photo/view_full_size/?fbid=796496914325572&ref_component=mbasic_photo_permalink&ref_page=%2Fwap%2Fphoto.php&refid=13&__tn__=%2Cg'
neon-ninja commented 3 years ago

Since you posted that comment, that post is no longer the first on the page. This code works fine though:

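from facebook_scraper import get_posts

# Fetch one specific post by URL, authenticating with exported browser cookies
posts = list(get_posts(
    post_urls=["https://m.facebook.com/story.php?story_fbid=4059210580793355&id=2261226117258486"],
    cookies="cookies.txt"
))
print(posts[0]["image"])  # URL of the post's image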

outputs

https://scontent.fakl1-3.fna.fbcdn.net/v/t1.6435-9/fr/cp0/e15/q65/180791935_796496920992238_7336250371130696992_n.jpg?_nc_cat=1&ccb=1-3&_nc_sid=110474&efg=eyJpIjoidCJ9&_nc_ohc=SzkHYeSZNWUAX-37aIG&_nc_ht=scontent.fakl1-3.fna&tp=14&oh=981ced36d2e8fbe4b243bae53e2f93e3&oe=60B24C27&manual_redirect=1

VariabileAleatoria commented 3 years ago

Unfortunately this doesn't do the trick for me:

>>> posts = list(get_posts(
...     post_urls=["https://m.facebook.com/story.php?story_fbid=4059210580793355&id=2261226117258486"],
...     cookies="cookies.txt"
... ))
>>> print(posts[0]["image"])
https://m.facebook.com/photo/view_full_size/?fbid=796496914325572&ref_component=mbasic_photo_permalink&ref_page=%2Fwap%2Fphoto.php&refid=13&__tn__=%2Cg
neon-ninja commented 3 years ago

You might need to recreate cookies.txt after changing your language

VariabileAleatoria commented 3 years ago

I thought I had already done that, but it works now.
I probably logged out in the browser, which I guess invalidated the cookies.

neon-ninja commented 3 years ago

I pushed a commit that warns about non-en_US locales present in the result HTML, which should help with this kind of problem: https://github.com/kevinzg/facebook-scraper/commit/21ac8c49a71fcc0a17ec487e2b80d4a423dcfa83
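For readers who want a similar check in their own code, here is a minimal sketch of a locale warning. It is not the code from that commit. It assumes the locale can be read from the `lang` attribute of the `<html>` tag and that "en", "en-US" and "en_US" all count as English. The names `page_locale` and `warn_if_not_english` are hypothetical.

```python
import re
import warnings
from typing import Optional

# Matches the lang attribute of the opening <html> tag (assumed location of the locale)
LANG_ATTR = re.compile(r'<html[^>]*\blang="([^"]+)"', re.IGNORECASE)


def page_locale(html: str) -> Optional[str]:
    """Return the value of the <html lang="..."> attribute, or None if absent."""
    match = LANG_ATTR.search(html)
    return match.group(1) if match else None


def warn_if_not_english(html: str) -> bool:
    """Emit a warning and return True when the page declares a non-English locale."""
    locale = page_locale(html)
    if locale is None:
        return False
    if locale.replace("-", "_").lower() in ("en", "en_us"):
        return False
    warnings.warn(f"Page locale is {locale!r}; results may be incomplete outside en_US")
    return True
```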