Open Breizhux opened 3 years ago
This issue seems specific to groups, not pages.
The photos issue is tricky: if you need an account to resolve the URL to the full quality image, the scraper cannot resolve it unless you feed it cookies. It would still be possible to extract the low quality image. Perhaps we should always extract the low quality image, and also try to extract the full quality image if possible.
With the URLs problem, the regex needs to be updated for group posts. This problem was also reported in https://github.com/kevinzg/facebook-scraper/issues/165.
In my case I need the images because they can contain information. (I'm building a Facebook page to RSS feed converter, so the goal is to work without an account.)
In general, I think it would be good to return a working link based on whatever is available, even if the quality ends up being poor. But I understand the reason for offering the best quality link. Or maybe even offer the different qualities available...
If not, is there a way to get the HTML of the posts? I could not find out whether that is possible. From there I could extract the link myself, and that could be enough for me.
As for the direct links of the groups, they just seem to be all built in the same way:
https://m.facebook.com/groups/
@Breizhux I've raised a pull request to always return the low quality image (possibly in addition to the high quality one), see https://github.com/kevinzg/facebook-scraper/pull/217
It is possible to get the HTML, the parameter is remove_source, e.g. get_posts(account, remove_source=False)
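As a rough sketch of what could be done with a post's raw HTML once it is available: the helper below pulls the src of every img tag out of an HTML string using only the standard library. The name extract_image_urls and the sample markup are illustrative assumptions, not part of facebook-scraper. How the post's HTML is exposed in the returned dict should be checked against the library version in use.

```python
from html.parser import HTMLParser


class _ImgCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # skip <img> tags without a usable src
                self.urls.append(src)


def extract_image_urls(html):
    """Hypothetical helper: return all <img> src values found in an HTML string."""
    collector = _ImgCollector()
    collector.feed(html)
    return collector.urls


# Made-up markup standing in for a post's HTML.
sample = '<div><p>Hello</p><img src="https://example.com/a.jpg"><img alt="no src"></div>'
print(extract_image_urls(sample))
```

The helper takes a plain string, so whatever object the scraper returns for the post source would first need to be converted to its HTML text.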
I've also raised a separate pull request to fix the regexes for group posts - https://github.com/kevinzg/facebook-scraper/pull/216
Thanks for the change @neon-ninja. Note that the merge now empties the image and images fields (i.e. =None) instead of populating them.
I'm currently facing the same problem not on groups but on a shared post on a page
>>> list(get_posts('realgoblinhours', pages=2, cookies='cookies.txt'))[0]['image']
'https://m.facebook.com/photo/view_full_size/?fbid=796496914325572&ref_component=mbasic_photo_permalink&ref_page=%2Fwap%2Fphoto.php&refid=13&__tn__=%2Cg'
In the time since you posted that comment, that post is no longer the first on the page. This code works fine though:
posts = list(get_posts(
post_urls=["https://m.facebook.com/story.php?story_fbid=4059210580793355&id=2261226117258486"],
cookies="cookies.txt"
))
print(posts[0]["image"])
outputs
Unfortunately this doesn't do the trick for me:
>>> posts = list(get_posts(
... post_urls=["https://m.facebook.com/story.php?story_fbid=4059210580793355&id=2261226117258486"],
... cookies="cookies.txt"
... ))
>>> print(posts[0]["image"])
https://m.facebook.com/photo/view_full_size/?fbid=796496914325572&ref_component=mbasic_photo_permalink&ref_page=%2Fwap%2Fphoto.php&refid=13&__tn__=%2Cg
You might need to recreate cookies.txt after changing your language.
I thought I had already done that, but it works now.
I probably logged out in the browser, which I guess invalidates the cookies.
I pushed a commit to warn about non-en_US locales present in the result HTML, which should help with this kind of problem: https://github.com/kevinzg/facebook-scraper/commit/21ac8c49a71fcc0a17ec487e2b80d4a423dcfa83
Hello, I noticed that the URLs of the retrieved images are not always usable. Sometimes they are direct links on the domain "scontent-cdt1-1.xx.fbcdn.net". But sometimes the URL is not direct and requires authentication to resolve; in that case the domain is m.facebook.com.
Public page: https://fr-fr.facebook.com/groups/saintyves.rennes/
Post concerned: https://www.facebook.com/groups/saintyves.rennes/permalink/1360623547663812/
URL that is retrieved for the image: https://m.facebook.com/photo/view_full_size/?fbid=3861145620587869&ref_component=mbasic_photo_permalink&ref_page=%2Fwap%2Fphoto.php&refid=13&__tn__=%2Cg
The following URL would be much more relevant: https://scontent-cdt1-1.xx.fbcdn.net/v/t1.6435-9/s960x960/176314059_3861145627254535_6708760356773320290_n.jpg?_nc_cat=110&ccb=1-3&_nc_sid=825194&_nc_ohc=nTqGUQ-o0h0AX_bnWV-&_nc_ht=scontent-cdt1-1.xx&tp=7&oh=c7ab3b2c862064ae5c12503f6707f434&oe=60A68072
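To tell the two kinds of image URL apart on the consumer side, a minimal sketch could check the host name. The function name is_direct_image_url is hypothetical, and the rule "fbcdn.net means direct, anything else may need a login" is an assumption drawn only from the two example URLs in this report.

```python
from urllib.parse import urlparse


def is_direct_image_url(url):
    """Hypothetical check: True if the URL points at the fbcdn.net CDN."""
    host = (urlparse(url).hostname or "").lower()
    # Match the bare domain or any subdomain such as scontent-cdt1-1.xx.fbcdn.net.
    return host == "fbcdn.net" or host.endswith(".fbcdn.net")


cdn_url = "https://scontent-cdt1-1.xx.fbcdn.net/v/t1.6435-9/s960x960/176314059_3861145627254535_6708760356773320290_n.jpg"
indirect_url = "https://m.facebook.com/photo/view_full_size/?fbid=3861145620587869"

print(is_direct_image_url(cdn_url))
print(is_direct_image_url(indirect_url))
```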
While creating this ticket, I noticed that the URL of the post is not ideal either.
The URL retrieved for the post is: https://facebook.com/439909623068547/posts/1360623547663812
While this URL is usable without an account: https://www.facebook.com/groups/saintyves.rennes/permalink/1360623547663812/
I understand the idea of retrieving URLs even if you need an account to access them. But I think it would be very convenient to also provide the direct URL, usable without an account.
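Until the scraper returns such a link itself, the account-free form could be rebuilt from the group name and the post ID. This is only a sketch: group_permalink is a hypothetical helper, and the URL pattern is inferred from the single example in this report, so it may not hold for every group.

```python
def group_permalink(group, post_id):
    """Hypothetical helper: build a group post permalink from its parts.

    The pattern is copied from the one example URL in this issue.
    """
    return f"https://www.facebook.com/groups/{group}/permalink/{post_id}/"


print(group_permalink("saintyves.rennes", "1360623547663812"))
```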