kevinzg / facebook-scraper

Scrape Facebook public pages without an API key
MIT License

Strange behavior with get_photos and get_profile #762

Open biggestsonicfan opened 2 years ago

biggestsonicfan commented 2 years ago

Greetings,

For a very long time I have been looking for a reasonable way to scrape artists' galleries from Facebook, and nothing in the last 5-7 years has come close to working. Finally tired of it all, I found this repo with an up-to-date scraper! My snippet is as follows, with facebook-scraper==0.2.56:

from facebook_scraper import get_photos
import os
from urllib.parse import urlparse
import time
import requests

#page_with_photos = 'lewdpaws' #627 images
#page_with_photos = 'Gaston18Colores'
page_with_photos = '100044454968663'
root_dir = '/run/media/XXXXX/bfd18/dl/fb-dl/'
save_dir = os.path.join(root_dir, page_with_photos)

if not os.path.exists(save_dir):
    os.makedirs(save_dir)

# Walk every photo post on the page and save each image not already on disk.
for post in get_photos(account=page_with_photos, cookies="fb-cookies.txt", pages=None):
    for img in post['images']:
        if img is not None:
            filename = '[Facebook] - ' + page_with_photos + ' - ' + os.path.basename(urlparse(img).path)
            full_path = os.path.join(save_dir, filename)
            if not os.path.exists(full_path):
                with open(full_path, 'wb') as image:
                    image.write(requests.get(img).content)
                print("File " + filename + " downloaded!")
                time.sleep(5)  # pause between downloads to avoid hammering the CDN

Basically I find the images from page_with_photos and save them one by one; nothing too out of the ordinary here. Well, the first artist I wanted to scrape was lewdpaws, and I only grabbed 672 of their images, with 17 of them being 22 bytes in size with the raw contents URL signature mismatch. I thought this was because I had initially set pages to 100, but after setting pages to None I still can't get more than the 672 results.
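For anyone else hitting this: since the bad files are all 22 bytes containing the literal text URL signature mismatch, a guard before writing would at least keep them off disk. A rough sketch based on my snippet above (the exact error body is my assumption from the files I ended up with):

import requests

def fetch_image_bytes(url):
    # Returns the image bytes, or None if Facebook's CDN answered with its
    # 22-byte "URL signature mismatch" error body instead of an actual file.
    data = requests.get(url).content
    if data.strip() == b"URL signature mismatch":
        return None
    return data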

After giving up on lewdpaws, I moved on to my next gallery, Gaston18Colores. I managed to pull the first page with pages set to 1 as a test. That was the one and only time I could get data from this page; now the script seemingly hangs forever, doing nothing, whenever page_with_photos is set to Gaston18Colores.

Tired, I decided to try my last gallery, which does not have a vanity username but has the profile id 100044454968663. Using get_photos with this value as the account, the script ends almost immediately after it starts. Curious, I decided to see what for profile in get_profile(page_with_photos, cookies="fb-cookies.txt") would return, and I was very confused by the output:

latest_post_id
Friend_count
Follower_count
Following_count
cover_photo_text
cover_photo
profile_picture
id
Name
Category
Contact info
Basic info
Page transparency
See all
Life events

So all three of the galleries I wanted to scrape seemingly had issues, though I can't quite say where the problem lies. I don't think it's my code, and I suspect I could get around the URL signature mismatch files by passing a cookie to the Python request.

Any ideas?

EDIT: Did not see the issue template (as no one else seemed to use it).

biggestsonicfan commented 2 years ago

I managed to get further with Gaston18Colores: a total of 697 files, with 15 of them getting the URL signature mismatch (22 bytes each) error. Salmon88 is still a no-go.

biggestsonicfan commented 2 years ago

Getting rid of m_pixel_ratio and wd from my cookies.txt file fixed the URL signature mismatch errors. Still no luck grabbing all photos, or anything from 100044454968663.

neon-ninja commented 2 years ago

only grabbed 672 of their images

How do you know that's not everything?

Maybe try using get_posts instead of get_photos, to iterate through the posts in the timeline and collect the images from those posts.
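Something like this (untested sketch; get_posts yields post dicts carrying the same images key your get_photos loop already reads):

from facebook_scraper import get_posts

# Walk the page's timeline instead of the "All photos" section;
# each post dict carries an 'images' list of URLs.
for post in get_posts(account="lewdpaws", cookies="fb-cookies.txt", pages=None):
    for img in post.get("images") or []:
        print(img)  # download/save the URL here, as in the original loop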

get around the URL signature mismatch files by passing a cookie to the Python request.

You can use this code to use your cookies when requesting images:

from facebook_scraper import _scraper  # the library's internal requests session, already carrying your cookies
image.write(_scraper.get(img).content)

But it shouldn't be necessary in theory, as CDN links shouldn't require cookies

I decided to see what for profile in get_profile(page_with_photos, cookies="fb-cookies.txt") would return, and I was very confused by the output:

get_profile returns a dict. If you iterate over a dict, it's the same as iterating over its .keys(). Try pprint-ing it instead of iterating over it.
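For example, with the id you tried:

from pprint import pprint
from facebook_scraper import get_profile

profile = get_profile("100044454968663", cookies="fb-cookies.txt")
pprint(profile)  # prints the keys *and* their values, not just the key names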

Did not see the issue template (as no one else seemed to use it).

I appreciate you following the issue template ;)

no luck grabbing all photos, or anything from 100044454968663

get_photos doesn't currently go into albums, it just looks for photos that are visible on the "All photos" section on pages such as https://m.facebook.com/pg/Nintendo/photos/. https://m.facebook.com/100044454968663/photos doesn't seem to have such an "All photos" section, which is why get_photos isn't returning anything for this account. This is how @lennoxho designed it in https://github.com/kevinzg/facebook-scraper/pull/289. See https://github.com/kevinzg/facebook-scraper/issues/576

biggestsonicfan commented 2 years ago

How do you know that's not everything?

Scrolling through the entire photos page of lewdpaws, I can see about 100 more photos left in the gallery after scraping, judging by where the last downloaded photo sits within the all-photos page.

You can use this code to use your cookies when requesting images:

Oh interesting, I will keep that in mind!

get_photos doesn't currently go into albums, it just looks for photos that are visible on the "All photos" section on pages such as https://m.facebook.com/pg/Nintendo/photos/.

Ah that very much explains it then, thank you!

I will do a deeper dive into the lewdpaws "all photos" page: probably spin up a new temporary folder to download into, just to see how far it really gets, and report a specific number again with the altered cookie settings and the image.write(_scraper.get(img).content) approach!

biggestsonicfan commented 2 years ago

Reporting back: get_posts is working much better than get_photos, but I've encountered an interesting situation that's probably quite rare:

This example post shows a Facebook link to a tweet. Facebook evidently saved some of the tweet's metadata, but the Twitter account that made the post (@berensbby) no longer exists. As a result, the images key of the post contains the account's banner URL, https://pbs.twimg.com/profile_banners/1008250875788320768/1591023218/1500x500, and attempting image.write(_scraper.get(img).content) on it throws various Document is empty errors.

I feel this might fall slightly outside the scope of facebook-scraper, but I did want to bring it to the developers' attention.

EDIT: Using get_posts on lewdpaws is giving me many undesired results: lots of links to memes, and links to shared content that isn't theirs...

EDIT2: It appears that, just like above, get_posts is simply not going to be a viable option, because of what Facebook has stored as images in its metadata. This lewdpaws post, for example, links to their picarto.tv account, and the "socials" image was captured as https://picarto.tv/images/socials/socials-cover.jpg. While in the above example the image no longer existed, this one resolves to picarto.tv's 404 page.
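If anyone else runs into this, filtering the image URLs by host would probably weed out these link-preview thumbnails. A rough sketch, assuming real photos are always served from an fbcdn.net host (true for everything I actually downloaded, but I haven't verified it exhaustively):

from urllib.parse import urlparse

def is_facebook_cdn(url):
    # Keep only images served from Facebook's own CDN, dropping external
    # link-preview thumbnails (twimg.com, picarto.tv, ...).
    host = urlparse(url).hostname or ""
    return host == "fbcdn.net" or host.endswith(".fbcdn.net")

# The fbcdn.net URL below is made up for illustration; the other two are
# the real offenders from this thread.
urls = [
    "https://scontent.fbcdn.net/v/t1.0-9/example.jpg",
    "https://pbs.twimg.com/profile_banners/1008250875788320768/1591023218/1500x500",
    "https://picarto.tv/images/socials/socials-cover.jpg",
]
print([u for u in urls if is_facebook_cdn(u)])  # only the fbcdn.net URL survives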

biggestsonicfan commented 1 year ago

Just going to write this one off as "too many variables to take into consideration"; it isn't a problem that can actually be solved. I've moved on to different software for scraping.

biggestsonicfan commented 1 year ago

The scraping software I moved to now charges $100 a year, so I guess I will reopen this. I will try to revisit the script and my original code soon, but overall I think there's a difference between "providing a tool" and "making it do what you want": facebook-scraper is the tool, many of us want to use it in different ways, and implementing those ways falls outside the scope of these issues.