biggestsonicfan opened this issue 2 years ago
I managed to get further with `Gaston18Colores`. A total of 697 files, with 15 of them getting the `URL signature mismatch` (22 bytes each) error. `Salmon88` is still a no-go.
Getting rid of `m_pixel_ratio` and `wd` from my cookies.txt file fixed the `URL signature mismatch` errors. Still no luck grabbing all photos, or anything from `100044454968663`.
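(For reference, stripping those entries can also be scripted rather than done by hand; a rough sketch, assuming the standard Netscape cookies.txt layout where the cookie name is the sixth tab-separated field, and with the output filename arbitrary:)

```python
# Drop the m_pixel_ratio and wd entries from a Netscape-format cookies.txt.
BAD_COOKIES = {"m_pixel_ratio", "wd"}

with open("cookies.txt") as f:
    lines = f.readlines()

kept = []
for line in lines:
    fields = line.rstrip("\n").split("\t")
    # non-comment lines have 7 tab-separated fields; the cookie name is field 6
    if len(fields) >= 7 and fields[5] in BAD_COOKIES:
        continue
    kept.append(line)

with open("fb-cookies.txt", "w") as f:
    f.writelines(kept)
```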
> only grabbed 672 of their images
How do you know that's not everything?
Maybe try using `get_posts` instead of `get_photos`, to iterate through posts in the timeline and collect images from the posts.
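Something along these lines (untested sketch; `page_with_photos` and the cookies file are taken from your snippet):

```python
from facebook_scraper import get_posts

page_with_photos = "lewdpaws"  # or whichever page you're scraping

image_urls = []
for post in get_posts(page_with_photos, pages=None, cookies="fb-cookies.txt"):
    # each post dict has an "images" key listing the attached image URLs
    image_urls.extend(post.get("images") or [])
```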
> get around the invalid URL signature mismatch files by passing a cookie to the python request.
You can use this code to use your cookies when requesting images:

```python
from facebook_scraper import _scraper

image.write(_scraper.get(img).content)
```

But it shouldn't be necessary in theory, as CDN links shouldn't require cookies.
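In context that looks roughly like this (sketch; `image_urls` would be the list collected by the `get_posts` sketch above, and the filename scheme is made up):

```python
from facebook_scraper import _scraper

def download_all(image_urls):
    # _scraper is the module-level scraper instance, so it reuses whatever
    # cookies the get_posts/get_photos call was given and sends them along
    # with the CDN request.
    for i, img in enumerate(image_urls):
        with open(f"photo_{i:05d}.jpg", "wb") as image:
            image.write(_scraper.get(img).content)
```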
> I decided to see what `for profile in get_profile(page_with_photos, cookies="fb-cookies.txt")` would return, and I was very confused at the output:
`get_profile` returns a `dict`. If you iterate on a `dict`, it's the same as iterating on its `.keys()`. Try `pprint`-ing it instead of iterating on it.
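For example, using the same call from your snippet:

```python
from pprint import pprint
from facebook_scraper import get_profile

page_with_photos = "100044454968663"  # or whichever page you're checking

profile = get_profile(page_with_photos, cookies="fb-cookies.txt")
pprint(profile)  # shows the full dict, not just its keys
```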
> Did not see the issue template (as no one else seemed to use it).
I appreciate you following the issue template ;)
> no luck grabbing all photos, or anything from 100044454968663
`get_photos` doesn't currently go into albums, it just looks for photos that are visible on the "All photos" section on pages such as https://m.facebook.com/pg/Nintendo/photos/. https://m.facebook.com/100044454968663/photos doesn't seem to have such an "All photos" section, which is why `get_photos` isn't returning anything for this account. This is how @lennoxho designed it in https://github.com/kevinzg/facebook-scraper/pull/289. See https://github.com/kevinzg/facebook-scraper/issues/576.
> How do you know that's not everything?
Scrolling the entire photos page of `lewdpaws`, I can see about 100 more photos left in the gallery after scraping, comparing the last photo downloaded to its position within the "all photos" page.
> You can use this code to use your cookies when requesting images:
Oh interesting, I will keep that in mind!
> get_photos doesn't currently go into albums, it just looks for photos that are visible on the "All photos" section on pages such as https://m.facebook.com/pg/Nintendo/photos/.
Ah that very much explains it then, thank you!
I will do a deeper dive into the `lewdpaws` "all photos" page, probably spin up a new temporary folder to download into just to see how far it really gets, and give a specific number again with the altered cookie settings and using the suggested `image.write(_scraper.get(img).content)` call!
Reporting back: `get_posts` is working much better than `get_photos`, but I've encountered an interesting situation that's probably quite rare:

This example post shows a Facebook link to a tweet. Obviously some sort of metadata is saved on Facebook, but the twitter account that made the post (@berensbby) no longer exists. Therefore the `images` key of the post ends up pointing at a banner, https://pbs.twimg.com/profile_banners/1008250875788320768/1591023218/1500x500. Attempting to use `image.write(_scraper.get(img).content)` on that URL throws various `Document is empty` errors.
I feel this might fall slightly out of the scope of facebook-scraper but I did want to bring this to the attention of the developers.
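A simple guard around the download would at least keep the loop going when one of these URLs fails (rough sketch; `save_image` is a hypothetical helper, not something from my actual script):

```python
from facebook_scraper import _scraper

def save_image(img, path):
    """Try to download one image URL; skip anything that errors out."""
    try:
        response = _scraper.get(img)
    except Exception as exc:  # e.g. the "Document is empty" parse errors
        print(f"Skipping {img}: {exc}")
        return False
    with open(path, "wb") as image:
        image.write(response.content)
    return True
```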
EDIT: Using `get_posts` on `lewdpaws` is giving me many undesired results. Lots of links to memes, or links to posted content that isn't theirs which they are sharing...
EDIT2: It appears that, just like above, `get_posts` is simply not going to be a viable option due to what Facebook has stored in its metadata as `images`. This lewdpaws post, for example, links to their picarto.tv account, where the "socials" image was captured as https://picarto.tv/images/socials/socials-cover.jpg. While in the above example the image no longer existed, this image results in picarto.tv's 404 page.
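One could at least drop anything not hosted on Facebook's own CDN before downloading (rough sketch; assuming fbcdn.net covers the real photo hosts), but that doesn't fix what Facebook puts in the metadata:

```python
from urllib.parse import urlparse

def is_facebook_cdn(url):
    """Keep only images served from Facebook's own CDN (e.g. scontent-*.fbcdn.net)."""
    return urlparse(url).netloc.endswith(".fbcdn.net")

# e.g. wanted = [img for img in post["images"] if is_facebook_cdn(img)]
```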
Just going to go ahead and write this one off as "too many variables to take into consideration"; it's not a problem that can actually be solved. I've moved on to different software for scraping.
The scraping software I moved to now charges $100 a year, so I guess I will reopen this. I will attempt to revisit the script and my original code soon, but overall I think there's a difference between "providing a tool" and "making it do what you want": facebook-scraper is the tool, many of us want to use it in different ways, and the implementation of those ways falls outside the scope of these issues.
Greetings,
For a very long time, I have been looking for a reasonable way to scrape artists' galleries from Facebook, and nothing in the last 5-7 years has come close to ever working. Finally tired of it all, I found this repo with an up-to-date scraper! My snippet code is as follows, with `facebook-scraper==0.2.56`:
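(The snippet itself isn't preserved in this thread; what follows is a rough reconstruction from the description below, with the filename scheme and dict keys being assumptions rather than the original code:)

```python
import requests
from facebook_scraper import get_photos

page_with_photos = "lewdpaws"

# pull every photo post the scraper can see and save the images one by one
for n, photo in enumerate(get_photos(page_with_photos, pages=100, cookies="fb-cookies.txt")):
    for i, img in enumerate(photo.get("images") or []):
        with open(f"{page_with_photos}_{n}_{i}.jpg", "wb") as image:
            image.write(requests.get(img).content)
```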
Basically I find the images from `page_with_photos` and begin to save them one by one. Nothing too out of the ordinary here. Well, the first artist I wished to scrape was `lewdpaws`, and it only grabbed 672 of their images, with 17 of the images being 22 bytes in size with the raw contents `URL signature mismatch`. I thought this was due to my `pages` initially set to `100`, but after setting `pages` to `None` I still can't get more than the 672 results.

After giving up on `lewdpaws`, I moved on to my next gallery, `Gaston18Colores`. I managed to pull the first page with `pages` set to `1` as a test. This was the one and only time I could get data from this page. Now the script seemingly hangs forever, doing nothing, when I have `page_with_photos` set to `Gaston18Colores`.

Tired, I decided to try my last gallery, which does not have a vanity username but has a profile id of `100044454968663`. Using `get_photos` with this value as the `account`, the script ends almost immediately after it starts. Curious, I decided to see what `for profile in get_profile(page_with_photos, cookies="fb-cookies.txt")` would return, and I was very confused at the output:

So all three of the galleries I wanted to scrape seemingly had issues, but where the problem lies, I can't quite say. I feel like it isn't the code, but I bet I might be able to get around the invalid `URL signature mismatch` files by passing a cookie to the python request.

Any ideas?
EDIT: Did not see the issue template (as no one else seemed to use it).