Open vadimkantorov opened 1 year ago
That's quite an interesting environment. Yes, letting the user customize the executable path would be a great idea.
Also, the scraper currently does not click "Accept all cookies" (a banner shown in the EU), which leads to a hang. I worked around it by clicking the button manually in non-headless mode, but this problem prevents using headless mode.
Another idea is to implement a mode that downloads images while scraping (as image URLs can sometimes expire).
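One way to handle the banner automatically could look like the sketch below. It assumes a Selenium-style driver whose find_elements(by, value) returns a list of elements; the function name, the button labels, and the XPath are illustrative guesses, not taken from the scraper's code.

```python
import time

# Assumed button labels; the real banner text may differ by locale and over time.
COOKIE_BUTTON_LABELS = ("Accept all cookies", "Allow all cookies")


def dismiss_cookie_banner(driver, labels=COOKIE_BUTTON_LABELS, timeout=5.0, poll=0.5):
    """Click the first visible cookie-consent button; return True if one was clicked.

    `driver` is any object with a Selenium-style find_elements(by, value) method.
    The function gives up after `timeout` seconds instead of waiting forever, so
    a missing banner (for example outside the EU) does not hang the crawl.
    """
    deadline = time.monotonic() + timeout
    while True:
        for label in labels:
            # "xpath" is the string value of selenium's By.XPATH.
            xpath = f'//*[@role="button" or self::button][contains(., "{label}")]'
            for element in driver.find_elements("xpath", xpath):
                if element.is_displayed():
                    element.click()
                    return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

Calling this once right after the page loads, and ignoring a False result, would keep headless runs from blocking on the modal.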
Thanks for letting me know about that "Accept all cookies" modal.
Downloading images seems like a good idea, but sending HTTP requests simultaneously while crawling might get us blocked quickly as well.
I guess an option to download images to a given folder would be a good knob to have. Also, sometimes an image is a photo attached to a post, and sometimes it's just a cover image attached to a post from a link (in which case it is often on an external domain and has no file extension before the question mark in the URL); maybe a distinction would be good. Probably all images can just be downloaded to a single given output directory, provided their names are auto-generated and unique (maybe just use the basename of the URL).
Another way could be to download the images immediately after crawling the posts themselves (or to have an option to choose when the photos are downloaded: together with the posts, or after the posts crawl is completed).
Also, a progress report / verbose option would be nice, to be sure the crawl hasn't hung because of problems with web page element waiters (which was the case with the cookies modal).
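As a rough illustration of the naming idea, the sketch below derives a file name from the URL basename, drops the query string, and falls back to a default extension when the path has none. The function name, the hash prefix, and the ".jpg" fallback are assumptions made for this example.

```python
import hashlib
import posixpath
from urllib.parse import urlsplit


def image_filename(url, default_ext=".jpg"):
    """Derive a stable, unique local file name for an image URL.

    The name is the basename of the URL path (query string dropped), prefixed
    with a short hash of the full URL so that two URLs sharing a basename do
    not overwrite each other.
    """
    path = urlsplit(url).path
    base = posixpath.basename(path.rstrip("/")) or "image"
    stem, ext = posixpath.splitext(base)
    if not ext:
        # Assumed fallback: the real type is unknown without making a request.
        ext = default_ext
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:10]
    return f"{digest}_{stem}{ext}"
```

Hashing the whole URL keeps the name deterministic across runs, so a re-crawl can skip images that are already on disk.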
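The "download after the crawl" variant might be sketched like this: collect the image URLs while crawling, then fetch them one at a time with a pause between requests to lower the risk of being blocked. All names here are hypothetical, the numbered file names are a placeholder scheme, and the one-second delay is an arbitrary guess rather than a known safe rate.

```python
import time
import urllib.request
from pathlib import Path


def fetch_bytes(url, timeout=30):
    """Download one URL with the standard library and return the raw body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_images(urls, out_dir, delay=1.0, fetch=fetch_bytes):
    """Download images sequentially after the crawl, pausing between requests.

    Returns a dict mapping each URL to its saved Path, or to None on failure.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = {}
    for index, url in enumerate(dict.fromkeys(urls)):  # de-duplicate, keep order
        if index:
            time.sleep(delay)  # throttle between consecutive requests
        target = out_dir / f"{index:05d}.img"  # placeholder naming scheme
        try:
            target.write_bytes(fetch(url))
            saved[url] = target
        except OSError:  # urllib.error.URLError is a subclass of OSError
            saved[url] = None
    return saved
```

Passing `fetch` as a parameter keeps the loop easy to test and leaves room to swap in a browser-session download if plain requests turn out to be rejected.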
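A verbose mode could be as simple as routing every wait through one helper that logs what it is waiting for and raises after a timeout instead of blocking forever. The sketch below uses only the standard library; the helper name, the logger name, and the default timings are assumptions.

```python
import logging
import time

logger = logging.getLogger("scraper.progress")  # hypothetical logger name


def wait_until(condition, description, timeout=30.0, poll=1.0):
    """Poll condition() until it is truthy, logging progress along the way.

    Returns the truthy result, or raises TimeoutError once `timeout` seconds
    have passed, so a stuck waiter becomes a visible error instead of a hang.
    """
    start = time.monotonic()
    attempt = 0
    while True:
        attempt += 1
        result = condition()
        elapsed = time.monotonic() - start
        if result:
            logger.info("%s: ready after %.1fs", description, elapsed)
            return result
        if elapsed >= timeout:
            raise TimeoutError(f"{description}: not ready after {elapsed:.1f}s")
        logger.debug("%s: still waiting (attempt %d, %.1fs)", description, attempt, elapsed)
        time.sleep(poll)
```

A verbose flag would then only need to call logging.basicConfig(level=logging.DEBUG) to make each wait visible on the console.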
Thanks, @vadimkantorov, for the fabulous ideas. I will implement them eventually; it may take some time, but they seem like great ideas.
I have a funny env: WSLv1 Linux on Windows. Python runs inside the Linux emulation, while Chrome and chromedriver.exe run on Windows. I have a symlink /usr/bin/chromedriver pointing to chromedriver.exe. This all works well, but the automatic driver installer may get confused. So an option to pass the driver's executable_path explicitly to a Facebook_scraper instance would be nice! Thanks! (When I patch Initializer manually, everything works well!)
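For illustration, the path selection could be sketched as below: prefer an explicitly given path, otherwise look the driver up on PATH, and return None so the caller can still fall back to the automatic installer. The function name and its parameters are hypothetical; how Facebook_scraper would expose such an option is up to the maintainer.

```python
import os
import shutil


def resolve_driver_path(executable_path=None, name="chromedriver"):
    """Return an explicit driver path if given, else look `name` up on PATH.

    Returns None when nothing is found, so the caller can fall back to an
    automatic installer.
    """
    if executable_path:
        path = os.path.expanduser(executable_path)
        # os.path.exists follows symlinks, e.g. one pointing to chromedriver.exe.
        if not os.path.exists(path):
            raise FileNotFoundError(f"driver not found: {path}")
        return path
    return shutil.which(name)
```

With Selenium 4, the returned path could then be passed as Service(executable_path=...) from selenium.webdriver.chrome.service when constructing webdriver.Chrome(service=...).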