shaikhsajid1111 / facebook_page_scraper

Scrapes the front end of Facebook pages with no limitations and provides a feature to turn the data into structured JSON or CSV
https://pypi.org/project/facebook-page-scraper/
MIT License

Allow providing driver's executable_path explicitly without going through .install() method #36

Open vadimkantorov opened 1 year ago

vadimkantorov commented 1 year ago

I have an unusual environment: WSLv1 Linux on Windows. Python runs inside the Linux emulation, while Chrome and chromedriver.exe run on Windows. I have a symlink /usr/bin/chromedriver pointing to chromedriver.exe. This all works out well, but the automatic driver installer may get confused. So having an option to pass the driver's executable_path explicitly to the Facebook_scraper instance would be nice! Thanks! (When I patch Initializer manually, everything works well.)
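
For illustration, something along these lines would do it (the `executable_path` keyword is the proposed addition, not part of the current API, and the page name and post count are just placeholders):

```python
from facebook_page_scraper import Facebook_scraper

# "some_page" and 10 are placeholder arguments; executable_path is the
# proposed new keyword that would bypass the automatic .install() step.
scraper = Facebook_scraper(
    "some_page",                              # page to scrape
    10,                                       # number of posts
    "chrome",                                 # browser
    executable_path="/usr/bin/chromedriver",  # proposed: explicit driver path
)
json_data = scraper.scrap_to_json()
```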

shaikhsajid1111 commented 1 year ago

That's quite an interesting environment. Yeah, it would be a great idea to let the user customize the executable path.

vadimkantorov commented 1 year ago

Also, the scraper currently does not click on "Accept all cookies" (a banner message shown in the EU), which leads to a hang. I worked around it by clicking the button manually in non-headless mode, but the problem prevents using headless mode.
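
For reference, automatically dismissing the banner would boil down to something like this in Selenium (the button text in the XPath is a guess and likely varies by locale):

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

def accept_cookies(driver, timeout=10):
    # Click the EU consent banner if it appears; if it never shows up
    # (e.g. outside the EU), time out quietly and continue scraping.
    try:
        button = WebDriverWait(driver, timeout).until(
            EC.element_to_be_clickable((
                By.XPATH,
                "//button[contains(., 'Accept all cookies')"
                " or contains(., 'Allow all cookies')]",
            ))
        )
        button.click()
    except TimeoutException:
        pass  # banner not shown; nothing to dismiss
```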

vadimkantorov commented 1 year ago

Another idea is to implement a mode that also downloads images while scraping (since image URLs can sometimes expire).

shaikhsajid1111 commented 1 year ago

Thanks for letting me know about that "Accept all cookies" modal.

Downloading images seems like a good idea, but sending HTTP requests simultaneously while crawling might also get us blocked quickly.

vadimkantorov commented 1 year ago

I guess having an option to download images to a given folder would be a good knob to have. Also, sometimes an image is a photo attached to a post, and sometimes it's just a cover image attached to a post from a link (in which case it's often on an external domain and may have no file extension before the question mark in the URL), so a distinction between the two would be good. Probably all images can just be downloaded to a single given output directory, since their names are auto-generated and unique (maybe just use the basename of the URL).

Another way could be to download the images right after crawling the posts themselves (or to have an option to choose when the photos are downloaded: together with the posts, or after the post crawl is completed).
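
A rough sketch of such a download step (all names here are illustrative, nothing is existing API), including a small delay between requests to reduce the risk of getting blocked:

```python
import os
import time
import urllib.parse
import requests

def download_images(image_urls, output_dir, delay=1.0):
    """Save every image into a single output directory, deriving the file
    name from the basename of the URL path (query string dropped)."""
    os.makedirs(output_dir, exist_ok=True)
    for url in image_urls:
        # Basenames are assumed unique; note that cover images on external
        # domains may have no file extension at all.
        name = os.path.basename(urllib.parse.urlsplit(url).path) or "image"
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        with open(os.path.join(output_dir, name), "wb") as f:
            f.write(response.content)
        time.sleep(delay)  # throttle requests to avoid getting blocked
```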

Also, having some progress report / verbose option would be nice, to be sure that the crawl isn't hung due to a problem with the page element waits (which was the case with the cookies modal).
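
Even just routing the element waits through the standard logging module would make a hang visible (a sketch, not the current code):

```python
import logging

logger = logging.getLogger("facebook_page_scraper")

def wait_verbose(wait, condition, description):
    # Log before and after each explicit wait so a stuck crawl shows
    # *which* element it is blocked on instead of hanging silently.
    logger.info("waiting for %s ...", description)
    result = wait.until(condition)
    logger.info("found %s", description)
    return result
```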

shaikhsajid1111 commented 1 year ago

Thanks, @vadimkantorov, for the fabulous ideas. I will implement them eventually; it may take some time, but they seem like great ideas.