Scraping a web page that uses a self-signed certificate results in an empty page being scraped

meilisearch / docs-scraper

Scrape documentation into Meilisearch

https://www.meilisearch.com

Scraping a web page that uses a self-signed certificate results in an empty page being scraped #427

Closed. frauniki closed 1 year ago

frauniki commented 1 year ago

When using Selenium and headless Chrome to scrape a web page that uses a self-signed certificate, headless Chrome returns the following static HTML:

<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body></body></html>

The page generated from this HTML is in fact empty, but because the status code is 200, no error is raised and the scrape is treated as successful.
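
Since the status code alone does not reveal the problem, one way to notice it is to inspect the returned markup. The following is a minimal sketch, not part of docs-scraper: the helper name `is_blank_page` is hypothetical, and it assumes the input is a full document with a `<body>` element, as in the HTML quoted above.

```python
from html.parser import HTMLParser


class _BodyContentProbe(HTMLParser):
    """Records whether <body> contains any element or non-whitespace text."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.has_content = False

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif self.in_body:
            # Any element nested inside <body> counts as content.
            self.has_content = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        # Whitespace-only text nodes are ignored.
        if self.in_body and data.strip():
            self.has_content = True


def is_blank_page(page_source: str) -> bool:
    """Hypothetical helper: True if <body> has no elements and no text.

    Assumption: page_source is a whole document containing a <body> tag;
    a fragment without <body> is also reported as blank.
    """
    probe = _BodyContentProbe()
    probe.feed(page_source)
    probe.close()
    return not probe.has_content
```

A scraper could call such a check on the page source after loading a URL and log a warning instead of silently indexing an empty document.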

I would like to add an option so that web pages using self-signed certificates can also be scraped successfully :)

brunoocasali commented 1 year ago

It may be because you're hitting Chrome's privacy error wall, no?

Have you tried something like this: https://stackoverflow.com/a/60250587/2649707 ?

sanders41 commented 1 year ago

I may be wrong, but I think the proposal here is to add a config value that lets you pass the options discussed in that Stack Overflow post when running the scraper. Right now, I think you would have to fork the scraper and do a custom build to do this?
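
To make the idea of a config value concrete, here is a rough sketch of how such an option might be wired into a Selenium setup. It is an illustration under assumptions, not docs-scraper's actual code: the config key `allow_insecure_certs` and both function names are hypothetical, and `--ignore-certificate-errors` is a Chrome command-line switch commonly used for this purpose, which may or may not be what the linked answer recommends.

```python
from typing import Any, Dict, List

# Hypothetical config key; nothing in this thread says docs-scraper defines it.
ALLOW_INSECURE_KEY = "allow_insecure_certs"


def chrome_arguments(config: Dict[str, Any]) -> List[str]:
    """Return the Chrome command-line flags for a given scraper config."""
    args = ["--headless"]
    if config.get(ALLOW_INSECURE_KEY, False):
        # Chrome switch that bypasses the certificate error page,
        # so self-signed sites render instead of showing the warning.
        args.append("--ignore-certificate-errors")
    return args


def build_driver(config: Dict[str, Any]):
    """Create a headless Chrome driver.

    Requires the selenium package and a local Chrome install.
    """
    # Imported lazily so chrome_arguments() works without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in chrome_arguments(config):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

Keeping the flag behind an opt-in key that defaults to off means certificate checks stay enabled unless a user explicitly disables them for a trusted internal site.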

brunoocasali commented 1 year ago

Oh, I see @sanders41. Then yes, you would have to fork it and add it yourself.

Unfortunately, @frauniki, we don't have enough resources to introduce new features into this software. If you want to make a PR, I would happily review it.

Thanks a lot for using Meilisearch ❤️