medialab / hyphe

Websites crawler with built-in exploration and control web interface
http://hyphe.medialab.sciences-po.fr/demo/
GNU Affero General Public License v3.0

Force specific User Agent per crawl #461

Closed paulgirard closed 11 months ago

paulgirard commented 2 years ago

Hyphe gets a random user agent for each crawl task from a web service. For some websites, one might need to fix the user agent used by the crawler. For instance, a website protected by Cloudflare requires a cookie that is only valid for the user agent used to generate it. Therefore, for such websites, one needs to set both the cookie and the matching user agent for the crawl.

So far, setting the cookie is possible, but setting the user agent is not. One enhancement would be to add this parameter per crawl, the same way as the cookie. A user agent set at the crawl level would take precedence over the automatic random mechanism.
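
To make the proposed precedence rule concrete, here is a minimal sketch in Python. It is not Hyphe's actual code: the option names `user_agent` and `cookies`, the helper names, and the fallback pool are all hypothetical, and the real crawler fetches its random user agents from a web service rather than from a hard-coded list.

```python
import random

# Hypothetical stand-in for the list Hyphe obtains from its web service.
FALLBACK_USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
]


def pick_user_agent(crawl_options, pool=FALLBACK_USER_AGENTS):
    """Return the user agent for one crawl task.

    A non-empty "user_agent" crawl option (hypothetical name) takes
    precedence; otherwise a random one is drawn from the pool.
    """
    forced = (crawl_options.get("user_agent") or "").strip()
    if forced:
        return forced
    return random.choice(pool)


def build_headers(crawl_options):
    """Build request headers so the cookie and the user agent stay paired."""
    headers = {"User-Agent": pick_user_agent(crawl_options)}
    cookie = crawl_options.get("cookies")  # hypothetical option name
    if cookie:
        headers["Cookie"] = cookie
    return headers
```

With this shape, a crawl that supplies both a Cloudflare cookie and the user agent that generated it sends them together, while crawls that set neither keep the current random behaviour.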