medialab / hyphe

Websites crawler with built-in exploration and control web interface
http://hyphe.medialab.sciences-po.fr/demo/
GNU Affero General Public License v3.0

Proxy host is not a valid hostname #416

Closed g-arcas closed 3 years ago

g-arcas commented 3 years ago

It looks like it is not possible to configure Hyphe to use a proxy identified by an IP address.

Context

When setting the Proxy option to this IP, Hyphe complains:

Even when using a hostname (from the /etc/hosts file), Hyphe still does not accept the setting.

Any idea about where or why I'm doing something wrong?

Regards.

boogheta commented 3 years ago

Hello,

Indeed, it looks like the test that is run on the host when setting the proxy does not accept IP addresses.

We will fix it in an upcoming release, but in the meantime, if you want to hotfix it on your server, you can do the following: in hyphe_backend/core.tac, line 164, change:

    if '/' in options['proxy']['host'] or not is_url(options['proxy']['host'], tld_aware=True, require_protocol=False):

into:

    if '/' in options['proxy']['host'] or not (is_url(options['proxy']['host'], tld_aware=True, require_protocol=False) or urllru.special_hosts.match(options['proxy']['host'])):
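To illustrate the idea behind this kind of check, here is a minimal standalone sketch in Python 3. It is not Hyphe's actual code: the function names and the hostname pattern are hypothetical, and it only shows one way a proxy host test could accept IP addresses as well as hostnames.

```python
import ipaddress
import re

# Hypothetical hostname pattern (an assumption, not taken from Hyphe):
# dot-separated labels made of letters, digits and hyphens, where no label
# starts or ends with a hyphen and the whole name is at most 253 characters.
_HOSTNAME_RE = re.compile(
    r"^(?=.{1,253}$)(?!-)[A-Za-z0-9-]{1,63}(?<!-)"
    r"(\.(?!-)[A-Za-z0-9-]{1,63}(?<!-))*$"
)


def is_ip_address(host):
    """Return True if host parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def is_valid_proxy_host(host):
    """Accept bare IP addresses or hostnames; reject anything containing '/'."""
    if not host or "/" in host:
        return False
    return is_ip_address(host) or bool(_HOSTNAME_RE.match(host))
```

With this sketch, `is_valid_proxy_host("192.168.1.10")` and `is_valid_proxy_host("proxy.example.org")` are both accepted, while a value such as `"http://proxy.example.org"` is rejected because it contains a slash.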

g-arcas commented 3 years ago

Hello Benjamin.

Thank you for your answer!

My motivation was to be able to "plug" Hyphe into a proxy in order to archive all HTTP traffic in WARC format. The next step will be to set a Tor SOCKS proxy as the upstream proxy, in order to be able to scrape .onion websites as easily as "clean Internet" ones.

Another question: I'd like to create a whitelist of domains that Hyphe will automatically tag as "out". Is it possible to have such a list as a global one, that is, not dedicated or specific to a single Hyphe corpus?

Best regards.

Guillaume Arcas

Threat Intelligence Analyst, SEKOIA https://www.sekoia.fr/ | https://twitter.com/sekoia_fr | https://www.linkedin.com/company/sekoia/ 18-20 Place de la Madeleine, Paris



boogheta commented 3 years ago

Hi Guillaume, I must warn you that so far we have never tried or intended to plug Hyphe into Tor. I cannot guarantee that .onion URLs will run through all of Hyphe's routines properly, and I won't have much time to help you fix it if they don't. Regarding a global list of entities to set as OUT: no, such a feature does not exist in Hyphe, but it could easily be scripted using the API if you know how to code a little.

g-arcas commented 3 years ago

Hi Benjamin.

Thank you again for your answer. Don't worry, my main goal in using Hyphe is not to scrape the Tor network; that would just be a theoretical side use of Hyphe. So I won't ask for any help in this case. :-)

The need to whitelist domains is far more "important", so I'll look into how to do this through the API.

Last question (for now): I tried but failed to set up Hyphe following the manual install guidelines. It looks like the documentation was written for Ubuntu 16.04. Some required packages are now obsolete and their newest versions break the installation process. Do you know if an updated version of the documentation for newer versions of Ubuntu (I used 18.04 LTS and 20.04 LTS) is planned?

I currently use the Docker mode, but I am quite an "old-school" sysadmin who likes to have dedicated servers with the software fully installed on them.

Best regards,



On Tue, 24 Aug 2021 at 11:08, Benjamin Ooghe-Tabanou wrote:

Closed #416 (https://github.com/medialab/hyphe/issues/416) via 7353928 (https://github.com/medialab/hyphe/commit/7353928923fe6456914e4ef516c34c73493760a0).


boogheta commented 3 years ago

For the API part, you can take inspiration from the script hyphe_backend/test_client.py (https://github.com/medialab/hyphe/blob/master/hyphe_backend/test_client.py), which is a helper for calling the API directly from the shell. All API routes are documented in doc/api.md (https://github.com/medialab/hyphe/blob/master/doc/api.md). For your needs, the most relevant routes should be store.get_webentity_for_url and store.set_webentities_status.
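As a rough illustration of how these two routes could be combined into a global "OUT" list, here is a minimal Python 3 sketch. Several things in it are assumptions to check against doc/api.md before use: the API URL, the JSON-RPC payload shape, the parameter order of both routes, and the shape of the reply (in particular where the web entity id lives). The helper names jsonrpc_call, default_extract_id and mark_urls_out are hypothetical.

```python
import json
import urllib.request

# Assumption: address of the Hyphe JSON-RPC API; adjust to your own install.
API_URL = "http://localhost:6978"


def jsonrpc_call(method, params, api_url=API_URL):
    """POST one JSON-RPC request and return the decoded reply.

    The payload shape used here is an assumption, not a confirmed contract.
    """
    payload = json.dumps({"method": method, "params": params}).encode("utf-8")
    request = urllib.request.Request(
        api_url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def default_extract_id(reply):
    """Assumed reply shape: [{"code": ..., "result": {"id": ...}}]."""
    return reply[0]["result"]["id"]


def mark_urls_out(urls, corpus, call=jsonrpc_call, extract_id=default_extract_id):
    """Look up the web entity of each URL, then set them all to OUT at once."""
    ids = []
    for url in urls:
        # Assumed parameter order: url, corpus.
        reply = call("store.get_webentity_for_url", [url, corpus])
        ids.append(extract_id(reply))
    if ids:
        # Assumed parameter order: webentity_ids, status, corpus.
        call("store.set_webentities_status", [ids, "OUT", corpus])
    return ids
```

To make the list global, the same list of URLs could be passed to `mark_urls_out` once per corpus. The `call` and `extract_id` parameters are injectable so the logic can be tried with a fake transport before pointing it at a real server.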

Regarding your last question, the manual installation documentation hasn't been updated for a while because Docker is so much easier, but I recently did full manual installs myself on some servers, and you should be able to do the same by adjusting just a few things. Mainly I'd recommend:

Let me know precisely if you run into more errors and I can try to help.

g-arcas commented 3 years ago

Hello again Benjamin.

I guess we can continue in French; my technical English is not perfect, so this will avoid misunderstandings. :-)

Thanks for the information: I am not against using Docker, I just need to take the time to adjust its configuration so that data is stored persistently in a dedicated directory rather than inside the Docker environment. I will try the manual installation again following your advice and will keep you posted.

Have a good day, and thanks again for your answers!

