Closed g-arcas closed 3 years ago
Hello,
Indeed, it looks like the test that is run on the host when setting the proxy does not accept IP addresses.
We will fix it in an upcoming release, but in the meantime, if you want to hotfix it on your server, you can do the following:
in hyphe_backend/core.tac
line 164, change:
if '/' in options['proxy']['host'] or not is_url(options['proxy']['host'], tld_aware=True, require_protocol=False):
into:
if '/' in options['proxy']['host'] or not (is_url(options['proxy']['host'], tld_aware=True, require_protocol=False) or urllru.special_hosts.match(options['proxy']['host'])):
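For readers who want to see the idea behind this hotfix in isolation, here is a minimal standalone sketch (Python 3, standard library only) of a host check that accepts either a hostname or an IP address. It is not Hyphe's code: the names is_ip_address, is_valid_proxy_host and HOSTNAME_RE are hypothetical, and the hostname pattern is a deliberately loose assumption, not the regular expression behind urllru.special_hosts.

```python
import ipaddress
import re

# Loose hostname pattern (dot-separated labels of letters, digits, hyphens).
# Assumption for illustration only; Hyphe's own check relies on is_url and
# urllru.special_hosts, which are not reproduced here.
HOSTNAME_RE = re.compile(
    r"^(?=.{1,253}$)"
    r"[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
    r"(\.[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?)*$"
)


def is_ip_address(host):
    """Return True if host parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def is_valid_proxy_host(host):
    """Accept a bare hostname or IP address; reject empty values and paths."""
    if not host or "/" in host:
        return False
    return is_ip_address(host) or HOSTNAME_RE.match(host) is not None
```

The point is the same as in the patched condition: the host is rejected only if it contains a slash or matches neither the hostname test nor the IP address test.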
Hello Benjamin.
Thank you for your answer!
My motivation was to be able to "plug" Hyphe into a proxy in order to archive all HTTP traffic in WARC format. The next step will be to set a Tor SOCKS proxy as the upstream proxy, in order to scrape .onion websites as easily as "clean Internet" ones.
Another question: I'd like to create a whitelist of domains that Hyphe will automatically tag as "out". Is it possible to have such a list be global, that is, not specific to one Hyphe corpus?
Best regards.
Guillaume Arcas
Hi Guillaume,
I must warn you that so far we have never tried, or intended, to plug Hyphe into Tor. I cannot guarantee that .onion URLs will run properly through all of Hyphe's routines, and I won't have much time to help you fix it if they don't.
As for a global list of entities to set as OUT: no, such a feature does not exist in Hyphe, but it could easily be scripted using the API if you know how to code a little.
Hi Benjamin.
Thank you again for your answer. Don't worry, scraping the Tor network is not my main goal in using Hyphe; it would just be a theoretical side use. So I won't ask for any help in this case. :-)
The need to whitelist domains is far more "important", so I'll look into how to do this through the API.
Last question (for now): I tried but failed to set up Hyphe following the manual install guidelines. It looks like the documentation was written for Ubuntu 16.04. Some of the required packages are now obsolete, and their newest versions break the installation process. Do you know if an updated version of the documentation for newer versions of Ubuntu (I used 18.04 LTS and 20.04 LTS) is planned?
I currently use the Docker mode, but I am quite an "old-school" sysadmin who likes to have dedicated servers with the software fully installed on them.
Best regards,
Guillaume Arcas
Closed #416 https://github.com/medialab/hyphe/issues/416 via 7353928 https://github.com/medialab/hyphe/commit/7353928923fe6456914e4ef516c34c73493760a0 .
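For the API part, you can take inspiration from the script hyphe_backend/test_client.py https://github.com/medialab/hyphe/blob/master/hyphe_backend/test_client.py, which is a helper to do that directly from the shell. The documentation of all API routes is available here: https://github.com/medialab/hyphe/blob/master/doc/api.md For your needs, you should mostly be interested in the following routes: store.get_webentity_for_url and store.set_webentities_status
And regarding your last question, the manual documentation hasn't been updated for a while because Docker is so much easier, but I did full manual installs myself recently on some servers and you should be able to do so as well by just adjusting a few things. Mainly I'd recommend:
- use pyenv instead of virtualenvwrapper
- use a more recent version of mongodb but not more recent than version 3.0
- forget all about installing scrapyd using packages, and rather install it manually in a dedicated python env using pip, the same way the corresponding docker container works https://github.com/medialab/hyphe/blob/master/hyphe_backend/crawler/Dockerfile
Let me know precisely if you run into more errors and I can try and help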
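To make the API suggestion more concrete, here is a rough sketch of how a global "OUT" list could be applied to several corpora with the two routes named above. Only the route names come from this thread. Everything else is an assumption to check against doc/api.md and a real answer from your instance: the endpoint URL, the JSON-RPC wire format, the parameter order ([url, corpus] and [ids, status, corpus]), and the response shape handled by extract_webentity_id. The function names are hypothetical.

```python
import json
from urllib.request import Request, urlopen

# Assumption: URL of your Hyphe instance's JSON-RPC endpoint; adjust as needed.
API_URL = "http://localhost/api/"


def http_call(method, params, api_url=API_URL):
    """POST one JSON-RPC style request and return the decoded JSON answer.

    The exact wire format is an assumption; compare with test_client.py.
    """
    body = json.dumps({"method": method, "params": params}).encode("utf-8")
    request = Request(api_url, data=body,
                      headers={"Content-Type": "application/json"})
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_webentity_id(answer):
    """Pull a web entity id out of an answer, or return None.

    Assumed shape: {"code": "success", "result": {"_id": ...}} (or "id"),
    possibly wrapped in a one-element list.
    """
    if isinstance(answer, list) and answer:
        answer = answer[0]
    if not isinstance(answer, dict) or answer.get("code") != "success":
        return None
    result = answer.get("result")
    if not isinstance(result, dict):
        return None
    return result.get("_id", result.get("id"))


def mark_urls_out(urls, corpora, call=http_call,
                  extract_id=extract_webentity_id):
    """Set every web entity matching one of the urls to OUT in each corpus.

    Returns a dict mapping each corpus to the ids that were sent.
    URLs with no matching web entity are skipped.
    """
    marked = {}
    for corpus in corpora:
        ids = []
        for url in urls:
            answer = call("store.get_webentity_for_url", [url, corpus])
            webentity_id = extract_id(answer)
            if webentity_id is not None:
                ids.append(webentity_id)
        if ids:
            # Assumed parameter order: ids, status, corpus.
            call("store.set_webentities_status", [ids, "OUT", corpus])
        marked[corpus] = ids
    return marked
```

With a list of URLs kept in one file outside any corpus, something like mark_urls_out(urls, ["corpus-a", "corpus-b"]) could then be run after each crawl. The transport is passed in as the call argument so the logic can be tried against a fake function before pointing it at a real server.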
Hello again Benjamin.
I guess we can continue in French: my technical English is not perfect, so this will avoid misunderstandings. :-)
Thanks for the info: I have nothing against using Docker, I just need to take the time to adjust the configuration so that data is stored persistently in a dedicated directory rather than inside the Docker environment. I will try the manual installation again following your advice and keep you posted.
Have a good day, and thanks again for your answers!
Guillaume Arcas
It looks like it is not possible to configure Hyphe to use a proxy identified by an IP address.
Context
When setting the Proxy option to this IP, Hyphe complains:
Even when using the hostname (from the /etc/hosts file), Hyphe still does not accept the setting.
Any idea what I am doing wrong, and where?
Regards.