osome-iu / hoaxy-backend

Backend component for Hoaxy, a tool to visualize the spread of claims and fact checking
http://hoaxy.iuni.iu.edu/
GNU General Public License v3.0

hoaxy init on afp.com fails #1

Closed fhamborg closed 7 years ago

fhamborg commented 7 years ago

When running hoaxy init with a domains_factchecking.txt that contains the following line

www.afp.com

I get the following error

(hoaxy) hoaxyuser@hoaxydeback:/root$ hoaxy init
2017-07-04 11:06:34,155 - hoaxy(init) - INFO: Creating database tables:
2017-07-04 11:06:34,155 - hoaxy(init) - WARNING: Ignore existed tables
2017-07-04 11:06:34,182 - hoaxy(init) - INFO: Inserting platforms if not exist
2017-07-04 11:06:34,215 - hoaxy(init) - INFO: Trying to load site data:
2017-07-04 11:06:34,215 - hoaxy(init) - INFO: Claim domains /home/hoaxyuser/.hoaxy/domains_claim.txt found
2017-07-04 11:06:34,215 - hoaxy(init) - INFO: Sending HTTP requests to infer base URLs ...
2017-07-04 11:06:42,714 - hoaxy(init) - INFO: Fact checking domains /home/hoaxyuser/.hoaxy/domains_factchecking.txt found
2017-07-04 11:06:42,714 - hoaxy(init) - INFO: Sending HTTP requests to infer base URLs ...
2017-07-04 11:06:44,110 - hoaxy(init) - ERROR: HTTPConnectionPool(host='afp.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fa71187c490>: Failed to establish a new connection: [Errno -5] No address associated with hostname',))
2017-07-04 11:06:44,114 - hoaxy(init) - ERROR: HTTPSConnectionPool(host='afp.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fa711897c50>: Failed to establish a new connection: [Errno -5] No address associated with hostname',))
2017-07-04 11:06:44,210 - hoaxy(init) - WARNING: line 18 'www.afp.com', domain inactive!
2017-07-04 11:06:44,210 - hoaxy(init) - ERROR: Please fix the warnings or errors above! Edit domains, or use --ignore-redirected to handle redirected domains', or Use --ignore-inactive or --force-inactive  to handle inactive domains

However, when visiting the domain in my browser, everything seems to work fine.

A second issue is that afp.com is actually publishing in English, but we would like to access the German version that is available at afp.com/de. However, hoaxy only accepts domains and not URLs. Any workaround for that?

shaochengcheng commented 7 years ago

Hi Felix,

About the failure on the domain afp.com

This site is a little odd: it is reachable at www.afp.com, but the bare domain afp.com does not resolve. Here is the ping output:

ping afp.com
ping: unknown host afp.com

ping www.afp.com
PING e10157.e12.akamaiedge.net (23.194.98.182) 56(84) bytes of data.
64 bytes from a23-194-98-182.deploy.static.akamaitechnologies.com (23.194.98.182): icmp_seq=1 ttl=56 time=7.41 ms

When hoaxy reads the domain list, it strips the www. prefix, so www.afp.com is treated as afp.com, which is why the requests in your log go to afp.com and fail. There are several ways to work around this. One is to load the site from a YAML file; see the sample file sites.sample.yaml. Here is an example for this site:

### afp.com
  # required, name of site
- name: afp.com
  # required, primary domain of the site
  domain: afp.com
  # required, type of this site (for example, whether it is a fact checking site)
  site_type: YOUR SITE TYPE
  # base URL, using www.afp.com
  base_url: http://www.afp.com/
  # site tags, default [], more about this site
...

Please check https://github.com/IUNetSci/hoaxy-backend/blob/master/hoaxy/data/manuals/sites.readme.md
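To see why the base URL has to be given explicitly, here is a minimal Python sketch of the behaviour described above. It is not hoaxy's actual code: `strip_www` and `resolves` are hypothetical helpers, and the `resolver` parameter exists only so the check can be exercised without network access.

```python
import socket


def strip_www(domain):
    """Drop a leading 'www.' (hypothetical helper mirroring the described behaviour)."""
    domain = domain.strip().lower()
    return domain[4:] if domain.startswith("www.") else domain


def resolves(host, resolver=socket.getaddrinfo):
    """Return True if a DNS lookup for `host` succeeds.

    `resolver` defaults to socket.getaddrinfo and can be swapped
    for a stub when testing offline.
    """
    try:
        resolver(host, 80)
        return True
    except socket.gaierror:
        # Raised when the name has no address record, as in the ping output above
        return False
```

Calling `resolves("www.afp.com")` and `resolves(strip_www("www.afp.com"))` on a networked machine should reproduce the difference seen with ping, assuming the DNS records are still as they were at the time of this issue.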

Another way is to alter the database. When loading the domain list, use --force-inactive to force this site to load, and then update the site table. The SQL command could be:

UPDATE site
SET base_url='http://www.afp.com/', is_alive=True
WHERE domain LIKE 'afp.com';
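If you prefer to run the update from a script, the same statement can be issued with bound parameters. The sketch below uses an in-memory SQLite table purely so that it is self-contained; the three-column `site` table and the seeded row are assumptions for illustration, not hoaxy's real schema, and `fix_base_url` is a hypothetical helper. Against the real database you would use your own connection and driver.

```python
import sqlite3

# Hypothetical, minimal stand-in for the `site` table (the real one has more columns)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE site (domain TEXT PRIMARY KEY, base_url TEXT, is_alive BOOLEAN)"
)
# Assumed state after a forced load: the row exists but is marked inactive
conn.execute(
    "INSERT INTO site VALUES (?, ?, ?)", ("afp.com", "http://afp.com/", False)
)


def fix_base_url(conn, domain, base_url):
    """Point `domain` at a working base URL and mark it alive; return rows changed."""
    cur = conn.execute(
        "UPDATE site SET base_url = ?, is_alive = ? WHERE domain = ?",
        (base_url, True, domain),
    )
    conn.commit()
    return cur.rowcount


updated = fix_base_url(conn, "afp.com", "http://www.afp.com/")
row = conn.execute(
    "SELECT base_url, is_alive FROM site WHERE domain = ?", ("afp.com",)
).fetchone()
```

Checking the returned row count is a cheap way to catch a typo in the domain, since an UPDATE that matches nothing succeeds silently.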

About afp.com/de

Sorry to say, hoaxy currently cannot track a site based on URLs. Maybe in the future we can provide some kind of filter hook. Right now, what you can do is use the domain afp.com to track all related URLs (this will, of course, include those under afp.com/de).
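Until such a hook exists, one option is to filter the collected URLs yourself after the fact. The following is a minimal sketch of that idea in Python; `under_path` is a hypothetical helper, not part of hoaxy, and it assumes you have the tracked URLs available as plain strings.

```python
from urllib.parse import urlsplit


def under_path(url, domain, prefix):
    """Return True if `url` belongs to `domain` and its path lies under `prefix`."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Treat www.afp.com and afp.com as the same site
    if host.startswith("www."):
        host = host[4:]
    if host != domain:
        return False
    path = parts.path.rstrip("/")
    prefix = "/" + prefix.strip("/")
    # Match the section itself or anything below it, but not e.g. /design
    return path == prefix or path.startswith(prefix + "/")
```

For example, `under_path("http://www.afp.com/de/news/1", "afp.com", "de")` is true, while a URL under `/en/` on the same site is rejected.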

Thanks

glciampaglia commented 7 years ago

Closing issue for now; @fhamborg, feel free to reopen if you have any follow-up questions for Chengcheng. Thanks!