oduwsdl / CarbonDate

Estimating the age of web resources
MIT License

InvalidSchema? #3

Closed shawnmjones closed 7 years ago

shawnmjones commented 7 years ago

With the latest commit (e41e3a81e84bfed093768451f5a68e10368fbae2), when running the Docker instance of CarbonDate in local mode, as per http://ws-dl.blogspot.com/2016/09/2016-09-20-carbon-dating-web-version-30.html, I occasionally get an exception like the following.

# sudo docker run --rm -it carbon ./main.py -l search http://www.google.com
cdGetBitly.py::GetBitlyJson(), please set bitly access token in config
(<class 'requests.exceptions.InvalidSchema'>, InvalidSchema("No connection adapters were found for 'hive.org.uk/wayback/archive/20080304103855/http://www.google.com/'",), <traceback object at 0x7f6baa816088>)
Traceback (most recent call last):
  File "/usr/src/app/modules/cdGetArchives.py", line 135, in getArchives
    date = getRealDate(archives[archive]["link"],archives[archive]["time"])
  File "/usr/src/app/modules/cdGetArchives.py", line 85, in getRealDate
    response = requests.get(url,headers=headers)
  File "/usr/local/lib/python3.5/site-packages/requests/api.py", line 70, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/requests/api.py", line 56, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/requests/sessions.py", line 475, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.5/site-packages/requests/sessions.py", line 590, in send
    adapter = self.get_adapter(url=request.url)
  File "/usr/local/lib/python3.5/site-packages/requests/sessions.py", line 672, in get_adapter
    raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'hive.org.uk/wayback/archive/20080304103855/http://www.google.com/'
runtime in seconds:  13
{
  "URI": "http://www.google.com",
  "Estimated Creation Date": "2003-01-14T00:00:00",
  "Bitly.com": "",
  "Google.com": "2003-01-14T00:00:00",
  "Bing.com": "",
  "Pubdate tag": "",
  "Last Modified": "",
  "Archives": [
    [
      "Earliest",
      ""
    ],
    [
      "By_Archive",
      {}
    ]
  ],
  "Twitter.com": "2006-04-13T02:58:51",
  "Backlinks": ""
}
ibnesayeed commented 7 years ago

As far as I remember, there is a -e flag to exclude modules. This exception is happening because you did not provide a Bitly key and did not exclude that module. However, I do agree that this needs better documentation in the README, and that the exception should be caught and handled gracefully, with a friendlier message or output on STDERR.

/cc @DarkAngelZT
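A minimal sketch of the graceful handling suggested above: look up each archive independently, so that one failing lookup is reported on STDERR instead of aborting the rest. The names `collect_archive_dates` and `get_date` are hypothetical; `get_date` stands in for the `getRealDate` call seen in the traceback, and the shape of `archives` (entries with `"link"` and `"time"` keys) is inferred from that traceback, not from CarbonDate's actual source.

```python
import sys


def collect_archive_dates(archives, get_date):
    """Return {memento link: date}, skipping archives whose lookup fails.

    `archives` maps an archive name to a dict with "link" and "time" keys.
    `get_date` is a hypothetical stand-in for getRealDate(link, time).
    """
    results = {}
    for name, entry in archives.items():
        link = entry["link"]
        try:
            results[link] = get_date(link, entry["time"])
        except Exception as exc:  # deliberately broad for this sketch
            # Report the failure on STDERR and continue with the next archive.
            print(
                "warning: skipping archive {} ({}): {}".format(name, link, exc),
                file=sys.stderr,
            )
    return results
```

In real code the `except` clause would likely be narrowed, for example to `requests.exceptions.RequestException`, which is the base class of `InvalidSchema`.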

shawnmjones commented 7 years ago

Well, I would agree, but I always get the "please set bitly access token in config" Bitly message, and do not always get this exception.

# sudo docker run --rm -it carbon ./main.py -l search http://www.cs.odu.edu
cdGetBitly.py::GetBitlyJson(), please set bitly access token in config
runtime in seconds:  8
{
  "URI": "http://www.cs.odu.edu",
  "Estimated Creation Date": "1997-03-24T17:29:34",
  "Pubdate tag": "",
  "Archives": [
    [
      "Earliest",
      "1997-03-24T17:29:34"
    ],
    [
      "By_Archive",
      {
        "http://web.archive.bibalex.org:80/web/20010414022512/http://www.cs.odu.edu/": "2001-03-23T14:55:45",
        "http://arquivo.pt/wayback/20091223043049/http://www.cs.odu.edu/": "2009-12-23T04:30:50",
        "http://web.archive.org/web/19971010201632/http://www.cs.odu.edu/": "1997-03-24T17:29:34",
        "http://archive.is/19970606105039/http://www.cs.odu.edu/": "1997-06-06T06:50:39",
        "http://webcitation.org/query?id=1327284086752784": "2012-01-22T21:01:29"
      }
    ]
  ],
  "Bitly.com": "",
  "Backlinks": "",
  "Last Modified": "",
  "Twitter.com": "2008-12-01T08:53:27",
  "Google.com": "2015-06-02T00:00:00",
  "Bing.com": ""
}

Instead, as the requests.exceptions.InvalidSchema exception raised by the Python requests module indicates, the problem appears to be that something along the way discovered a memento at the URI hive.org.uk/wayback/archive/20080304103855/http://www.google.com/. This URI has no scheme (no "http://", "https://", etc.), which causes the requests.get call on line 85 of modules/cdGetArchives.py to fail.

ibnesayeed commented 7 years ago

Well, that means some sanity check (and fix) needs to be put in place before making the requests call.
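One way such a sanity check could look, as a sketch only: prepend a scheme when the memento URI does not start with one. The helper name `ensure_scheme` is hypothetical, and defaulting to `http` is an assumption; this is not the fix that was actually committed to CarbonDate.

```python
import re

# A URI "has a scheme" here if it starts with e.g. "http://" or "https://".
_SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def ensure_scheme(uri, default_scheme="http"):
    """Return `uri` unchanged if it starts with a scheme, else prepend one.

    `default_scheme` is an assumption: the archive may really serve https.
    """
    uri = uri.strip()
    if _SCHEME_RE.match(uri):
        return uri
    # Also handles protocol-relative URIs such as "//host/path".
    return "{}://{}".format(default_scheme, uri.lstrip("/"))
```

Applied to the URI from the traceback, `ensure_scheme("hive.org.uk/wayback/archive/20080304103855/http://www.google.com/")` returns the same string prefixed with `http://`, which requests can then match to a connection adapter. The embedded `http://` later in the path does not count as a scheme because the check is anchored at the start of the string.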