mementoweb / py-memento-client

A Memento Client Library in Python

webcite : cannot parse XML #5

Closed jayvdb closed 8 years ago

jayvdb commented 8 years ago

WebCite is failing. I appreciate that this project probably isn't to blame, and have raised https://github.com/mementoweb/timegate/issues/4 for where I suspect the problem lies. So this is an #upstream tracker, and I am not 100% sure where the upstream repo is. ;-)

This problem is causing https://github.com/wikimedia/pywikibot-core tests to fail.

shawnmjones commented 8 years ago

Thanks for the heads up. We are trying to narrow down the location of the issue.

shawnmjones commented 8 years ago

We believe we have corrected the issue with WebCite. Is the problem corrected?

jayvdb commented 8 years ago

http://timetravel.mementoweb.org/webcite/timegate/ and http://delorean.lanl.gov/tg/webcite/timegate/ and http://labs.mementoweb.org/webcite/timegate/ all still report 404 Cannot parse XML

(and I still see errors in new builds)

shawnmjones commented 8 years ago

We have discovered that WebCite has some sort of rate limiting that causes connections to that site to fail. Attempts to discuss the issue with them have met with no response. Our tests indicate that this TimeGate does function, but occasionally WebCite produces a timeout for the connection.

I will provide more information tomorrow.

shawnmjones commented 8 years ago

We have explored why this is working intermittently for WebCite. For cached URI-Rs, the TimeGate works fine until they expire. For new URI-Rs, the result from WebCite could be rate-limited or not, depending on how many others are using Memento at the time and depending on how many of those requests get routed to WebCite's TimeGate.

At the current moment, the WebCite TimeGate is responding for URI-Rs. Do you have any examples of URI-Rs that are not working?

curl --head -L -H "Accept-Datetime: Thu, 20 Mar 2003 19:54:39 GMT" http://timetravel.mementoweb.org/webcite/timegate/http://www.cs.odu.edu
HTTP/1.1 302 Found
Server: nginx/1.8.0
Content-Type: text/plain; charset=UTF-8
Content-Length: 0
Date: Wed, 09 Mar 2016 00:14:21 GMT
Vary: accept-datetime
Location: http://www.webcitation.org/64ta04WpM
Link: <http://www.cs.odu.edu>; rel="original", <http://timetravel.mementoweb.org/webcite/timemap/link/http://www.cs.odu.edu>; rel="timemap"; type="application/link-format", <http://timetravel.mementoweb.org/webcite/timemap/json/http://www.cs.odu.edu>; rel="timemap"; type="application/json", <http://www.webcitation.org/64ta04WpM>; rel="first memento"; datetime="Mon, 23 Jan 2012 02:01:29 GMT", <http://www.webcitation.org/6ez6rJWMq>; rel="last memento"; datetime="Tue, 02 Feb 2016 01:25:20 GMT"
X-Cache: MISS from proxyout.lanl.gov
X-Cache-Lookup: MISS from proxyout.lanl.gov:8080
Via: 1.1 proxyout.lanl.gov (squid)
Connection: keep-alive

HTTP/1.1 200 OK
Date: Wed, 09 Mar 2016 00:14:21 GMT
Server: Apache/2.0.63 (Unix) mod_ssl/2.0.63 OpenSSL/0.9.8e-fips-rhel5 mod_auth_passthrough/2.1 mod_bwlimited/1.4 PHP/5.2.9
X-Powered-By: PHP/5.2.9
Set-Cookie: PHPSESSID=867c2364a1841db17695d23de60fed1d; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Content-Type: text/html
X-Cache: MISS from proxyout.lanl.gov
X-Cache-Lookup: MISS from proxyout.lanl.gov:8080
Via: 1.1 proxyout.lanl.gov (squid)
Connection: keep-alive
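
The `Link` header in the 302 response above carries the original resource, the TimeMaps, and the first and last mementos. As a rough sketch of how a client might pull those apart, the Python below uses a simplified regular expression rather than a full Link-header parser; the names `parse_link_header` and `find_rel` are made up for illustration and are not part of py-memento-client.

```python
import re

# Simplified patterns: a <uri> followed by zero or more ;name="value" parameters.
# This is an illustrative shortcut, not a complete Link header grammar.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_link_header(value):
    """Return a list of (uri, params) tuples from a Link header value."""
    return [(uri, dict(PARAM_RE.findall(params)))
            for uri, params in LINK_RE.findall(value)]


def find_rel(links, rel):
    """Return the first (uri, params) whose rel contains the given token."""
    for uri, params in links:
        if rel in params.get("rel", "").split():
            return uri, params
    return None


header = ('<http://www.cs.odu.edu>; rel="original", '
          '<http://www.webcitation.org/64ta04WpM>; rel="first memento"; '
          'datetime="Mon, 23 Jan 2012 02:01:29 GMT"')
links = parse_link_header(header)
first = find_rel(links, "first")
```

Matching on quoted parameter values, instead of splitting on commas, matters here because the `datetime` values themselves contain commas.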

I'm a bit confused by the use of the http://timetravel.mementoweb.org/webcite/timegate/ URI itself, as it is a prefix to be used for specific URI-Rs. For example http://timetravel.mementoweb.org/webcite/timegate/http://www.cnn.com. If a user just sends a request to this prefix, they would normally receive a "400 Bad Request" because there is no URI-R attached to the end. Where is this getting displayed?

Also, do you mind sharing the test output or providing a URI to the job so I can review it?

Thanks.

jayvdb commented 8 years ago

Sure, our CI builds are at https://travis-ci.org/wikimedia/pywikibot-core/builds

For each build, the even-numbered jobs install and use py-memento-client. Job x.2 currently has a recent breakage, so you can see this problem in job x.4.

e.g. https://travis-ci.org/wikimedia/pywikibot-core/jobs/114704574#L3897

The test code is at https://github.com/wikimedia/pywikibot-core/blob/master/tests/weblinkchecker_tests.py#L62

Our test suite uses a hostname variable (actually a URL) in each test class to determine whether a service is alive, and skips the test class if it is down.

So the test class is trying to verify that 'http://timetravel.mementoweb.org/webcite/timegate/' is alive, and currently that URL returns a 404 (not a 400). It definitely did not do this in November 2015. If it helps, I can find which build (and thus date) the 404s started appearing.

shawnmjones commented 8 years ago

We have reviewed the code for our TimeGate proxy for WebCite. It now returns a 400.

It would be nice to know which date the 404s started happening. What status code was being returned prior?

shawnmjones commented 8 years ago

I have successfully reproduced the result locally using the pywikibot-core code.

shawnmjones commented 8 years ago

Okay, after adding some additional archives to a local copy of weblinkchecker_tests.py, I think I've determined where the issue lies.

There are two components to the URI-G issued as part of a TimeGate request: what we have been referring to as "the TimeGate URI" and the "original resource URI" (URI-R). To issue the request, one concatenates these two to each other, like so: http://web.archive.org/web/http://www.cnn.com

This is how we ensure that TimeGates are functioning with our daily Memento infrastructure tests.
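
As a minimal sketch of that concatenation, the Python below builds the URI-G and an `Accept-Datetime` header without sending anything over the network. The function name `build_timegate_request` is hypothetical and is not taken from py-memento-client; the prefix and URI-R are the ones from the curl example earlier in this thread.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def build_timegate_request(timegate_prefix, uri_r, when=None):
    """Return (uri_g, headers) for a TimeGate request.

    timegate_prefix: the TimeGate URI, ending in '/'.
    uri_r: the original resource URI to append.
    when: optional timezone-aware UTC datetime for Accept-Datetime.
    """
    if not uri_r:
        # A bare prefix is an incomplete request as far as a proxy is concerned.
        raise ValueError("a URI-R must be appended to the TimeGate URI")
    uri_g = timegate_prefix + uri_r
    headers = {}
    if when is not None:
        # usegmt=True renders the HTTP-style date ending in 'GMT'.
        headers["Accept-Datetime"] = format_datetime(when, usegmt=True)
    return uri_g, headers


uri_g, headers = build_timegate_request(
    "http://timetravel.mementoweb.org/webcite/timegate/",
    "http://www.cs.odu.edu",
    datetime(2003, 3, 20, 19, 54, 39, tzinfo=timezone.utc),
)
```

A liveness check that requests `uri_g` rather than the bare prefix exercises the TimeGate the same way a real client would.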

There are Memento-compliant TimeGates running OpenWayback, for places like the Internet Archive, Bibliotheca Alexandrina, UK Web Archive, and the Icelandic Web Archive. These all respond with a 200 if you try to hit their TimeGate without a URI-R because their OpenWayback installations return a user-friendly page indicating how to search for a memento.

Then there are Memento-noncompliant archives for whom we have created Memento proxies, such as the Slovenian Archive and WebCite. These proxies should return a 400 if the URI-R is not encountered because it is an incomplete request (missing the appended URI-R) as far as they are concerned.

I have confirmed that adding 400 to the list on line 505 of aspects.py will currently fix the problem for noncompliant archives (like WebCite) and still allow you to catch 404s, 500s, etc.

The big question is: why did the WebCite test ever pass in the first place?

jayvdb commented 8 years ago

http://timetravel.mementoweb.org/webcite/timegate/ now returns a 400 instead of a 404, with a message

Service request does not contain '/timemap/' or '/timegate/'

That change occurred on our Travis build number 3310.10, on March 10. For future reference, do you know what commit in which repo made that change from 404 to 400?

So, yes, now we could add 400 to our list of 'acceptable' responses in aspects.py. I am not sure that is a good idea, as it would mean other 400 errors would also be ignored, and a 400 is too often a symptom of a real problem. We could certainly improve the error code management, so that 400 is acceptable only for the WebCite test class.

But first I will try to find when the WebCite test last passed.
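
One possible shape for that per-class handling is sketched below. Everything in it is an assumption for illustration: the class names, the `extra_ok_statuses` attribute, and the default status list are invented here and do not reflect the actual contents of aspects.py.

```python
# Hypothetical default set of statuses treated as "host is alive".
DEFAULT_OK_STATUSES = frozenset({200, 301, 302, 303, 307, 308})


class HostCheck:
    """Hypothetical base class: decides whether a status means 'alive'."""

    hostname = None
    # Subclasses may widen the accepted set without affecting other classes.
    extra_ok_statuses = frozenset()

    @classmethod
    def is_alive(cls, status):
        return status in DEFAULT_OK_STATUSES | cls.extra_ok_statuses


class WebCiteCheck(HostCheck):
    """Proxy TimeGate: a bare prefix is expected to answer 400."""

    hostname = "http://timetravel.mementoweb.org/webcite/timegate/"
    extra_ok_statuses = frozenset({400})
```

With this layout a 400 counts as alive only for `WebCiteCheck`, while a 404 or 500 still fails for every class.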

jayvdb commented 8 years ago

After downloading lots of build logs...

2858.2 ( https://github.com/wikimedia/pywikibot-core/commit/afdd8f9e72 ) is the last build to pass this test

2859.2 ( https://github.com/wikimedia/pywikibot-core/commit/e9d14578cf ) through 2865.2 ( https://github.com/wikimedia/pywikibot-core/commit/5795ed5b8 ) skipped the test with "HTTPConnectionPool(host='timetravel.mementoweb.org', port=80): Read timed out. (read timeout=30)"

2866.2 ( https://github.com/wikimedia/pywikibot-core/commit/25980447cf45 ) is when it started failing with HTTP status: 404, but that is irrelevant.

2859.2 ( https://github.com/wikimedia/pywikibot-core/commit/e9d14578cf ) is where the relevant change occurred. It merged https://github.com/wikimedia/pywikibot-core/commit/0e996577, by @xZise , which added the hostname checking to the MementoTestBase class.

So in essence, the HTTP check against 'http://timetravel.mementoweb.org/webcite/timegate/' has been failing for as long as our test suite has been doing it. There is a small oddity in that it was giving timeouts and then switched to issuing 404 errors, but it is now using a sensible 400 error, which we can consider to be acceptable. However, an easier way is to give the test class a URL that should return a valid resource, e.g. https://gerrit.wikimedia.org/r/#/c/277213/ (which passed on Travis with a slightly different patch that had a different commit message: https://travis-ci.org/jayvdb/pywikibot-core/jobs/115810668)