More efficient timegate query option

mementoweb / py-memento-client

A Memento Client Library in Python

More efficient timegate query option #2

Open ikreymer opened 9 years ago

ikreymer commented 9 years ago

Great to see this library, I was just exploring using the get_memento_info for Netcapsule or Reconstruct and ran into the following issue.

While the get_memento_info is a great general purpose function, I think there needs to be a more optimized option that performs just the basic "Get memento from TimeGate at specified datetime" and nothing else. Ideally, this means having a single HTTP request to the TimeGate to get the desired info.

I propose adding an extra param, include_uri_checks, which will default to True, but when set to False, will disable any of the following additional checks.

Currently, the get_memento_info also includes the following:

A get_original_uri call queries the original URL to determine whether it's a URL-R or a URL-M. This is unnecessary (and potentially slow) if the user knows that a URL-R is being passed in. With include_uri_checks disabled, this test will be skipped.
Redirects are enabled on the TimeGate HEAD request, potentially triggering many redirects. This can be disabled to ensure just one request to the TimeGate. With include_uri_checks disabled, the value of the Location or Content-Location header will be used instead.
Redirects are followed with a HEAD request to the URL-M to get the Memento-Datetime header. This also adds overhead. The datetime is usually already available in the Link header of the TimeGate response, in the rel=memento entry. (As a side note, I was surprised to find that this is not required, though luckily it is present in most implementations. It seems like the TimeGate should always return the datetime in the Link header, rather than force the user to make another request to the URL-M.)
With this option, automatic redirecting, e.g. http://lanl.gov -> http://www.lanl.gov/worldview/, is disabled, as the user's browser will follow the 302 explicitly and the redirect will be visible to the user.
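As a rough illustration of the single-request lookup described in the list above, here is a minimal sketch that uses only the Python standard library rather than this library's code. The names accept_datetime_header, request_target and query_timegate_once are hypothetical, and treating a naive datetime as UTC is an assumption. http.client does not follow redirects on its own, so exactly one HEAD request reaches the TimeGate.

```python
import http.client
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.parse import urlsplit


def accept_datetime_header(dt):
    """Format a datetime as an RFC 1123 date in GMT for Accept-Datetime."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)


def request_target(url):
    """Split a URL into (scheme, host, request target) for http.client."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return parts.scheme, parts.netloc, target


def query_timegate_once(timegate_base, uri_r, dt, timeout=10):
    """Send one HEAD request to the TimeGate; never follow redirects."""
    scheme, host, target = request_target(timegate_base + uri_r)
    if scheme == "https":
        conn = http.client.HTTPSConnection(host, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("HEAD", target,
                     headers={"Accept-Datetime": accept_datetime_header(dt)})
        resp = conn.getresponse()
        # Repeated header names collapse to the last value in this dict.
        return resp.status, {k: v for k, v in resp.getheaders()}
    finally:
        conn.close()
```

A caller would then read Location (or Content-Location) and Link from the returned headers and decide for itself whether to dereference the URL-M.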

For now, I've called this property include_uri_checks but can be changed to something else. If better to make a separate function, that can work as well.

With this option disabled, Reconstruct and Netcapsule can start using this API, rather than relying on the existing Memento JSON API that's only on the aggregator.

I've also included a simple test file which demonstrates these changes.

Let me know if there are any questions/thoughts.

ikreymer commented 9 years ago

Here is an example for the use case, based on one of the test cases.

import datetime
from memento_client import MementoClient

MementoClient('https://web.archive.org/web/',
    check_native_timegate=False).get_memento_info(
    'http://www.lanl.gov',
    datetime.datetime(2003, 3, 19, 23, 59, 59))

results in the following traffic:

HEAD http://www.lanl.gov/ - 200
HEAD https://web.archive.org/web/http://www.lanl.gov - 302
HEAD https://web.archive.org/web/20030320195439/http://www.lanl.gov/worldview - 302
HEAD https://web.archive.org/web/20030401090900/http://www.lanl.gov/worldview - 302
HEAD https://web.archive.org/web/20030401203752/http://www.lanl.gov/worldview/ - 200

This is far too much traffic and unnecessary for memento replay.

With the new include_uri_checks=False option,

MementoClient('https://web.archive.org/web/',
    check_native_timegate=False).get_memento_info(
    'http://www.lanl.gov',
    datetime.datetime(2003, 3, 19, 23, 59, 59),
    include_uri_checks=False)

there is just one request:

HEAD https://web.archive.org/web/http://www.lanl.gov - 302

(The replay system will then do a GET on the closest memento and play back the 302)

shawnmjones commented 9 years ago

After more consideration, we are trying to determine the impact of these changes.

ikreymer commented 9 years ago

Sure, I'd be open to refactoring or just making this a separate function rather than a flag, if that would be cleaner.

shawnmjones commented 9 years ago

After careful consideration, we have decided not to incorporate these changes.

We carefully created this library as a reference implementation for a Memento client. Our intention was to make this library handle all edge cases, especially considering our initial customer was Wikipedia. For example, if the submitted URI is a URI-M, then if we merely append the URI-M to a TimeGate base URI, the chance is that a 404 will result. We want the library to provide a high quality URI-M, hence we follow the protocol strictly.

If you would like to discuss further, please email me, Harish, and/or Herbert directly and we will schedule a call for further discussion next week.

ikreymer commented 9 years ago

That is a disappointing response. I guess I won't be using this library then. If the library is only intended to be used with Wikipedia, it should say so. I don't really have time for a back-and-forth; I was just trying to make this useful for a broader range of applications, namely web archives.

As I explained above, the pull request does not change the default behavior, but adds an option to disable some of the assumptions you are making. For example, in the web archiving use case, any URL can be a URL-R, so that check does not really make sense. A Memento that has a 3xx status code is still a memento, and for some use cases, it is important to return that and not follow the redirect.

The intent was to be able to use the Memento protocol in the simplest way possible: given a TimeGate, an original URL, and a desired datetime, return the closest memento and its datetime, without making any other assumptions.

phonedude commented 9 years ago

Joining the conversation late, but just a quick note that while TimeGates will provide a Memento-Datetime header, it's often not the same value that you'll end up with when you get to the URI-M. Archives all the time advertise mementos at a particular datetime and then end up redirecting to another Memento-Datetime when dereferenced. If you really want to know the Memento-Datetime, you need to chase it all the way down to the URI-M.

ikreymer commented 9 years ago

That's certainly true, but it may be just as important to keep track of each memento and its datetime along the way.

Mostly this is just an argument for a simple low-level library function that performs one query to the TimeGate, and that's it. It leaves the decision to follow redirects or not to the caller. If the use case is an HTTP traffic replay system, then each Memento (redirect or not) should be sent back to the client/browser as a single transaction, and it is up to the client/browser to decide what to do with each memento (e.g. follow the redirect). This allows for some interesting custom behavior: for example, a client may change its Accept-Datetime based on the Memento-Datetime of a redirect, or it may stop the redirect chain altogether if the Memento-Datetime exceeds a certain time boundary, etc. A simpler reason is that without this, it's not possible to properly replay HTTP request/response traffic, because redirect responses would get filtered out.

Also a minor point, but according to the spec, the TimeGate in Pattern 2.1 (the most commonly used approach in web archives and in the Memento aggregator) must NOT provide a Memento-Datetime header directly http://tools.ietf.org/html/rfc7089#section-4.2.1 (only URL-Ms do that), but it MAY provide it indirectly in the form of a Link header which may contain a rel=memento entry and a corresponding datetime. Luckily, most implementations seem to include this Link header, but it is not required.

(Pattern 2.2 http://tools.ietf.org/html/rfc7089#section-4.2.2 does include the Memento-Datetime header in the TimeGate response, and for this reason and others, I would favor this approach for individual web archives, but that's another topic)

phonedude commented 9 years ago

I misspoke above when I said TimeGates return a Memento-Datetime header (if it were present a client couldn't tell an archived 302 from a live 302); as you mentioned they (optionally but typically) provide the Memento-Datetime value in the Link header. The exception as you note is the 200-style CN (pattern 2.2), but that's a different animal altogether (the most typical problem is that the browser won't update its URI with the value in Content-Location). Pattern 2.2 is defined and possible, but 200-style CN has pretty much fallen out of favor.

I think I understand what you're requesting, but (as you suggest above) perhaps instead of modifying MementoClient(), it would be better to have separate function(s) that disaggregates the steps. We kind of have something in that direction with "check_native_timegate=False", which short-circuits the HEAD on URI-R. Given that you can have X redirects before you finally hit a (real) TimeGate, and Y redirects between a TimeGate and a memento, perhaps the best thing is to have a function that takes as an input the previous HTTP response and then does the next (and only the next) step while returning the resulting HTTP response (so it can be used as an input to the next call).

Essentially, MementoClient() implements "curl -L" and the proposed function simply removes the "-L" flag and requires an explicit call to take the next step. This would allow an application (with all appropriate caveats) to be able to bail out of the Memento chain at any point without committing to chasing it down to some terminal response code (and maybe with some return values indicating its best guess as to where it is in the chain, e.g. "past a TimeGate but haven't hit a memento yet"). Or maybe it could allow for resuming a stalled / checkpointed Memento request chain without having to start from the beginning.
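To make the "curl without -L" idea concrete, here is a minimal sketch of a step-at-a-time chain. It is not part of MementoClient: the names Hop, take_step, next_url and follow_all are hypothetical, and fetch stands for any caller-supplied function that performs one HEAD request without following redirects and returns (status, headers). The sketch also assumes headers is a plain dict with a "Location" key spelled exactly that way.

```python
from collections import namedtuple
from urllib.parse import urljoin

# One request/response pair in the chain.
Hop = namedtuple("Hop", "url status headers")


def take_step(fetch, url, accept_datetime):
    """Issue exactly one HEAD request via `fetch` and wrap the result."""
    status, headers = fetch(url, {"Accept-Datetime": accept_datetime})
    return Hop(url, status, headers)


def next_url(hop):
    """Return the URL the next step would request, or None if the chain ended."""
    if 300 <= hop.status < 400:
        location = hop.headers.get("Location")
        if location:
            return urljoin(hop.url, location)  # also resolves relative Locations
    return None


def follow_all(fetch, url, accept_datetime, max_hops=10):
    """The `-L` behaviour, rebuilt from single steps."""
    hops = []
    while url is not None and len(hops) < max_hops:
        hop = take_step(fetch, url, accept_datetime)
        hops.append(hop)
        url = next_url(hop)
    return hops
```

Because each Hop carries its own URL, status and headers, a caller can stop after any step, store the Hop, and later resume by calling next_url on it.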

Although it was my understanding that the purpose of this lib was to hide all the Memento chain business in a full up client. Perhaps it should be a separate library, but it's also nice to have something that does the right thing with Accept-Datetime and all the resulting headers.

ikreymer commented 9 years ago

The curl analogy is a great one. If the MementoClient is like curl -L, I just want to be able to turn off the -L flag, and I hope the resolution wouldn't be to 'write your own curl without the -L flag'. :)

I'm not sure the chain / previous response thing is necessary. At least for implementing a server-side replay system, each HTTP request/response is discrete anyway. The use case is that the MementoClient performs the lookup for the HTTP server, one request/response at a time. It's essentially a proxy between the HTTP server software and a Memento service.

We're planning to discuss next week, Shawn will organize the call.

I'll send you a separate note re: pattern 2.1 / pattern 2.2 thoughts

phonedude commented 9 years ago

[just thinking out loud]

I would think if you allow for the process to be interrupted (i.e., no "-L"), then it would make sense to provide support in the library to resume the process as well. Eventually that response will get written to disk somewhere and then someone will come along later & want to pick it up again, but to do so they'll have to parse the headers directly (and likely do something wrong).

Also I think there is a certain elegance to allowing a debugging/step-through approach that can be used to produce the same final result as the full MementoClient() if the program explicitly chooses to follow the next step. It just seems like something that should be possible, even if it is not frequently done.

shawnmjones commented 9 years ago

Reopening after discussion with Herbert.

shawnmjones commented 9 years ago

Herbert is back from Europe. We all had a discussion about this today and have resolved that a new method to the MementoClient class is the best path forward. We suggest the name get_location_from_timegate and a call signature of the form:

mc.get_location_from_timegate(request_uri, accept_datetime)

where request_uri is a URI-R and accept_datetime is obviously Accept-Datetime

This function would then perform no checks on the URI-R, and just capture the first response from the location header presented by the TimeGate. It would also perform no checks against the URI returned from the location header, leaving that up to the discretion of the calling code.

We would prefer that response be a Python dictionary in the format of the JSON API (http://timetravel.mementoweb.org/guide/api/). In this case, the output would just contain something like the following:

{
  "original_uri":  "http://www.example.com/doc1.html",
  "timegate_uri":  "http://timetravel.mementoweb.org/timegate/http://www.example.com/doc1.html",
  "mementos": {
    "closest": {
        "datetime": "",
        "http_status_code": "",
        "uri" : [
          "http://web.archive.org/web/19700101000000/http://www.example.com/doc1.html"
        ]
    }
  }
}

The value for the TimeGate is specified when the object is instantiated using the existing timegate_uri. There is no datetime or HTTP status code present in the JSON output, because this information is acquired by making a request to the URI-M.

Also, in creating get_memento_info, we created additional methods for acquiring information throughout the process, that you may find useful in HTTP playback. Existing functions that may be helpful:

get_native_timegate_uri, used for sites whose URI-R headers contain their own timegates; takes a URI-R and an Accept-Datetime; then makes a request and reads the headers
is_timegate, used to determine if the URI-G specified does actually specify a TimeGate; takes a Python response object to avoid additional requests; will use the response object if present, otherwise sends a request to the URI-G
is_memento, used to determine if the URI points to a Memento; takes a URI-M and a Python response object to avoid additional requests; will use the response object if present, otherwise sends a request to the URI-M
__prepare_memento_response, used to generate the response in the JSON API format

Are the proposed new method and the existing functions helpful for the use case described?

ikreymer commented 9 years ago

Thanks for revisiting this. Yes, I agree a separate method makes sense and that can easily be done using the existing functions. I can refactor and submit another pull request.

For the response, though, the datetime should of course be included, as that's an essential part of this use case!

I would like the result to be exactly the same as the aggregator's JSON API, so the data will be read from the Link header (which hopefully includes the 'closest' entry, though unfortunately that may not always be the case, as I mentioned above).

Ex: http://timetravel.mementoweb.org/api/json/20030319/http://www.lanl.gov

This returns the closest memento and other available mementos, and if you get the closest URL-M with 'curl -I -H "Accept-Datetime: Thu, 20 Mar 2003 19:54:39" "http://web.archive.org/web/http://www.lanl.gov"', you'll see that it's a 302. The JSON API just returns exactly this info about the closest memento without following any redirects. This is what I am looking for.

Again, this exact functionality already exists in the memento aggregator JSON API, I would just like this available directly from a TimeGate endpoint, w/o having to use an aggregator.

This way, someone can use the MementoClient to implement their own aggregator system, for instance.

ikreymer commented 9 years ago

Basically, I just wanted to use the library to query a TimeGate, the equivalent of:

curl -I -H "Accept-Datetime: Thu, 20 Mar 2003 19:54:39" "http://web.archive.org/web/http://www.lanl.gov"

which includes this header:

Link: <http://www.lanl.gov>; rel="original", <http://web.archive.org/web/timemap/link/http://www.lanl.gov>; rel="timemap"; type="application/link-format", <http://web.archive.org/web/19961221031231/http://www.lanl.gov>; rel="first memento"; datetime="Sat, 21 Dec 1996 03:12:31 GMT", <http://web.archive.org/web/20030219210359/http://www.lanl.gov>; rel="prev memento"; datetime="Wed, 19 Feb 2003 21:03:59 GMT", <http://web.archive.org/web/20030320195439/http://www.lanl.gov>; rel="memento"; datetime="Thu, 20 Mar 2003 19:54:39 GMT", <http://web.archive.org/web/20030323014913/http://www.lanl.gov>; rel="next memento"; datetime="Sun, 23 Mar 2003 01:49:13 GMT", <http://web.archive.org/web/20150926154207/http://www.lanl.gov>; rel="last memento"; datetime="Sat, 26 Sep 2015 15:42:07 GMT"

I just want the library to return the value of such a Link header as a JSON/python dictionary, which is exactly what the aggregator JSON API already does. That really is it!
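For illustration, a Link header like the one above can be turned into Python data with a small regex-based parser. This is only a sketch, not the library's own implementation: parse_link_value is a hypothetical name, and the regexes assume simple attribute syntax (no escaped quotes inside values). Splitting naively on commas would fail here, because the datetime values themselves contain commas.

```python
import re
from email.utils import parsedate_to_datetime

# One link: <uri> followed by any number of ;name=value attributes.
LINK_RE = re.compile(
    r'<([^>]*)>((?:\s*;\s*[^=;,\s]+\s*=\s*(?:"[^"]*"|[^;,\s]*))*)')
PARAM_RE = re.compile(r';\s*([^=;,\s]+)\s*=\s*(?:"([^"]*)"|([^;,\s]*))')


def parse_link_value(value):
    """Return a list of {"uri", "rel", "datetime"} dicts, in header order."""
    links = []
    for uri, raw_params in LINK_RE.findall(value):
        params = {name.lower(): (quoted or bare)
                  for name, quoted, bare in PARAM_RE.findall(raw_params)}
        when = params.get("datetime")
        links.append({
            "uri": uri,
            # rel may hold several space-separated types, e.g. "first memento"
            "rel": params.get("rel", "").split(),
            "datetime": parsedate_to_datetime(when) if when else None,
        })
    return links
```

With the header above, the entry whose rel list is exactly ["memento"] carries the URL-M and datetime of the closest memento, without any request to the URL-M itself.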

Quite frankly, I don't really understand why this is at all complicated or controversial. ;)

shawnmjones commented 9 years ago

Several tests with http://www.cnn.com on Memento-compliant archives actually do return a memento relation, indicating that it is available for acquiring the Memento-Datetime. So, in those cases, the Memento-Datetime is accessible from the TimeGate response. If the Accept-Datetime is not submitted as a header, then no memento relation is present, but there is a last memento relation, which can be used.

Obviously, TimeGates for noncompliant archives, such as http://webcitation.org, will not have this relation, so there is no way to get this information.

I looked at RFC 7089 and it does not require a memento (or last memento) relation in the Link header for Pattern 1.1 or 2.1, so sites may not consistently list this relation, requiring one to make a request to the URI-M for the Memento-Datetime. The only required relation is original, and it explicitly states: 'The response MUST NOT contain a "Memento-Datetime" header.' The method get_memento_info holds close to this spec.

The static parse_link_header function takes the Link header and breaks it into a Python dictionary. The method __prepare_memento_response allows you to pass in the link header as a string along with the URI-M, the Memento-Datetime, and a status code for the URI-M.

So, with all of this in mind, do you see any issues with this procedure?

perform the request on the URI-G with the Accept-Datetime
get the URI-M from the Location header in the response
parse the Link header using parse_link_header to get the Memento-Datetime from the memento relation or the last memento relation, depending on whether Accept-Datetime was submitted
use __prepare_memento_response, passing it the URI-M, the Memento-Datetime, and the Link header to produce the JSON response

This gets almost everything needed to fill in the JSON. The only item lacking is the http_status_code, which you will not know until the URI-M is dereferenced. It is set to None if not passed to __prepare_memento_response.

Take care of the error handling and I think this will work. :-D

ikreymer commented 9 years ago

Several tests with http://www.cnn.com on Memento-compliant archives actually do return a memento relation, indicating that it is available for acquiring the Memento-Datetime. So, in those cases, the Memento-Datetime is accessible from the TimeGate response. If the Accept-Datetime is not submitted as a header, then no memento relation is present, but there is a last memento relation, which can be used.

Ah right, in this special case last memento == closest, so it should be used instead.

Obviously, TimeGates for noncompliant archives, such as http://webcitation.org, will not have this relation, so there is no way to get this information.

By non-compliant I assume you mean those that don't include the optional relations but otherwise follow the protocol? ... if they don't comply with the protocol, then all bets are off, right? ;)

I looked at RFC 7089 and it does not require a memento (or last memento) relation in the Link header for Pattern 1.1 or 2.1, so sites may not consistently list this relation, requiring one to make a request to the URI-M for the Memento-Datetime. The only required relation is original, and it explicitly states: 'The response MUST NOT contain a "Memento-Datetime" header.' The method get_memento_info holds close to this spec.

Yep. I alluded to this as well. I think this is an unfortunate oversight. I would say this is even backwards: it seems that the memento or closest memento relation should be the required one, rather than the original. A user likely knows the original one as they're querying the TimeGate with it, and the whole point of talking to the TimeGate is to get the URL-M and its datetime. Luckily for now, most active Memento deployments (those based on OpenWayback or pywb) do include the memento or last memento relation.

Yep, I think the steps below look good. I just added extra conditions for Pattern 2.2, 2.3, etc., which could have Content-Location and Memento-Datetime headers.

perform the request on the URI-G with the Accept-Datetime
get the URI-M from the Location header or Content-Location header in the response
parse the Link header using parse_link_header to get the Memento-Datetime from the memento relation or the last memento relation, depending on whether Accept-Datetime was submitted, or use the actual Memento-Datetime header, if present
use __prepare_memento_response, passing it the URI-M, the Memento-Datetime, and the Link header to produce the JSON response
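The steps above, applied to the headers of a single TimeGate response, could be sketched roughly as follows. This does not call the library's parse_link_header or __prepare_memento_response; info_from_timegate_response and datetime_from_link are hypothetical names. The regex assumes each memento entry lists rel before datetime, as the Wayback header quoted earlier in this thread does, and headers is assumed to be a plain dict with canonical header names.

```python
import re

# Matches <uri>; rel="..."; datetime="..." (assumes this attribute order).
MEMENTO_RE = re.compile(
    r'<([^>]*)>\s*;\s*rel="([^"]*)"\s*;\s*datetime="([^"]*)"')


def datetime_from_link(link_value, accept_datetime_sent=True):
    """Pick the datetime of rel="memento", or of rel="last memento"
    when no Accept-Datetime was sent."""
    wanted = {"memento"} if accept_datetime_sent else {"last", "memento"}
    for _uri, rel, when in MEMENTO_RE.findall(link_value):
        if set(rel.split()) == wanted:
            return when
    return None


def info_from_timegate_response(original_uri, timegate_uri, headers,
                                accept_datetime_sent=True):
    """Build a dict shaped like the JSON API from one TimeGate response."""
    # Location covers the redirect patterns, Content-Location covers Pattern 2.2.
    uri_m = headers.get("Location") or headers.get("Content-Location")
    # Prefer a real Memento-Datetime header, else fall back to the Link header.
    when = headers.get("Memento-Datetime") or datetime_from_link(
        headers.get("Link", ""), accept_datetime_sent)
    return {
        "original_uri": original_uri,
        "timegate_uri": timegate_uri,
        "mementos": {
            "closest": {
                "datetime": when,
                "http_status_code": None,  # unknown until the URI-M is dereferenced
                "uri": [uri_m] if uri_m else [],
            }
        },
    }
```

The datetime is kept as the raw header string here; a caller that needs a datetime object could convert it with email.utils.parsedate_to_datetime.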

Do we know of archives that support memento protocol but do not include any relations?

ikreymer commented 9 years ago

Alternatively, if there's no rel=memento, but there are other mementos, the approach may be to sort any present memento relation by time distance from the Accept-Datetime and compute the closest that way... It seems that archive.is for instance, does not include a rel=memento relation, but includes a prev, next, first, last and one of those is the closest...
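A minimal sketch of that fallback, assuming the memento relations have already been extracted from the Link header as (uri, datetime) pairs with timezone-aware datetimes; closest_memento is a hypothetical name:

```python
from datetime import datetime, timezone


def closest_memento(candidates, accept_datetime):
    """Return the (uri, datetime) pair nearest in time to accept_datetime,
    or None if there are no candidates."""
    return min(candidates,
               key=lambda pair: abs(pair[1] - accept_datetime),
               default=None)


# Usage with the first/prev/next/last entries from the Link header quoted
# earlier in this thread, leaving out the rel="memento" entry.
utc = timezone.utc
candidates = [
    ("http://web.archive.org/web/19961221031231/http://www.lanl.gov",
     datetime(1996, 12, 21, 3, 12, 31, tzinfo=utc)),
    ("http://web.archive.org/web/20030219210359/http://www.lanl.gov",
     datetime(2003, 2, 19, 21, 3, 59, tzinfo=utc)),
    ("http://web.archive.org/web/20030323014913/http://www.lanl.gov",
     datetime(2003, 3, 23, 1, 49, 13, tzinfo=utc)),
    ("http://web.archive.org/web/20150926154207/http://www.lanl.gov",
     datetime(2015, 9, 26, 15, 42, 7, tzinfo=utc)),
]
best = closest_memento(candidates, datetime(2003, 3, 19, 23, 59, 59, tzinfo=utc))
```

Note that this only finds the closest memento among those the TimeGate chose to list, which is not necessarily the closest one the archive holds.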

Again, an unfortunate side effect of not requiring the datetime to be unambiguously included in every TimeGate response.