oduwsdl / MemGator

A Memento Aggregator CLI and Server in Go
https://memgator.cs.odu.edu/api.html
MIT License

A redirect is issued when the scheme prefix with double slashes is encountered #16

Closed ikreymer closed 8 years ago

ikreymer commented 8 years ago
curl "192.168.59.103:1208/timenav/json/http://example.com/"
<a href="/timenav/json/http:/example.com/">Moved Permanently</a>

Not too big an issue, but unexpected behavior. It should accept the URL as is and not issue a redirect?

Also, I see that this redirects to a usage page, but perhaps it should default to using the current time as Accept-Datetime, if no timestamp is provided? This would be consistent with TimeGate behavior when no Accept-Datetime is present, and with wayback behavior.

ibnesayeed commented 8 years ago

This is currently as designed, but I am open to changing the flow based on feedback. Here is what is happening right now: the router does not recognize the endpoint based only on the first path component, such as /timemap or /timenav; instead, it matches the whole signature. For example, for a TimeNav request the path must match the /timenav/link|json|cdxj/{YYYY[MM[DD[hh[mm[ss]]]]]}/{URI-R} signature, where no path segment is optional; the datetime segment does have internal defaults, though, which allow some flexibility there. Any request that does not match one of the four endpoint signatures is delegated to the default multiplexer, which does some path canonicalization and cleanup. For the default Mux, the URI-R is nothing but part of the path. The path canonicalization in the standard Go HTTP library replaces multiple (non-leading) consecutive slashes with a single one, which then triggers a redirect to the canonical path.

Using URI-Rs in the path, rather than passing them as a URL-encoded query parameter, itself introduces a lot of ambiguities. Additionally, if the protocol scheme in the URI-R is optional, then allowing optional path components can be dangerous and may accidentally lead to wrong predictions. That's why I decided to have strict positions for each parameter in the path.

That said, we can redesign the flow so that it identifies the endpoint based on the first component, then checks the rest of the signature, and, if the request fails to match the strict signature of that endpoint, issues a 400 Bad Request response instead of delegating it to the default multiplexer. I thought about this approach before, but decided to keep the routes flexible. Do you think we should change the behavior this way, or do you have any better alternatives in mind?

ikreymer commented 8 years ago

I do like having some way to specify latest capture (or even earliest for that matter, though that seems harder). It does make the regex more complicated, so perhaps instead of omitting the timestamp, there should be a special symbol like /json/now/http://example.com/ (which could default to a TimeGate query with no ts, or just get the current system time).

For invalid requests, I think a direct 400 response would definitely be preferable to a redirect to some other page that shows an error.

ibnesayeed commented 8 years ago

Also, I see that this redirects to a usage page, but perhaps it should default to using the current time as Accept-Datetime, if no timestamp is provided?

I did think about that, as the TimeNav endpoint is a friendly equivalent of TimeGate, but it introduces complexity on both ends. The client has to format a URL for the request, and making the datetime optional would allow two variations that the client can choose between based on its requirements. This choice has a cost: the client needs to know about two formats (or possibly URL templates). The benefit is small, as it only saves the client from querying the current time, which might not be a big deal if the client has to deal with time objects in successive requests anyway. Additionally, it would complicate the URL parsing logic on the server side and might even cause ambiguities. I am not saying it is not doable, but TimeNav calls carry multiple pieces of information in the URL, which is not the case with a TimeGate call, where the format is always fixed and the time is sent separately in a header.

Additionally, it is consistent with LANL's API, though they don't call it TimeNav (I am a fan of short names and consistency, so I gave it one).

ibnesayeed commented 8 years ago

I do like having some way to specify latest capture (or even earliest for that matter, though that seems harder)

This logic is currently offloaded to the client. Any 200 response will have the first and last mementos in the payload. Currently, the way to specify a time that is guaranteed to be closest to the first memento is to use a pre-archive date, and for the latest memento the current time will do the trick. This endpoint will almost always be used by an intelligent client that knows the concept of time and can deal with it. I don't see the utility of static links (considering plain HTML a dumb client) that point to TimeNav data. However, I do understand that it can be useful in a redirect service, which I think would be better dealt with if the Robust Links specs had something to say about it (I think I might toss that on the mailing list).

ibnesayeed commented 8 years ago

I think it is worth noting here that the CLI equivalent of TimeNav (or TimeGate) works by checking whether an optional second argument is present after the URI-R argument. If that second argument is not present, the call is considered a TimeMap request. However, if the second argument is present but is not a valid datetime, it is ignored and a TimeMap is returned. I think if the routing of the server is reworked, the CLI should throw an error message if the second argument is present but not a valid datetime. And yes, the CLI won't allow any default datetime unless some sort of flag is added for that, which would be inconsistent.