iipc / openwayback

The OpenWayback Development
http://www.netpreserve.org/openwayback
Apache License 2.0

Feedback on CDX Server requirements page #305

Open ikreymer opened 8 years ago

ikreymer commented 8 years ago

Hi, I wanted to give feedback on the CDX Server requirements wiki page.

There's not really a good way to comment on the page though, so rather than just editing the wiki page, I thought it'd be easier to start a conversation as an issue. Feedback follows as comments.

As part of making the CDX-Server the default index engine for OpenWayback, we need to clean up and formally define the API for the CDX-Server. This document is meant as a workspace for defining those APIs.

I think that's a great idea, especially since this API can be shared across multiple implementations, not just OpenWayback.

The CDX-Server API, as it is today, is characterized by a relatively close link to how the underlying CDX format is implemented. Functionality varies depending on whether you are using traditional flat CDX files or compressed zipnum clusters. One of the nice things about having a CDX Server is to separate the API from the underlying implementation. This way it would be relatively easy to implement indexes based on other technologies in the future. As a consequence, we should avoid implementing features just because they are easy to do with a certain format if there is no real need for them. The same feature might be hard to implement on other technologies.

The intent was to keep it separate (and there is support for different output formats, e.g. JSON lines). The zipnum cluster does provide extra APIs, such as pagination, but that is mostly because pagination is otherwise technically difficult without a secondary index; nothing ties it to the zipnum cluster implementation in particular. The 'secondary index' is presented as a separate concept and could perhaps be abstracted out further.

The API should also try to avoid giving the user conflicting options. For example, in the current api it is possible to indicate the match type both with a parameter and with a wildcard. It is then possible to set matchType=prefix and at the same time use a wildcard indicating matchType=domain.

Sure, the wildcard query was added as a 'shortcut' in place of the matchType parameter, 'syntactic sugar', but if people feel strongly about removing one or the other, I don't think it's a big deal.

The following is a list of use-cases seen from the perspective of a user. Many of the use-cases are described as expectations of the GUI of OpenWayback, but are meant to help the understanding of the CDX-Server's role. For each use-case we need to understand what functionality the CDX-Server is required to support. CDX-Server functionality with no supporting use-case should not be implemented in OpenWayback 3.0.0.

This is a work in progress. Edits and comments are highly appreciated.

The CDX Server API was not just designed for GUI access in OpenWayback, but a more general API for querying web archives. The interactions from a GUI in OpenWayback should be thought of as a subset of the functionality that the API provides. Everything that was in the API had a specific use case at one point or another.

As a starting point, the CDX Server API provides two APIs that are defined by Memento: the TimeGate (closest match) lookup and the TimeMap (capture list) query.

The closest match functionality is designed to provide an easy way to get the next closest fallback if replay of the first memento fails, allowing the replay system to try the next best, and so forth.

Another use case was better support for the prefix query, where the result is a list of unique urls matching the prefix, each followed by its start date, end date and capture count. The query can then be continued to get more results from where the previous query ended.

Another important use case is parallel bulk querying, which can be used for data extraction. For example, a user may wish to extract all captures by host, prefix, or domain across a very large archive. The user can create a MapReduce job to query the CDX server in parallel, where each map task sets the page value. (Implementations of this use case already exist in several forms.)

The difference between the bulk query and the regular prefix query is that the pagination api allows you to query a large dataset in parallel, instead of continuing from where the previous query left off. This requires pagination support, which in turn requires the zipnum cluster; it would in theory be possible to support without it, but that would require a lot more work to sample the cdx and determine the page distribution.
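
As a rough illustration of the bulk pattern described above, here is a minimal sketch assuming a CDX server with pagination support (showNumPages and page=, as on the IA endpoint); the query and endpoint are illustrative only, and some implementations return a JSON object rather than a bare page count:

import concurrent.futures
import urllib.request

CDX = "http://web.archive.org/cdx/search/cdx"   # illustrative endpoint
QUERY = "url=example.com&matchType=domain"      # illustrative query

def fetch(url):
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

# Ask how many pages the result set spans (needs the secondary/zipnum index).
num_pages = int(fetch(f"{CDX}?{QUERY}&showNumPages=true"))

# Fetch pages in parallel; in a MapReduce job each map task would own one page= value.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for page in pool.map(lambda i: fetch(f"{CDX}?{QUERY}&page={i}"), range(num_pages)):
        for cdx_line in page.splitlines():
            pass  # process each capture line here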

Another use case was resolving revisit records, if the original was the same url, in a single pass, to avoid having to do a second lookup. This is done by appending the original record as extra fields. This may not be as useful if most deduplication is 'url agnostic'.


Use-cases

1. The user has a link to a particular version of a document

This case could be a user referencing a document from a thesis. It is important that the capture referenced is exactly the one the user used when writing the thesis. In this case the user should get the capture that exactly matches both the url and timestamp.

This is more of a replay system option rather than a cdx query option. What happens if the exact url doesn't exist? There is no way to guarantee an exact match just by url and timestamp; you would also need the digest. You can filter by url, timestamp and digest with the cdx server, but not via the replay (archival url) format.
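
For illustration, such a capture could be pinned down with the existing query parameters roughly like this (hypothetical url and digest value, and assuming the index exposes a digest field usable with filter=):

import urllib.parse

params = {
    "url": "http://example.com/cited-page.html",          # hypothetical url
    "from": "20150601120000",                              # same value for from/to
    "to": "20150601120000",                                #   pins one exact second
    "filter": "digest:LCS3Y325DBWACVMTNCM3RRBMFRBXPGVH",   # made-up digest value
    "limit": "1",
}
print("http://cdxserver.example.com/cdx/main?" + urllib.parse.urlencode(params))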

2. The user selects one particular capture in the calendar

Pretty much the same as above, but it might be allowed to return a capture close in time if the requested capture is missing.

I think this is not at all the same as above, but rather closest capture/timegate behavior. An option could be added to remove closest matching and only do exact matching, but again, this is a replay system option, not a cdx server option.

3. Get the best matching page when following a link

The user is looking at a page and wants to follow a link by clicking it. The user then expects to be brought to the closest-in-time capture of the new page.

4. Get the best match for embedded resources

Similar to above, but user is not involved. This is for loading embedded images and so on.

It seems that these all fall under the 'closest match' / Memento TimeGate use case.

5. User requests/searches for an exact url without any timestamp, expecting to get a summary of captures for the url over time

The summary of captures might be presented in different ways, for example a list or a calendar.

Yep, this is the Memento TimeMap use case.

6. User looks up a domain expecting a summary of captures over time

7. User searches with a truncated path expecting the results to show up as matching paths regardless of time

8. User searches with a truncated path expecting the results to show up as matching paths regardless of time and subdomain

These are all different examples of the prefix query use case.

9. User navigates back and forth in the calendar

This is already possible with the timemap query, right?

But, could also add an "only after" or "only before" query, to support navigating in one direction explicitly.

10. User wants to see when content of a page has changed

This seems more like a replay api, as cdx server is not aware of embeds or relationships between different urls.

johnerikhalse commented 8 years ago

Thanks @ikreymer for the valuable input.

First of all, this is a work in progress, so there are probably a lot of use-cases still to be written, including use-cases for processing by a Map-Reduce framework. To get started, I began to write up some use-cases based on the end-user experience in OpenWayback. As you correctly mention, several of these use-cases can be supported by the same set of functionality from the CDX-Server. I just wanted the list to be as complete as possible.

While working with OWB and the CDX-Server, it is a challenge for me to understand all the options supported by the CDX-Server, partly because several parameters are not documented, but also because I do not understand the use-case for every parameter. I might be wrong, but my impression is that sometimes functionality has been implemented to work around shortcomings which preferably should have been fixed elsewhere. I think the 3.0.0 version could be a good time to clean up even if it breaks backward compatibility. After all, the current CDX-Server is still marked as BETA.

I was also thinking of including a comment for each use-case on whether the use-case is valid or doesn't make sense. This way we would keep track of ideas that have been rejected.

The main goal with this document is to make it easier for new people (like me) to participate in coding. The use-cases written so far only reflect my understanding. I would love it if you (or anybody else) would add, remove or improve use-cases.

One thought I have is whether it would be better to split the API into more than one path and reduce the number of parameters. For example, the from/to parameters don't make sense combined with sort=closest; instead the closest parameter is used for the timestamp. I don't think wildcard queries should be combined with closest either (at least I have not found the use-case). Maybe we could split the API into something like /query, which allows a timestamp parameter, and /search, which allows from/to parameters and wildcards and/or matchType. This way the API would be more explicit and less code for checking combinations of parameters would be needed on the server side.

I would also like to know the use-cases for parallel queries with the paging API. If I'm not mistaken, the current implementation could give different results for the paging API than for the regular API if you use collapsing. This is because the paging API will not know whether the starting point for a page, except for the first one, is in the middle of several lines to be collapsed. Maybe a separate path (e.g. /bulk) should be added which does not allow manipulation like collapse?

For revisits, I wonder if that is a task for the CDX-server to resolve. I see the efficiency of one pass as you mention, but also the problem with url agnostic revisits. Maybe that should either go into a separate index keyed on digest, or be added as extra fields when creating CDX-files instead of adding them dynamically.

ikreymer commented 8 years ago

Thanks @johnerikhalse It sounds like you'd prefer editing in the wiki directly. I can move some of my comments to the page, and add additional use cases. Another option might be to have an issue for each proposed cdx server change or suggestion, so that folks can comment on additional suggestions as well, and then, at the end, modify the wiki based on the suggestions.

One thought I have is whether it would be better to split the API into more than one path and reduce the number of parameters. For example, the from/to parameters don't make sense combined with sort=closest; instead the closest parameter is used for the timestamp. I don't think wildcard queries should be combined with closest either (at least I have not found the use-case). Maybe we could split the API into something like /query, which allows a timestamp parameter, and /search, which allows from/to parameters and wildcards and/or matchType. This way the API would be more explicit and less code for checking combinations of parameters would be needed on the server side.

Hm, I think that from/to can still make sense with sort=closest, if you wanted to restrict the results to a certain date range (not certain if that works currently). The closest sort is definitely a special case.

I would also like to know the use-cases for parallel queries with the paging API. If I'm not mistaken, the current implementation could give different results for the paging API than for the regular API if you use collapsing. This is because the paging API will not know whether the starting point for a page, except for the first one, is in the middle of several lines to be collapsed. Maybe a separate path (e.g. /bulk) should be added which does not allow manipulation like collapse?

Yes, the collapsing would just happen on that specific page, as that is all that is read. I can see the benefits of a separate 'bulk' api in some sense, though removing functionality may not be the best approach.

I think the cdx server APIs should be thought of as a very low level api, designed to be used by other systems and perhaps advanced users. As such, there could be combinations that don't quite make sense. There could be higher level APIs, such as a 'calendar page api' or 'closest replay api', which prevent invalid combinations. Similarly, there could be a 'bulk query api' which helps the user run bulk queries.

For revisits, I wonder if that is a task for the CDX-server to resolve. I see the efficiency of one pass as you mention, but also the problem with url agnostic revisits. Maybe that should either go into a separate index keyed on digest, or be added as extra fields when creating CDX-files instead of adding them dynamically.

Yeah, I suppose this can be removed and it would be up to the caller to resolve the revisit. Or, better yet, there could be a separate 'record lookup API' which queries the cdx API once or more times to find the record and the original (if revisit).

johnerikhalse commented 8 years ago

Another option might be to have an issue for each proposed cdx server change or suggestion, so that folks can comment on additional suggestions as well, and then, at the end, modify the wiki based on the suggestions.

I agree that we should have an issue for each change in the cdx server. I also agree that the cdx server is a pretty low level API. The document I started writing is meant as background for proposing changes. It will hopefully be easier to see why a change is necessary when you can relate it to one or more use cases. It wasn't my intention to describe the cdx server in that document; that should go into separate documentation of the API. Usually, fulfilling the requirements of a use case will take a combination of functionality in the cdx server and the replay engine.

I think the cdx server APIs should be thought of as a very low level api, designed to be used by other systems and perhaps advanced users. As such, there could be combinations that don't quite make sense. There could be higher level APIs, such as a 'calendar page api' or 'closest replay api', which prevent invalid combinations. Similarly, there could be a 'bulk query api' which helps the user run bulk queries.

Even if the API is low level, I think it's important that users get predictable results. For example, I think it's a bad idea if a function returns different results depending on whether you are using paging or not. Another example would be if a function returns different results for a configuration with one cdx-file than for a configuration with multiple cdx-files.

I have a question about match type. I can see the justification for exact, prefix and domain, but host seems to be just a special case of prefix. Is that correct? According to the current documentation you can use wildcards to indicate prefix and domain, and no wildcard for exact. If host isn't necessary, it would be possible to avoid the matchType parameter altogether.

johnerikhalse commented 8 years ago

One comment on how low level the cdx server api is. I agree that it is too low level for most end users, but there is built-in logic assuming certain use cases which, in my opinion, disqualifies it from being called very low level. For example, filtering is done before collapsing (which makes sense). Given the query parameters filter=statuscode:200&collapse=timestamp:6, you could get quite different results if filtering was done after collapsing. If someone comes up with a use case where filtering should be done after collapsing, or perhaps one filter before collapsing and another filter after, that would not be possible because the cdx server has this built-in assumption about the use cases.
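
To make the ordering point concrete, here is a toy sketch (not the server's actual code) of why filter-then-collapse and collapse-then-filter can diverge for filter=statuscode:200&collapse=timestamp:6:

# Toy captures as (timestamp, statuscode); collapse=timestamp:6 keeps one capture
# per distinct first 6 timestamp digits, i.e. one per month.
captures = [
    ("20160101120000", "404"),
    ("20160115120000", "200"),
    ("20160201120000", "200"),
]

def collapse(rows, digits=6):
    seen, out = set(), []
    for ts, status in rows:
        if ts[:digits] not in seen:
            seen.add(ts[:digits])
            out.append((ts, status))
    return out

def only_200(rows):
    return [r for r in rows if r[1] == "200"]

# Filter first (current behaviour): both months keep a 200 capture.
print(collapse(only_200(captures)))  # [('20160115120000', '200'), ('20160201120000', '200')]

# Collapse first: January collapses to its 404 capture, which the filter then drops.
print(only_200(collapse(captures)))  # [('20160201120000', '200')]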

For a really low level api we would have to implement a kind of query language, but I think that would be to overcomplicate things. I think it is better to try to document real world use cases and create enough functionality to support them. It is important, though, to document those built-in assumptions.

Makes sense?

ikreymer commented 8 years ago

Yes, a full query language may be a bit too much at this point. If collapsing is the main issue, perhaps it can be taken out, except in a very specific use case. For example, I believe the collapseTime option (not documented) was added specifically to allow a specific optimization of skipping secondary ZipNum blocks. I think it makes sense to examine the collapsing and see if it belongs, and perhaps document certain optimizations. Most of the optimizations were created in the context of IA wayback scale cdx queries, and may not even be relevant for smaller deployments.

As a reference, in the pywb cdx server API, I have implemented CDX features only as they became needed (https://github.com/ikreymer/pywb/wiki/CDX-Server-API), resulting in a smaller subset of the current openwayback cdx server API. Collapsing has not really been needed, and it is something that I think may best be done on the client.

For example, a calendar display may offer dynamic grouping as the user switches between zoom levels, without having to make a server-side request each time.
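
A sketch of the kind of client-side grouping meant here, counting captures per period from an uncollapsed cdx result (the timestamps are made up):

from collections import Counter

# Timestamps taken from a plain (uncollapsed) cdx query response.
timestamps = ["20150103120000", "20150105090000", "20150212000000", "20160101000000"]

# The zoom level decides how many leading digits to group on: 4=year, 6=month, 8=day.
def group(timestamps, digits):
    return Counter(ts[:digits] for ts in timestamps)

print(group(timestamps, 4))  # per-year counts for a zoomed-out calendar
print(group(timestamps, 6))  # per-month counts after zooming in, without a new request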

Also, taking a look at http://iipc.github.io/openwayback/api/cdxserver-api.html, I would strongly advise against going in this direction. The /search, /query and /bulk distinctions are extremely confusing (not to mention that search and query are often used synonymously). It appears that under this scheme, there would be 3 different ways to query the same thing:

http://cdxserver.example.com/cdx/main/search?url=foo.com
http://cdxserver.example.com/cdx/main/query?url=foo.com
http://cdxserver.example.com/cdx/main/bulk?url=foo.com

which would make things more complicated. I think that most of the CDX server features can be combined (collapsing perhaps being an exception), and when they can't, that can be documented. I don't think there is a need for multiple endpoints at this time.

johnerikhalse commented 8 years ago

Thanks for the comments @ikreymer

For example, I believe the collapseTime option (not documented) was added specifically to allow a specific optimization of skipping secondary ZipNum blocks.

If it is just an optimization not altering the results compared to the general collapse function, then I think such optimizations should be done by analyzing the query and not by adding more parameters.

Collapsing has not really been needed, and it is something that I think may best be done on the client.

For example, a calendar display may offer dynamic grouping as the user switches between zoom levels, without having to make a server-side request each time.

That's true if by client we mean the browser. If the client is OpenWayback, then I think a new roundtrip to the Cdx Server is better, to avoid keeping state in OWB.

The /search, /query and /bulk distinctions are extremely confusing (not to mention that search and query are often used synonymously).

I agree that the names 'search' and 'query' are not good names (suggestions welcome). But I do think the split makes sense. The query path is meant to support the Memento TimeGate use case and the search path supports the Memento TimeMap, as described in your first posting on this issue.

It appears that under this scheme, there would be 3 different ways to query the same thing: http://cdxserver.example.com/cdx/main/search?url=foo.com http://cdxserver.example.com/cdx/main/query?url=foo.com http://cdxserver.example.com/cdx/main/bulk?url=foo.com

No, they are not the same.

ikreymer commented 8 years ago

If it is just an optimization not altering the results compared to the general collapse function, then I think such optimizations should be done by analyzing the query and not by adding more parameters.

The optimization does alter the results; I believe it allows for skipping of secondary index zipnum blocks when the data is too dense (e.g. over 3000 captures of the same url within one minute/hour/day, etc.).

I agree that the names 'search' and 'query' are not good names (suggestions welcome). But I do think the split makes sense. The query path is meant to support the Memento TimeGate use case and the search path supports the Memento TimeMap, as described in your first posting on this issue.

Well, you could use timegate and timemap, but the cdx server does a lot more than memento, so that may only add to the confusion. The difference between these is 'list all urls' and 'list all urls sorted by date closest to x', so in this case closest=<date> makes the most sense to the user.

Even if the names were different, switching between query <-> search just changes the default value of the sort field, right? I think it just makes things more complicated than allowing the user to specify sort, as it does now (and the user can still change the sort order). This doesn't seem to solve any problem.

The separate bulk endpoint might make sense, if there is a good reason that the page= counter would not suffice.

The bulk request will return a list of urls to batches which could be retrieved by subrequests. The reason to return a list of urls is to allow the Cdx Server to spread load to several instances if such functionality is to be implemented.

That is a very dubious reason for such a big change. Modern load-balancers can spread the load automatically amongst worker machines. This is like arguing for going back to the era of using www2./www3. prefixes to spread the load on a web site.

Also, when is a query considered 'bulk'? Suppose a user wants to get all results for a single url; there could be 1 result or there could be, say, 50,000. Should they use the bulk query or the 'non-bulk' query? If they use the non-bulk query, will they see all the results at once, or will the result be cut off at some arbitrary limit (since there is no pagination)? If it's all the results, but there is no pagination, the result may be much larger than expected and will take longer to load. If there is a cut-off, then the user may not get all the results. Thus, it seems like there is no reason to use the non-bulk query at all when the pagination-supporting bulk query is available.

In the current system, page=0 is implied, so the user will see the first page of results. It is also possible to query and set the page size, so the user has an idea of the max number of results to expect. (This can be improved too).

Again, I would suggest ways to improve the current API rather than creating distinct (and often conflicting) endpoints that only add more user decisions (bulk vs non-bulk, search vs query, etc.) and do not actually add any new features.

johnerikhalse commented 8 years ago

The optimization does alter the results; I believe it allows for skipping of secondary index zipnum blocks when the data is too dense (e.g. over 3000 captures of the same url within one minute/hour/day, etc.).

I'm not sure if I understood this correctly, but I wondered whether the result is altered compared to using ordinary collapse, or whether collapsetime is just an optimisation of collapse=time. In the latter case I think the server should be smart enough to optimize just by looking at what fields you are collapsing on.

Even if the names were different, switching between query <-> search just changes the default value of the sort field, right?

Looking at http://iipc.github.io/openwayback/api/cdxserver-api.html, I realized that it might not be obvious that the methods (the blue get boxes) are clickable to get a detailed description of the properties. I added a label above the column to make it clearer. I also updated the descriptions somewhat. The reason I mention this is that there are other differences than just the sort, which hopefully should be clear from the documentation and need not be repeated here. If that is not the case, then feedback is valuable to enable me to enhance it.

The bulk request will return a list of urls to batches which could be retrieved by subrequests. The reason to return a list of urls is to allow the Cdx Server to spread load to several instances if such functionality is to be implemented.

That is a very dubious reason for such a big change. Modern load-balancers can spread the load automatically amongst worker machines. This is like arguing for going back to the era of using www2./www3. prefixes to spread the load on a web site.

Well, I might have been a little too fast when stating that as the primary reason; it should be seen more as a possibility. Another possible use could be to hand bulk requests over to dedicated servers for that purpose. Anyway, the primary reason was to get closer to the guidelines for REST. REST advocates the use of absolute urls for references. That leaves less work on the client and eases the evolution of the api. Even though the Cdx Server is not RESTful, I think following the guidelines where possible is reasonable.

Also, when is a query considered 'bulk'? Suppose a user wants to get all results for a single url; there could be 1 result or there could be, say, 50,000. Should they use the bulk query or the 'non-bulk' query? If they use the non-bulk query, will they see all the results at once, or will the result be cut off at some arbitrary limit (since there is no pagination)?

If it's all the results, but there is no pagination, the result may be much larger than expected and will take longer to load. If there is a cut-off, then the user may not get all the results.

The bulk api is not meant to be used by OpenWayback, but by processing tools, and yes, it definitely supports paging, which the other apis in the proposal don't.

To not discuss everything in one issue, I created a separate issue for the bulk api (#309) where I try to describe it in more detail.

ldko commented 8 years ago

In response to the CDX Server API, I think I understand the motives behind having both a /get and a /search but don't wholly agree with the need. It looks like the main difference is that /get is meant to look up an "exact URL," but because of the nature of web archives in which a URL in itself does not clearly reference a single resource, requests to /get are still negotiating a set of results by use of query parameters very similar to those used at /search.

Also, if I understand, you are saying for /get requests:

Collapse is not allowed since it is not clear in which order the captures are returned.

Not having collapse as a parameter for /get seems to somewhat go against the examples in the README of wayback-cdx-server-webapp, where collapse is explained to play a significant role in calendar pages. Based on the current API proposal, is /get what would be used for calendar pages?

Regarding /bulk:

Anyway, the primary reason was to get closer to the guidelines for REST. REST advocates the use of absolute urls for references.

Many RESTful systems use pagination. If the next and previous pages are discoverable from a page, then it does not violate REST. Previous and next links can be given in HTTP Link headers and/or in the body of the response (which would look relatively normal in a JSON response); the total number of results could also be returned in the headers or the response body. What I see currently in the proposal for /{collection}/bulk looks similar to the sort of request you would have if you just used pagination on /search, so I am not sure a separate path needs to be used.
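
For illustration, a discoverable paginated response along these lines might look roughly like the following; the field names and endpoint are purely illustrative, not part of any existing proposal:

# Hypothetical body of a paginated /search response with discoverable links.
page = {
    "total": 456983,
    "links": {
        "self": "http://cdxserver.example.com/cdx/main/search?url=foo.com&page=2",
        "prev": "http://cdxserver.example.com/cdx/main/search?url=foo.com&page=1",
        "next": "http://cdxserver.example.com/cdx/main/search?url=foo.com&page=3",
    },
    "captures": [],  # the cdx records for this page
}
# The same prev/next urls could alternatively be sent as HTTP Link headers.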

ikreymer commented 8 years ago

I agree with everything @ldko stated. There just does not seem to be a good reason to split the api into several endpoints at this point.

Most of the options can interact with all the other options, including collapse. As I recall, it was specifically implemented to support grouping by date, and the collapse and collapseTime confusion is definitely one area that can be improved... (it's been many years since I looked at it in much detail).

My point was that if /bulk has pagination, why shouldn't /search? Saying an API is only for external use also doesn't seem like a good enough reason. Suppose someone wants to use the bulk API with openwayback directly. For example, a UI could be added to allow users to get a quick estimate of the number of pages (IA wayback had this at one point).

There can definitely be improvements to the showNumPages query though. One is to make it a JSON object, as has been done in the pywb implementation.

An example of it is seen here: http://index.commoncrawl.org/CC-MAIN-2015-48-index?url=*.com&showNumPages=true which returns:
{"blocks": 444486, "pages": 88898, "pageSize": 5}

For IA, this query: http://web.archive.org/cdx/search/cdx?url=*.com&showNumPages=true returns 16 million+ pages. If you add a page listing, you would have to list all 16 million urls, where the only thing that changes is the page number... Sure, you can add pagination to that, but why bother making this change?

A lot of thought went into making the original API what it is, and there are already a few tools that work with the current api (for bulk querying), so I think any significant changes should have a clear positive benefit.

johnerikhalse commented 8 years ago

Response to @ldko

Based on the current API proposal, is /get what would be used for calendar pages?

No, get is for following a single resource. Search is for calendar pages.

Many RESTful systems use pagination. If the next and previous pages are discoverable from a page, then it does not violate REST.

In the current implementation you get the number of pages by adding the showNumPages parameter, which does not give any discoverable uris. Instead, you need to know how to build the query and which combinations are meaningful. You are right that it doesn't violate REST if a header leads you to the next page, but then processing in parallel is not possible because you need to request the pages in sequence.

/get is meant to look up an "exact URL," but because of the nature of web archives in which a URL in itself does not clearly reference a single resource, requests to /get are still negotiating a set of results by use of query parameters very similar to those used at /search.

Yes, the Cdx Server is not RESTful, and will probably never be, because there is no way to reliably address resources. On the other side, I think using concepts from REST where possible could ease the usage of the api. /get is as close as we get to addressing a resource. The main difference between /search and /get is that the latter needs to sort results closest to a certain timestamp. That causes much work on the server for big result sets. Because of that, I have tried to design it to avoid working with such big result sets as much as possible.

Response to @ikreymer

For IA, this query: http://web.archive.org/cdx/search/cdx?url=*.com&showNumPages=true returns 16 million+ pages. If you add a page listing, you would have to list all 16 million urls, where the only thing that changes is the page number... Sure, you can add pagination to that, but why bother making this change?

That is true if the current way of implementing paged queries is to be used. As stated in #309, it is up to the server to choose how many batches to split the result into. My intention was to not return a bigger set of batches than is needed for a good distribution in a processing framework. That's why each batch might have resumption keys, since each batch might be quite big.

A lot of thought went into making the original API what it is, and there are already a few tools that work with the current api (for bulk querying), so I think any significant changes should have a clear positive benefit.

I do not question that, and I really like the modularity the Cdx Server brings to OpenWayback. To me it looks like the api was thought through, but has also evolved over time. Some functions seem to have been added to solve a particular requirement without altering existing functions. This is of course reasonable to avoid breaking backwards compatibility. But since we now want to change the status of the Cdx Server from an optional part of deployment to being the default, we also want to remove the beta label. That is a good time to look into the api once more to see if it still serves the requirements and can also meet the requirements of the foreseeable future.

johnerikhalse commented 8 years ago

Since this discussion has turned out to be about the justification for one or many entry points, I think I should recap and try to explain why I came up with the suggestion of more than one entry point. By just looking at the current api, I can understand why there are some objections. Using different entry points is not a goal in itself; it just felt like a natural way of solving some issues I came across while playing with the api and reading the code. I'll start with some examples from the current api:

First, to follow up on @ikreymer's example http://web.archive.org/cdx/search/cdx?url=*.com&showNumPages=true, I then tried: http://web.archive.org/cdx/search/cdx?url=*.com&page=0. That query took about seven minutes to process, giving ~15000 captures. I guess that's because of disks spinning down or some other IA-specific issue. Subsequent queries were more in the 20-40 second range. Still, I think this is too slow. So my first goal was to look into how things could be faster.

Then I tried different sorting on the same prefix query (url=hydro.com/*): one query with the default sort, and one with sort=closest&closest=20130101000000.

Since none of these queries returned a resume key, I expected the results to be complete and only differ in order. That was not the case. The first query returned 464023 captures while the second returned 423210 captures.

So maybe the resume key isn't implemented in IA's deployment. But using paging should give me the complete list. I start by getting the number of pages by adding showNumPages=true to both queries.

Both queries return 36 pages which is fine.

Then I created a loop to get all the pages like this:

for i in `seq 0 36`; do
    curl "http://web.archive.org/cdx/search/cdx?url=hydro.com/*&page=${i}" > cdxh-asc-p${i}.txt
    curl "http://web.archive.org/cdx/search/cdx?url=hydro.com/*&sort=closest&closest=20130101000000&page=${i}" > cdxh-closest-p${i}.txt
done

Even though none of the queries were executed in parallel, I got several failures caused by stressing the server too much. Of course I don't know what other queries were being executed on the server, but it seems like it already had too much to do.

I concatenated the page queries and expected the result from the non-paged queries to be the same as that from the concatenated, paged queries. At least I only expected differences at the end of the result, if the non-paged result was cut off by a server limit. Unfortunately that was not the case. Even though the paged queries returned more results (as was expected if the non-paged query was cut off), they also missed a lot of captures which were present in the non-paged result.

I also noticed that there were duplicates in the responses for the normal sort (both paged and non-paged) and that the result wasn't sorted in ascending order, as can be seen from the following table (numbers are the line count in each result):

                     plain response   processed by uniq   processed by sort and uniq
non-paged            464023           456985              456983
concatenated paged   509284           503953              503951

The documentation says that paged results might not be as up to date as the non-paged ones. That can explain the differences. The duplicates could perhaps be due to errors in the creation of the cdxes. Even though I'm not happy about it, I accept that for now.

So I looked at the results for sort=closest:

                     plain response   processed by uniq   processed by sort and uniq
non-paged            423210           423210              423210
concatenated paged   480941           480941              480941

The duplicates are gone :+1: , but I expected the results to be the same size as the deduplicated non-sorted results, i.e. 456983 == 423210 and 503951 == 480941, which is not the case. For sort=closest, I also expected the first page of the paged result to be close to the first part of the non-paged result (with small differences due to paged results not being as up-to-date as the non-paged ones). But they were not even close.

I have played around with other parameters as well, and there are lots of examples where the results are not what I would expect.

Conclusion

My first thought was to "fix the bugs". I started to guess what the right outcome for the different parameters should be. That was not easy, which is the reason I started to write the use-cases document. I found the api itself to be illogical and redundant. For example, there are no less than three different ways to achieve some form of paging or of reducing the number of captures in each response: the limit/offset, resumeKey and page apis. There are two ways to tell how to match the url (which could be given simultaneous, contradictory values). For collapsing, there are both collapse and collapsetime. Some of this has been discussed earlier.

Even with those things put aside, I found the earlier mentioned bugs hard or impossible to fix without causing too much work on the server. For example, to get what I would expect from the paged queries with sort=closest requires the server to scan through all the results and sort them before returning a page. That is not compatible with use as a source for big data processing. The example also combines a prefix search with sort=closest, which I can't see any use for, but which puts a heavy burden on the server.

Since the Cdx Server has until now been an optional part of an OpenWayback deployment, and the api is labeled beta in the documentation, I thought that if we are ever going to rethink it from the ground up, this is the time. With that in mind, I made a list for myself of things I would like to achieve with a redesign:

1. Avoid redundant functionality.

2. Disallow combinations of functionality which cause too much work on the server.

3. Avoid letting parameters alter the format of the response. An example of this in the current api is showNumPages, which gives a number instead of the usual cdx lines. I think such cases should have their own path.

4. Try to implement the same set of functionality independently of the backing store. For example, the current api needs zipnum clusters to support paging.

5. Avoid hidden, server side limits which alter the result in any way other than returning partial results. It is ok to impose limits on the size of the returned response, but, for example, letting a count function give wrong results because it is too resource consuming to scan through enough data is not ok.

6. Let it be possible to use the Cdx Server as a general frontend to cdx-data. For tool creators it would be nice if it were not necessary to know whether cdx-data is stored in cdx-files, zipnum-clusters, bdb or other backing storage.

7. Allow for distributed processing and aggregation of results. For big deployments, it would be nice if you could deploy cdx servers for parts of the cdxes and do most of the processing on each part before delivering the result to an aggregating cdx server. This is for example not possible with the current definition of collapse.

8. Avoid keeping state on the server. If keeping state on the server is unavoidable, it should be hidden from the client. For example, if paging needs processing of more than a single page at a time, each page should have fully qualified urls to force the client to get the next page from the server which keeps the cached result of that page.

9. Illegal queries should be impossible. If some combinations of parameters are illegal, then it would be better if those were not possible. The amount of documentation would be reduced and the frameworks used for implementing the api could then return 400 Bad Request automatically.

10. Testability. The current implementation allows for a lot of combinations of parameters. It is a huge amount of work to create tests which cover all possibilities. By reducing the number of allowed combinations to those with an actual use case, it is easier to create the right set of tests.

11. Maintainability. This applies to code, for example trying to reduce the number of if statements to ease the reading of the data flow, but also to reducing the number of possible combinations of parameters to something we are able to guarantee will be supported in the future.

12. Using the HTTP protocol where possible. For example, removing the output parameter in the current api and using the Accept header instead (see the sketch after this list).
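
As a small illustration of point 12, content negotiation could replace the output parameter roughly like this (the endpoint is illustrative, not an existing one):

import urllib.request

# Ask for JSON via the Accept header instead of an output= query parameter.
req = urllib.request.Request(
    "http://cdxserver.example.com/cdx/main/search?url=example.com",
    headers={"Accept": "application/json"},
)
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))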

The above points are guidelines and maybe not possible to achieve in their entirety, but I think it is possible to come pretty close.

Based on this, I ended up proposing three different paths in the api. To be more RESTful we could have one entry point with references to the other paths. The path names could definitely be better and are certainly not finalized.

/search is for all functions which need to process cdx-data sequentially. It is assumed that all backing stores can do that pretty efficiently, either because the cdx-data is presorted, or because sorted indexes on the keys exist if the backing store is a database. This path allows the richest set of parameters, but does not allow sorting, which might require a full table scan, to use db terminology. For example, this allows collapse, which needs to compare cdx-lines in sequence, but not sort=closest, because that needs to go through all the data. Responses can be split into smaller parts with the use of a resumption key. The paging api is not supported because it might not be possible to deliver pages in parallel without also processing sibling pages. This path is typically used to support search and also to show aggregated results, e.g. a calendar in a replay engine.

/get is for getting the closest match for a url/timestamp pair. This implies sorting, which might be costly. The set of functionality is therefore somewhat limited. Ideally it should return only one result, but due to the nature of web archives it might be necessary to return a few results close to the best one. This path will not allow scanning through a lot of results, so no resume key is used. It is not allowed to use urls with wildcards, because of the potentially huge number of responses and because it is not well defined what a closest match to a fuzzy query is.

/bulk is primarily there to support number 6 above. I can't see any use for this in a replay engine like OpenWayback, but it might be really useful as a standardized api for browsing and processing cdx-data. One requirement is to allow parallel execution of batches. That makes it impossible to support sorting and comparing of captures without putting too much work on the server. The server should split the response into enough batches to support a reasonably big map-reduce environment, but each batch could further be limited into parts (which are requested in sequence for each batch) to overcome network limitations.

nlevitt commented 8 years ago

Quick comment on https://iipc.github.io/openwayback/api/cdxserver-api.html The example surts look like this: com,example,www)/foo.html

Not sure if that's meant to be normative or what. But imho it was an unfortunate choice for wayback to drop the trailing comma in the host part. In heritrix the surt would look like http://(com,example,www,)/foo.html. Stripping the "http://(" is sensible enough. But without the trailing comma, parent domains no longer sort next to their subdomains. That means it's not possible to retrieve results for a domain and its subdomains with a single query. Plus it's a needless divergence from other tools that use surts, like heritrix.

(It might be too late to change this. I think Ilya tried with pywb but it gave him too many headaches. It really bugs me though.)

ikreymer commented 8 years ago

@nlevitt Yeah, after some consideration, the change did not seem worth the extra effort, as we would have to support both with and without the comma. My reasoning was that between com,example,www) and com,example,www, there are only com,example,www* and com,example,www+, which are not valid domain names anyway. So a search with X >= com,example,www) and X < com,example,www- should still cover the domain and any subdomains.
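
A quick sketch of that ordering argument on a few made-up surt keys (plain string comparison stands in for the index's sort order):

# In ASCII, ')' < '*' < '+' < ',' < '-', so the half-open range
# ["com,example,www)", "com,example,www-") covers the host itself and all of
# its subdomains, even without Heritrix's trailing comma.
keys = [
    "com,example,www)/foo.html",      # www.example.com itself
    "com,example,www,img)/logo.png",  # a subdomain: img.www.example.com
    "com,example,wwwx)/other.html",   # a different host, should be excluded
]

lo, hi = "com,example,www)", "com,example,www-"
print(sorted(k for k in keys if lo <= k < hi))
# ['com,example,www)/foo.html', 'com,example,www,img)/logo.png']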

ikreymer commented 8 years ago

@johnerikhalse Thanks for the detailed analysis and testing of pagination, and for explaining the endpoints again. I understand your point but still disagree, especially with regard to /search and /get. I think something like a /get endpoint makes sense as an endpoint for retrieving an actual resource (from a WARC file or other source).

The cdx api is a low level, non-end-user api, so I think it's best to show what is actually going on. /get is still the equivalent of sort=closest&closest=... with other functionality disabled. (The sort=closest should be implied if closest= is specified, if that isn't happening already.)

I like the idea of returning a 400 for invalid options, so the following restrictions can be added:

I think you're right in that the sort option, which sorts by timestamp, only really makes sense for exact matches. It can be costly with a large result set, but can be quite fast if limit=1 is used.

I think that addresses the sorting issues.

Now, with pagination, unfortunately, it's a bit complicated. In addition to requiring zipnum, the pagination query can only be run on a single zipnum cluster at a time, or with multiple clusters if the split points are identical. IA uses several zipnum clusters and the results are merged. Thus, the showNumPages query and changing the page= value result in querying only a single zipnum cluster. What pagination does is provide random seek access within a set of cdx files, so a secondary index is necessary. If there are several indices and they don't match, only one can be used. This is the reason for the discrepancy you see. The pagination query on IA wayback queries less data than the 'regular' query. The clusters that are not queried are much smaller, so for bulk querying the discrepancy was not significant.

In an ideal world, the pagination API would support both regular cdxs and multiple zipnum clusters; however, this is a hard problem and (to my knowledge) no one is working on a solution. My recommendation would be to keep the pagination API in beta and put any cycles into solving the hard problem of supporting bulk querying across multiple cdx files and multiple zipnum clusters, if this is a priority.

While I can see the value of a separate /bulk endpoint, at this point, moving to a new API will only break existing client-side tools without doing much to address the current limitations of the pagination API.

The above suggested restriction on sort and the resumption key will prevent them from being used with the pagination API, and other options (such as collapse) can also be restricted, preventing any 'invalid' combinations. Again, I think returning 400 with a clear message is a great way to handle undesired param combinations.

johnerikhalse commented 8 years ago

@nlevitt

The example surts look like this: com,example,www)/foo.html

Not sure if that's meant to be normative or what. But imho it was an unfortunate choice for wayback to drop the trailing comma in the host part. In heritrix the surt would look like http://(com,example,www,)/foo.html.

No, it's not meant to be normative. I wasn't aware of the differences from Heritrix. Anyway, the Cdx Server only serves what's in the cdx files, so this is probably more of a cdx-format concern than a Cdx Server concern. Preferably the normalization of urls should be the same at harvest time as at search time, so I totally agree with your concern.

I don't know what the problem with changing this is. Is it hard to implement, or is it just that you need to regenerate the cdxes?

johnerikhalse commented 8 years ago

@ikreymer

The cdx api is a low level, non-end-user api, so I think it's best to show what is actually going on.

I don't think whether the api is low level or end user is important. What is important is whether the api is public or not. By public I mean that it is exposed to other uses than internal use in OpenWayback. You mention that other tools are using the api, which in my opinion makes it public. IMHO public apis have most of the same requirements as end user apis, even if they are meant for expert use. That includes clear definitions of both input and output, but also taking care not to unnecessarily break backwards compatibility, which seems to be your main concern. But if we are going to break compatibility, I think this is the right time.

I think something like a /get endpoint makes sense as an endpoint for retrieving an actual resource (from a WARC file or other source).

Sure, but that is high level. A service like that should use the Cdx Server /get behind the scenes. If thinking in terms of microservices, we need both. Actually, we could also have a low level resource fetching service which takes (in the case of warcs) a filename and offset as input and delivers the content. The high level service would then first use /get from the cdx server to get the filename and offset, and then request the actual content from the resource service. If we did this, I would also suggest taking a look at the cdx format. If we substituted the filename and offset fields with a more general resource location field with its own semantics (which could for example be warcfile:<filename>:<offset> in the case of warcs), then such a setup could support other kinds of storage as well. It would for example be possible to address content by WARC-Record-ID if the underlying storage supports it.

Now, with pagination, unfortunately, it's a bit complicated. In addition to requiring zipnum, the pagination query can only be run on a single zipnum cluster at a time, or with multiple clusters if the split points are identical. IA uses several zipnum clusters and the results are merged. Thus, the showNumPages query and changing the page= value result in querying only a single zipnum cluster. What pagination does is provide random seek access within a set of cdx files, so a secondary index is necessary. If there are several indices and they don't match, only one can be used. This is the reason for the discrepancy you see. The pagination query on IA wayback queries less data than the 'regular' query. The clusters that are not queried are much smaller, so for bulk querying the discrepancy was not significant.

I'm aware of this and think this alone justifies a separate endpoint. API consumers would expect the query parameter page= to just give a portion of the same result as without the parameter. If there's no way around giving a different result set with paging, then that should be clearly indicated by separating it from the non-paging api.

The reason I suggest dropping page= altogether and returning a list of urls for the batches is to open up for solving the discrepancy. The server could then, for example, return links that point to each index in turn so that all indexes are queried, but without the need to merge them. That's why I also suggest that the batches could overlap (see #309).

ibnesayeed commented 8 years ago

So, I would like to discuss another point here, which is more related to the response serialization than to the lookup process and filtering. A CDX index in general has (and should have) keys for lookup/filtering, data to locate the content, and some other metadata that may be useful for tools.

However, the method of locating data might differ depending on the data store, for example local warc files, data stored in isolated files (such as downloaded by wget [without the warc flag]), data stored in cloud storage services, and whatnot. Keeping this in mind, I think bringing some uniformity to how the data location is described should be considered. The CDX file format had pre-defined fields, but as we are moving away from that and adopting a more flexible serialization format such as CDXJ, we can play with our options. I would propose that multiple CDX fields that have no other purpose than to locate the data in combination, such as the file name, offset, and length, be consolidated in a URI scheme such as:

file:///host:/path/to/file.warc.gz#123-456

Note: the "file" URI scheme does not define the fragment part though. If needed, a URN can be used instead to offload the resolution of the location to a different level without polluting the standards. A URN can also be useful when the actual location is determined with the help of a path-index file.

This type of URI scheme based identification of the content will allow us to merge data from various sources in a uniform way in a single CDX response.

Additionally, the same data might be available in different locations (this is not something that was considered before, I guess), so the CDX server should provide a way to list more than one place where the data can be found. The client may choose which place it wants to grab the data from, or fall back to other resources if the primary one is not available.

To put this together, we can introduce a key in CDXJ named locators in the JSON block that holds an array of various locations. Here is an example:

surt_url_key 14_digit_datetime {"locators": ["file:///host:/path/to/file.warc.gz#123-456", "http://example.com/warc/segments/file", "s3://bucket/path/to/object", "urn:ipfs:header_digest/payload_digest"], "urir": "original uri", "status": 3-digit, "mime": "content-type", "more_attrs": "more values"}
ikreymer commented 8 years ago

@ibnesayeed While this may seem like an interesting idea at first glance, I don't think it provides much of a practical benefit.

A regular sort-merge will merge the same cdx lines together; there is no need for a custom merge for just the filename field. E.g., if querying multiple archives, one could get:

com,example)/ 2016 {"filename": "file://warc-file.warc.gz", "source": "archiveA"}
com,example)/ 2016 {"filename": "urn:ipfs:header_digest/payload_digest", "source": "archiveB"}
com,example)/ 2016 {"filename": "s3://bucket/path/to/object", "source": "archiveC"}

This allows for easily determining duplicate copies of the same url, from different sources.

If the intent is to load the best resource, the sources will be tried in succession (though the filenames may be in arbitrary order after the merge) until one succeeds, with fallback to the next resource, then the next-best match, etc.

No special case is needed here.

Also, traditionally, a separate data source (a path index or prefix list) has been used to resolve a WARC filename to an absolute path internal to a single archive. This has many benefits, including supporting changing WARC server locations, keeping the cdx small by avoiding absolute paths, and avoiding exposing private/internal paths. For example, an archive could be configured with 3 prefixes: /some/local/path/, http://internal-data-store1/, http://internal-data-store2/

An API responsible for doing the loading would be configured with these internal paths and would then check /some/local/path/mywarc.warc.gz, http://internal-data-store1/mywarc.warc.gz or http://internal-data-store2/mywarc.warc.gz until one succeeds.
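
A minimal sketch of that prefix-resolution idea, assuming the index stores only the bare WARC filename and the loader is configured with the prefixes above (the fetch callable is a placeholder):

PREFIXES = [
    "/some/local/path/",
    "http://internal-data-store1/",
    "http://internal-data-store2/",
]

def resolve(filename, fetch):
    """Try the filename against each configured prefix until one succeeds.

    'fetch' is any callable that retrieves a path/URL and raises OSError on failure.
    """
    for prefix in PREFIXES:
        try:
            return fetch(prefix + filename)
        except OSError:
            continue
    raise FileNotFoundError(filename)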

There is little benefit in baking these paths into an index response, as that would add extraneous data and potentially expose private paths that aren't accessible anyway.

ikreymer commented 8 years ago

In general, I think this approach is probably trying to solve the following problem: the need both to query an index and to load a single resource from the index. The /get endpoint mentioned above also attempts to solve this same problem.

I would like to suggest the following solution, consisting of just two API endpoints that are almost identical.

The resource API is currently outside the scope of the CDX server as it's currently defined, but I think it's relevant to think of the two together.

For example, if a client wants to examine the list of the 5 best matches for http://example.com/ at 20160101000000, the query would be:

/coll/index?url=http://example.com/&closest=20160101000000&limit=5

but if the user wants to load the first available resource of the 5 best matches, the query would be:

/coll/resource?url=http://example.com/&closest=20160101000000&limit=5

There is only one API to learn, and the end-user has full flexibility about what sort of data they want to get in a single call. Query features, such as filtering, are also available for both endpoints allowing for more sophisticated retrieval options.

Of course, the /resource endpoint will likely be implemented by querying /index internally or explicitly (this is up to the implementation), and the WARC absolute path resolvers can be defined as needed (also implementation dependent). The end-user of the API need not worry about these details!

tokee commented 8 years ago

"4. Get the best match for embedded resources" seems to be a scenario with multiple URIs and one shared timestamp, presumably the timestamp of the page referencing the resources. As I understand the current API, that requires 1 lookup/resource. That means 1 request-parse-serialize-response overhead per resource. If the API allowed looking up multiple URIs with the same as-near-to-this-as-possible-timestamp, that overhead could be reduced considerably.

tokee commented 7 years ago

My colleague @thomasegense is experimenting with a lightweight page renderer for WARC-based web archives. Instead of doing URL+timestamp -> WARC-name+offset CDX-like lookups at page resource resolve time, he does it up front by rewriting the URLs in the HTML. This render system makes it possible and preferable to do the batch lookups I asked for in the previous comment, as local experiments (unsurprisingly) show that batch processing of lookups is significantly faster than individual requests.