oduwsdl / ipwb

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
MIT License

Show and make accessible for use recent CDXJ files in the replay system web ui #82

Open machawk1 opened 7 years ago

machawk1 commented 7 years ago

The idea behind this is to allow fast switching between "data sets" and to lay the foundation for "merging" CDXJ files. As another use case, if we ever provide the ability to extract only the CDXJ lines relevant to some user-provided parameter (e.g., only .co.uk URI-Rs), this would reduce the time the replay system needs to "find" the relevant URI-R/M at query time.
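As a rough illustration of the extraction use case, here is a minimal sketch that keeps only the CDXJ lines whose SURT key begins with a given prefix. It assumes each record line starts with a SURT-form URI-R (so .co.uk hosts begin with "uk,co,") and that metadata lines start with "!". The function name and the sample records are hypothetical and are not part of ipwb.

```python
def filter_cdxj_by_surt_prefix(lines, surt_prefix):
    """Yield CDXJ record lines whose SURT key starts with surt_prefix."""
    for line in lines:
        # Skip blank lines and metadata lines (assumed to begin with "!").
        if not line.strip() or line.startswith("!"):
            continue
        if line.startswith(surt_prefix):
            yield line


# Hypothetical sample records; the JSON payloads are placeholders.
sample = [
    '!context ["http://tools.ietf.org/html/rfc7089"]\n',
    'com,example)/ 20160305192247 {"status_code": "200"}\n',
    'uk,co,bbc)/news 20170101000000 {"status_code": "200"}\n',
]

uk_only = list(filter_cdxj_by_surt_prefix(sample, "uk,co,"))
```

Because the records are keyed by SURT, a smaller pre-filtered index like this leaves fewer lines to search at query time.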

@ibnesayeed Thoughts on procedures and dynamics of cdxj merging? Your work on archive profiling seems relevant here.

ibnesayeed commented 7 years ago

I am not sure I understood this completely, but in general I don't like the idea of letting the client select the CDXJ files for replay. CDXJ is nothing but an index that can be created incrementally, and the number of such files can be as few as one or so many that it is practically impossible for a human to deal with them on the client side. Merging CDXJ files is a trivial task and there is no magic involved in that. A better idea, in my opinion, is to provide an administrative interface to manage collections (namespaces) and associate one or more CDXJ files with each collection; users can then select those named collections without worrying about the underlying details.
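To make the "trivial merge" point concrete, here is a minimal sketch, assuming every input CDXJ file is already sorted line by line (as an index normally is). It uses the standard library's heapq.merge to interleave the inputs and drops exact duplicate lines. The function name merge_cdxj and the sample records are hypothetical.

```python
import heapq


def merge_cdxj(*sorted_sources):
    """Merge already-sorted iterables of CDXJ lines, dropping exact duplicates."""
    previous = None
    # heapq.merge lazily interleaves sorted inputs without loading them fully.
    for line in heapq.merge(*sorted_sources):
        if line != previous:
            yield line
        previous = line


# Hypothetical sample records with placeholder JSON payloads.
index_a = ["com,a)/ 20160101000000 {}\n", "com,c)/ 20160101000000 {}\n"]
index_b = ["com,a)/ 20160101000000 {}\n", "com,b)/ 20170101000000 {}\n"]

merged = list(merge_cdxj(index_a, index_b))
```

The same function accepts open file objects, since iterating over a text file yields its lines in order.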

machawk1 commented 7 years ago

@ibnesayeed That's a good idea. I would like to see the association of archival collection-->set of index files as well in the future. An admin interface would be one way to accomplish this.

> Merging CDXJ files is a trivial task and there is no magic involved in that.

I am aware of this but the current ipwb implementation allows the use of one cdxj file at a time, which seems very limiting. Manipulation of ipwb-compatible cdxj files might be the job of a separate (sub-)tool.

ibnesayeed commented 7 years ago

The current implementation allows only one CDXJ file because of the nature of the hackathon in which we developed it. Extending it to iterate over a list of CDXJ files would not be difficult, but before that we need collection namespacing in place.

machawk1 commented 7 years ago

@ibnesayeed How do you imagine this list of CDXJ files within a collection should be specified by the user?

ibnesayeed commented 7 years ago

There are really many ways to implement this. Some would be more flexible and customizable than others, but might require more components such as some sort of database. One simple approach would be to introduce a convention and utilize the structure of the file system itself.

/ipwb/collections/
├── bar/
│   ├── bar-1.cdxj
│   ├── bar-2.cdxj
│   └── bar-3.cdxj
├── foo/
│   ├── baz/
│   │   └── foo-baz-1.cdxj
│   ├── blah/
│   │   ├── foo-blah-1.cdxj
│   │   ├── foo-blah-2.cdxj
│   │   └── metadata.yaml
│   ├── foo-1.cdxj
│   └── metadata.yaml
└── metadata.yaml

Look at the above directory and file organization. With this in place, if the replay server is invoked with the following command:

$ ipwb replay /ipwb/collections/

The server should recursively read all the CDXJ files that fall under the selected collection namespace. For example, when collection bar is requested, it should look up records using bar-1.cdxj, bar-2.cdxj, and bar-3.cdxj. When foo is requested, it should look up records using foo-1.cdxj, foo-baz-1.cdxj, foo-blah-1.cdxj, and foo-blah-2.cdxj. However, when the collection foo/blah is requested, it should only look up records using foo-blah-1.cdxj and foo-blah-2.cdxj. If no collection is specified, it will read all the CDXJ files under /ipwb/collections/ recursively. Additionally, each collection directory (at any nesting level) can contain an optional metadata.yaml file that customizes various properties of the collection, such as a more human-friendly form of the collection name, a description of the collection, and inclusion/exclusion patterns that override the default behavior of the replay system for that collection namespace. PyWB does similar collection management, but I am not sure whether it supports recursive sub-collections or only a flat list of collections.
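The lookup rule described above could be sketched roughly as follows. This is only an illustration of the proposed convention, not existing ipwb behavior: the function name cdxj_files_for is hypothetical, and the demo recreates just part of the example tree in a temporary directory.

```python
import tempfile
from pathlib import Path


def cdxj_files_for(root, collection=""):
    """Return every CDXJ file under a collection, searching sub-collections too."""
    base = Path(root) / collection
    if not base.is_dir():
        raise FileNotFoundError(f"No such collection: {collection!r}")
    # rglob descends into nested directories, so sub-collections are included.
    return sorted(path for path in base.rglob("*.cdxj") if path.is_file())


# Demo: build a reduced version of the layout in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    for rel in [
        "bar/bar-1.cdxj",
        "foo/foo-1.cdxj",
        "foo/baz/foo-baz-1.cdxj",
        "foo/blah/foo-blah-1.cdxj",
        "foo/blah/metadata.yaml",
    ]:
        path = Path(root, rel)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()

    foo = [p.name for p in cdxj_files_for(root, "foo")]
    foo_blah = [p.name for p in cdxj_files_for(root, "foo/blah")]
    everything = [p.name for p in cdxj_files_for(root)]
```

An empty collection name falls back to the root, which matches the "no collection specified" case. Reading metadata.yaml is left out here because it would need a YAML parser outside the standard library.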

machawk1 commented 7 years ago

@ibnesayeed Good stuff. We can use this as the basis of introducing the collection concept into ipwb at some point.