Proposal for new record field: WARC-Json-Metadata

iipc / warc-specifications

Centralised repository for WARC usage specifications.

http://iipc.github.io/warc-specifications/


Proposal for new record field: WARC-Json-Metadata #27

Open ikreymer opened 8 years ago

ikreymer commented 8 years ago

The WARC-Json-Metadata field can be added to any record and contains an arbitrary JSON dictionary {}. The JSON must be on a single line; no newlines are allowed.

This field allows users to add any metadata as necessary under a single WARC field. The contents of the metadata field will vary by use case, but this at least allows for a common place to store such extra metadata.

Ex:

WARC-Json-Metadata:  {"snapshot": "html", "collection": "..."}

Indexing tools which support a variable number of fields (such as CDXJ, Solr schemas, etc.) can choose to add the metadata fields to the index entry.
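A minimal Python sketch of writing and reading such a header, assuming the single-line JSON dictionary rule described above. The helper names `format_metadata_header` and `parse_metadata_header` are hypothetical and not part of any existing WARC library.

```python
import json

FIELD = "WARC-Json-Metadata"  # field name as proposed in this issue


def format_metadata_header(metadata):
    """Serialize a dict as a single-line WARC-Json-Metadata header."""
    if not isinstance(metadata, dict):
        raise TypeError("metadata must be a JSON dictionary")
    # json.dumps without indent emits no raw newlines; newlines inside
    # string values are escaped as \n, so the header stays on one line.
    return f"{FIELD}: {json.dumps(metadata, sort_keys=True)}"


def parse_metadata_header(line):
    """Parse one header line; return the dict, or None for other fields."""
    name, sep, value = line.partition(":")
    if not sep or name.strip().lower() != FIELD.lower():
        return None
    parsed = json.loads(value.strip())
    if not isinstance(parsed, dict):
        raise ValueError(f"{FIELD} must hold a JSON dictionary")
    return parsed
```

An indexer could call `parse_metadata_header` on each header line and merge any returned dictionary into its index entry.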

ikreymer commented 8 years ago

Current specific use-case

https://webrecorder.io/ adds this type of record for "static snapshots", which are created from the current HTML of the page.

The snapshot record (see #5, an alternative way to handle #13) contains:

WARC-Type: resource
...
WARC-Json-Metadata: {"snapshot": "html"}

The cdx-indexer tool in pywb, when creating CDXJ entries, adds the contents of the header under a metadata field, which contains the JSON dictionary. (This is still experimental.)
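To illustrate the shape of such an entry, here is a rough Python sketch that nests the header's JSON under a `metadata` key of a CDXJ-style line (key, timestamp, JSON object). This is not pywb's actual implementation; the function name `cdxj_line` and the sample field names are assumptions made for the example.

```python
import json


def cdxj_line(urlkey, timestamp, fields, header_value=None):
    """Build one CDXJ-style line: '<urlkey> <timestamp> <json>'.

    header_value is the raw WARC-Json-Metadata value, if the record had one.
    """
    entry = dict(fields)  # copy so the caller's dict is not modified
    if header_value is not None:
        entry["metadata"] = json.loads(header_value)
    return f"{urlkey} {timestamp} {json.dumps(entry, sort_keys=True)}"


line = cdxj_line(
    "com,example)/",
    "20160101000000",
    {"url": "http://example.com/", "mime": "text/html"},
    '{"snapshot": "html"}',
)
```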

nclarkekb commented 8 years ago

One or more of the RFCs referred to by the WARC standard encourage the use of headers that are not too long. Also, LWS (linear white space) folding should be supported by WARC readers/writers, so that long headers can be split across many lines, each continuation line starting with white space.

Opening up for a json header that can potentially have a big string leaves WARC parsers with a potential memory problem when lines can be huge.
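Both concerns can be sketched in a few lines of Python: unfolding continuation lines that start with white space, and rejecting header values beyond a size limit. The limit of 8192 bytes and the name `read_headers` are assumptions for illustration. A real reader would also cap the bytes read per line at the stream level, which this in-memory sketch does not do.

```python
MAX_HEADER_BYTES = 8192  # hypothetical limit, not taken from any standard


def read_headers(lines, max_bytes=MAX_HEADER_BYTES):
    """Unfold continuation lines and reject oversized header values."""
    headers = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            break  # a blank line ends the header block
        if line[0] in " \t":
            # Continuation line: append to the previous header's value.
            if not headers:
                raise ValueError("continuation line before any header")
            name, value = headers[-1]
            headers[-1] = (name, value + " " + line.strip())
        else:
            name, sep, value = line.partition(":")
            if not sep:
                raise ValueError(f"malformed header line: {line!r}")
            headers.append((name.strip(), value.strip()))
        if len(headers[-1][1].encode("utf-8")) > max_bytes:
            raise ValueError(f"header {headers[-1][0]} exceeds {max_bytes} bytes")
    return headers
```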

kris-sigur commented 8 years ago

I agree with @nclarkekb

I'd also like to add that having a header field whose content is unspecified (the formatting is specified but not the actual contents) is counter-productive. If implemented it will inevitably lead to differing uses making the field impossible to handle without understanding what the provenance of the WARC is. This cuts directly against the whole point of standardization.

The WARC standard already allows for blobs of unspecified metadata in metadata records. I'll concede that there may be performance benefits to being able to cram the data into the WARC response record's header, but that seems an insufficient reason to introduce such a 'hack' into the standard.

If there are specific items of data that really, really need to be in the WARC header, I'd strongly prefer that each be assigned its own, well described, WARC header field. It may be a little more work to enumerate and describe them, but in the end you at least get something that is actually standardized.

ikreymer commented 8 years ago

I understand the concerns. I wanted to suggest one header instead of many, as I thought that would be easier to support for standardization. The exact semantics are still experimental, which is why I was proposing a catch-all WARC-Json-Metadata field for now.

The idea is to include a 'small amount' of data in such a metadata field, something that does not make sense as a full metadata record.

Some of the use cases I have are as follows:

Arbitrary 'tagging' of WARC records, very similar to GitHub issue labels, that could be added to any WARC record. E.g.: WARC-Json-Metadata: {"tags": ["news", "foo", "bar"]}. This could be a separate WARC header, WARC-Tags. Tags could also have values, giving name-value pairs.
Indicating that a resource is a static snapshot. This was originally mentioned in #5: basically a way to indicate the 'provenance' of a client-side resource, e.g. it was an HTML snapshot, captured using user-agent X. For now I was thinking of this as a special case of the tag use case, but maybe it makes sense to have a separate field, WARC-Client-Provenance?
Page list, listing all the 'pages' that a user has recorded. For webrecorder, this would be those URLs which the user visited explicitly (that are not embeds), which can be arbitrary. This is usually a per-WARC setting though, so perhaps a metadata record is more appropriate.

Even if this does not make sense to add to a standard, and if the standard will clarify that custom WARC fields are supported/encouraged, perhaps there should be a centralized registry of WARC fields and their usage, so that everyone is aware of what custom fields are being used and how.

nlevitt commented 8 years ago

It seems to me that the warc metadata record (concurrent-to another record) has proven to be impractical. I don't know of playback software that consults it, in spite of the fact that crawlers have been writing it for years. I think that's mainly because it is cumbersome to find and load.

So for adding metadata to a warc record, I suppose it comes down to an aesthetic choice. Do we prefer one open-ended json metadata header, or a multitude of custom WARC-* headers? I would lean toward the former. It seems cleaner to me, and lighter weight for a developer working on some new feature. (If I were in that situation, I could see myself reinventing WARC-Json-Metadata rather than adding a narrowly scoped custom header.)

nclarkekb commented 8 years ago

I do not like the idea of open ended headers that can be potentially large. Especially when there is no length info available at all.

(We use the metadata record locally but only for long term preservation so I get that it is impractical for normal use.)

A solution where the length would be (partially) known and the content was not locked to json would be to just use the multipart content encoding content type just like HTTP POST forms and email content where different information needs to be encapsulated in a single record. You could even suggest the first part should be an index of included parts.

This would be flexible to include as many json or other parts as you like and also the original content.

The cost would be slightly more complex WARC readers/writers. But you could skip content you want to ignore and only extract the part you want.

Multipart content encoding could potentially also omit the content-length, since it can be deduced by reading and skipping the different record parts. And supporting trailing headers like HTTP (mostly unused) could allow for writing the digests without having to calculate them beforehand.

Just thinking aloud here though... (Come to think of it, I mixed up multipart and chunked content a bit concerning knowing the length.)
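A rough sketch of the multipart idea, using Python's standard `email` package to build a MIME multipart block with a JSON part followed by the original payload, and then to pull out only the JSON part. The function names are hypothetical, and no WARC tool is known from this thread to store content this way. Note that Python's MIME classes base64-encode the parts by default, which a WARC-specific design would likely avoid.

```python
import json
from email import message_from_bytes
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart


def build_multipart_block(metadata, payload):
    """Wrap a JSON metadata part and the original payload bytes in one block."""
    container = MIMEMultipart("mixed")
    json_bytes = json.dumps(metadata).encode("utf-8")
    container.attach(MIMEApplication(json_bytes, "json"))
    container.attach(MIMEApplication(payload, "octet-stream"))
    return container.as_bytes()


def extract_json_part(block):
    """Return the first application/json part as a dict, skipping other parts."""
    message = message_from_bytes(block)
    for part in message.get_payload():
        if part.get_content_type() == "application/json":
            return json.loads(part.get_payload(decode=True))
    return None
```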

ikreymer commented 8 years ago

A solution where the length would be (partially) known and the content was not locked to json would be to just use the multipart content encoding content type just like HTTP POST forms and email content where different information needs to be encapsulated in a single record. You could even suggest the first part should be an index of included parts.

That seems entirely too complicated for adding a few pieces of metadata. It sounds like this would still require a separate metadata record, which, as @nlevitt mentioned, has proven impractical for indexing.

This is just for adding a small amount of metadata, like {"snapshot": "html", "coll": "name", "foo": "bar"}

Another important reason for all this, is that with the CDXJ format being discussed (https://github.com/oduwsdl/ORS/wiki/CDXJ) this would allow for easily inserting the metadata json into the CDXJ json, thus allowing for arbitrary data to be added to the replay index, and surfacing these fields for replay.

I'm not sure I understand the concern with field length... We could specify a max field length, and/or support multiline headers with leading whitespace (though these are rarely used), but I'm not sure that this is really an issue. These days, headers can already get rather large, and most tools have a way of adjusting the header size limit if needed. For example, the Link header returned from a memento-enabled wayback tends to get rather long (when the same URL is repeated for the first, last, next, prev and closest mementos) and it seems no one has complained.

ato commented 8 years ago

I agree that it can be more practical to include custom metadata in the response/resource record headers instead of a separate metadata record so that it can be retrieved without an index.

I also agree with @kris-sigur that just standardising the name of a completely semantically open field seems pointless as the headers are already extensible. If a tool needs to be able to process non-standard metadata it may as well fetch it from a non-standard header field. Then you can name the field something like "Pywb-Json-Metadata" so that a tool processing it can distinguish it from "Niftycrawler9000-Json-Metadata" and know that the "snapshot" JSON field refers to #5 rather than to say a crawl checkpoint.

An argument could be made for standardising the name of a single JSON field so that indexing tools know to copy that particular metadata into their index. I would be more inclined to support a field like this if such handling was specified in detail and if the JSON fields were registered somewhere (a wiki page would do). Personally however I would want to configure which fields are indexed to minimize the size of the index and if that is configurable it's just as straightforward to configure which custom header field to read it from.
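The configurability point can be sketched in Python as follows, assuming record headers are available as a plain dict. The function name `index_fields` and its parameters are hypothetical. It shows that once the set of indexed fields is configurable, making the source header name configurable costs one more parameter.

```python
import json


def index_fields(headers, json_header="WARC-Json-Metadata", keep=None):
    """Pick the JSON fields to copy into an index entry.

    json_header: which header to read (standard or tool-specific).
    keep: optional set of JSON keys to index; None means index everything.
    """
    value = headers.get(json_header)
    if value is None:
        return {}
    data = json.loads(value)
    if keep is None:
        return data
    return {key: val for key, val in data.items() if key in keep}


headers = {
    "WARC-Type": "resource",
    "Pywb-Json-Metadata": '{"snapshot": "html", "coll": "name"}',
}
selected = index_fields(headers, json_header="Pywb-Json-Metadata", keep={"snapshot"})
```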

ikreymer commented 8 years ago

All good points. Thinking about it more, yes, the semantic significance of this field is to indicate which data should be included by indexing tools and (hopefully) surfaced during replay.

For example, a field might be called WARC-Replay-Metadata and by putting data in there, the creator of the WARC indicates that this data is significant for replay and should be surfaced to the user, if possible. I think this would guide what sort of data should be included: information about provenance: static snapshot vs interactive page, what collection or which user created the page should be included. Crawling checkpoint data probably should not, and it can go into its own custom field.

Tools that index to formats supporting custom fields (CDXJ, Solr, etc.) would then index these fields by default. Of course, you can still have custom options in your indexer that include other fields, but this is just about starting with sensible defaults for what should be indexed. When a user runs an indexing tool, they should expect to see the contents of this standard header included, the same way that content type and status code are included in the classical CDX now.

We can more concretely enumerate the data that should go there, and it will be up to individual replay tools to support various fields, but this is just an attempt to start somewhere.

kris-sigur commented 8 years ago

Whether it is one or many fields, the important part is that any field included in the standard be well defined. There is nothing wrong with it being multiple fields if there is a need for them.

That said, I still don't find this ideal as a fix for the low discoverability of the metadata records. I think it is important to be able to include related records and have them accessible at replay time. Perhaps the real solution here (or part of the solution) would be to address this issue directly instead of trying to avoid it.

ato commented 8 years ago

Here are a few possible solutions to the problem of discovering related records.

Status quo (at least in the implementations I'm aware of)

Build Wayback-style indexes mapping (uri, date) -> (file, position) for response, resource and (sometimes) revisit records. Request, metadata and conversion records are ignored.
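As a point of reference, the status quo can be sketched in Python like this, assuming each record has already been reduced to a dict with hypothetical keys `type`, `uri`, `date`, `file` and `position`.

```python
INDEXED_TYPES = {"response", "resource", "revisit"}


def build_wayback_index(records):
    """Map (uri, date) -> (file, position), ignoring other record types."""
    index = {}
    for rec in records:
        if rec["type"] not in INDEXED_TYPES:
            continue  # request, metadata and conversion records are skipped
        index[(rec["uri"], rec["date"])] = (rec["file"], rec["position"])
    return index
```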

Solution 1. Three indexes built in a single pass.

If we were to instead key on (uri, date, record-type) we could easily locate the request, response, resource and conversion records. But not necessarily their associated metadata records as WARC-Target-URI is optional for metadata records.

In order to find the metadata record for a given response record you would need a second index mapping (refers-to or concurrent-to) -> (file, position).

In order to find the referred-to record given a metadata record you would need a third index mapping (record-id) -> (file, position).
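A minimal Python sketch of Solution 1, building all three indexes in one pass. It assumes records are dicts with hypothetical keys `id`, `type`, `file`, `position`, and optional `uri`, `date`, `refers_to` and `concurrent_to`.

```python
def build_indexes(records):
    """Build the three indexes of Solution 1 in a single pass."""
    by_capture = {}    # (uri, date, record-type) -> (file, position)
    by_reference = {}  # referenced record id -> [(file, position), ...]
    by_id = {}         # record id -> (file, position)
    for rec in records:
        location = (rec["file"], rec["position"])
        by_id[rec["id"]] = location
        if rec.get("uri"):
            # WARC-Target-URI is optional for metadata records.
            by_capture[(rec["uri"], rec["date"], rec["type"])] = location
        for key in ("refers_to", "concurrent_to"):
            if rec.get(key):
                by_reference.setdefault(rec[key], []).append(location)
    return by_capture, by_reference, by_id
```

Finding the metadata for a response then takes two steps: read the response record to get its ID, then look that ID up in `by_reference`.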

Solution 2. Two indexes built in two passes.

Pass 1. Build a (record-id) -> (file, position) index.
Pass 2. Build a (uri, date, record-type) -> (record-id) index. When encountering metadata records, look up the refers-to or concurrent-to record IDs in the record-id index to obtain the URI and date for the index key. Use something specific like "request-metadata" or "response-metadata" as the record-type part of the index key.
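The two passes might look roughly like this in Python, under the same assumed record dicts (hypothetical keys `id`, `type`, `file`, `position`, optional `uri`, `date`, `refers_to`, `concurrent_to`). The `at_location` dict is a stand-in for seeking into a WARC file and reading the referenced record's headers.

```python
def build_two_pass(records):
    """Build the two indexes of Solution 2; records must be re-iterable."""
    # Pass 1: record-id -> (file, position)
    by_id = {rec["id"]: (rec["file"], rec["position"]) for rec in records}
    # Stand-in for reading a record back from (file, position).
    at_location = {(rec["file"], rec["position"]): rec for rec in records}

    # Pass 2: (uri, date, record-type) -> record-id
    by_capture = {}
    for rec in records:
        if rec["type"] != "metadata":
            if rec.get("uri"):
                by_capture[(rec["uri"], rec["date"], rec["type"])] = rec["id"]
            continue
        target_id = rec.get("refers_to") or rec.get("concurrent_to")
        if target_id not in by_id:
            continue  # referenced record is not in this set of files
        target = at_location[by_id[target_id]]
        key = (target["uri"], target["date"], target["type"] + "-metadata")
        by_capture[key] = rec["id"]
    return by_id, by_capture
```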

Solution 3. One index built in a single pass. Extended WARC format.

Ensure:

When writing a concurrent-to style metadata record at crawl time always populate WARC-Target-URI.
For refers-to style metadata records created at a later date, extend the WARC-Refers-To-Target-URI and WARC-Refers-To-Date fields from the Recording Arbitrary Duplicates proposal to metadata records.

That would make all the metadata records discoverable through the one index query, although you'd still have to read them all to work out which metadata belongs to which record (response metadata vs conversion metadata), which suggests either also adding a WARC-Refers-To-Type or doing a second pass.
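Under Solution 3 the index key can be derived from each record's own headers, as in this Python sketch. `WARC-Refers-To-Type` is the field only suggested above and does not exist in the standard; the "-metadata" suffix borrows the naming from Solution 2 and is likewise an assumption.

```python
def index_key(headers):
    """Derive a (uri, date, record-type) key from one record's headers."""
    record_type = headers["WARC-Type"]
    if record_type != "metadata":
        return (headers["WARC-Target-URI"], headers["WARC-Date"], record_type)
    # Prefer the refers-to fields (later-created metadata), else the
    # record's own target URI and date (concurrent-to metadata).
    uri = headers.get("WARC-Refers-To-Target-URI") or headers.get("WARC-Target-URI")
    date = headers.get("WARC-Refers-To-Date") or headers.get("WARC-Date")
    if uri is None:
        return None  # not discoverable through the URI index
    target_type = headers.get("WARC-Refers-To-Type")  # proposed, hypothetical
    label = f"{target_type}-metadata" if target_type else "metadata"
    return (uri, date, label)
```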

Solution 4. Colocate related records. No index, no queries but strong constraints on the file structure.

Various ways of doing this:

Embed the metadata in the WARC header
Multipart content block
Ensure that concurrent-to records are adjacent in the file
Byte offset pointers within the file

Given one record you can discover the others without an external index. You lose the ability to add extra linked records at a later time and to store related records in different files.
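For the adjacency variant of Solution 4, discovery reduces to scanning forward from the main record, as in this Python sketch. It assumes records are listed in file order as dicts with hypothetical keys `id` and `concurrent_to`, and that related records directly follow the record they belong to.

```python
def adjacent_related(records, start):
    """Return the records right after records[start] that are concurrent to it."""
    main = records[start]
    related = []
    for rec in records[start + 1:]:
        if rec.get("concurrent_to") != main["id"]:
            break  # first unrelated record ends the group
        related.append(rec)
    return related
```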

saraaubry commented 8 years ago

Following IIPC members recommendations and discussions during the ISO working group meeting on November 16-17, 2015: the topic is not mature enough, so the issue is out of the 1.1 revision.