Open ikreymer opened 9 years ago
The JSON fields of this format differ from CDX because each entry refers to a block rather than to a single capture, so it would contain a filename (currently called part), an offset, a length, and optionally a lineno.
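As a rough illustration of the fields listed above, here is a minimal sketch that turns one tab-delimited secondary index line into a line with a JSON value. The column order (key, part, offset, length, optional lineno), the function name `idx_line_to_json_line`, and the sample line are all assumptions made for the example and are not taken from any spec.

```python
import json


def idx_line_to_json_line(line):
    """Convert one tab-delimited index line into 'key {json}' form.

    Assumed column order (hypothetical): key, part, offset, length,
    and an optional trailing lineno.
    """
    columns = line.rstrip("\n").split("\t")
    if len(columns) not in (4, 5):
        raise ValueError("expected 4 or 5 tab-separated columns")

    key, part, offset, length = columns[:4]
    value = {"part": part, "offset": int(offset), "length": int(length)}

    # lineno is optional, so only emit it when the column is present.
    if len(columns) == 5:
        value["lineno"] = int(columns[4])

    return key + " " + json.dumps(value)


# Made-up sample line, for illustration only.
sample_idx = "edu,example)/ 20150127000000\tcdx-00000.gz\t0\t188224\t1"
converted = idx_line_to_json_line(sample_idx)
print(converted)
```

The key portion is passed through untouched, so the sort order of the file is preserved.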
I suppose if CDXJ requires only offset, length and filename, then this can be considered CDXJ.
Ah, I see now that the current proposal places no restriction on the fields.
It seems like a custom @context definition?
@context ["<ZipNum Secondary Index>"]
@keys ["surt_uri", "timestamp"]
Might it be useful to optionally define the fields that are present in the JSON dict, for example to indicate whether certain optional fields are present?
Thanks for bringing this up. In my opinion, there is no reason to give it another name. It fits perfectly within the existing scope of CDXJ. Unlike CDX, CDXJ does not have predefined fields, and that is where the flexibility of the format comes from. The key portion can have any number of keys as long as the number of key prefixes is the same for every entry in a file (missing fields will have a placeholder dash, -, like CDX). The key fields are described in the meta section under the @keys entry. The value portion can have any number of arbitrary properties; the format itself does not impose any restriction on that and leaves it to the application.
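A minimal sketch of how an entry with this layout might be read, assuming the key fields are separated by single spaces and followed by one JSON object. The function name `parse_cdxj_line` and the sample entry are hypothetical and only illustrate the description above.

```python
import json


def parse_cdxj_line(line, keys):
    """Split one entry into its key fields and its JSON value.

    `keys` is the list declared under @keys in the meta section.
    A lone dash is treated as the placeholder for a missing field.
    """
    # Split only len(keys) times so spaces inside the JSON stay intact.
    parts = line.rstrip("\n").split(" ", len(keys))
    if len(parts) != len(keys) + 1:
        raise ValueError("entry does not match the declared @keys")

    *key_values, payload = parts
    key_fields = {
        name: (None if value == "-" else value)
        for name, value in zip(keys, key_values)
    }
    return key_fields, json.loads(payload)


# Made-up sample entry, for illustration only.
entry_keys = ["surt_uri", "timestamp"]
sample = 'edu,example)/ - {"part": "cdx-00000.gz", "offset": 0, "length": 188224}'
fields, value = parse_cdxj_line(sample, entry_keys)
print(fields, value)
```

Because the parser takes the key names as input, the same code can read a capture index and a secondary block index; only the @keys list and the JSON properties differ.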
Yes, the @context should define the keywords used in the JSON dict and optionally also introduce validation rules if any.
Would @context be like a schema or something more informal?
The idea of @context and @id is inspired by JSON-LD.
In addition to CDXJ, the ZipNum format uses a secondary index, which also includes a sortable url key but contains other data in the JSON dictionary.
The original format of this is tab-delimited (to differentiate it from space-delimited CDX), but it should be made conformant to the ORS spec.
An example of the raw IDX: http://index.commoncrawl.org/CC-MAIN-2015-06-index?url=*.edu&showPagedIndex=true
An equivalent JSON from http://index.commoncrawl.org/CC-MAIN-2015-06-index?url=*.edu%2F*&showPagedIndex=true&output=json
The IDXJ/CDXJ/ORS equivalent would be:
Any thoughts on what this should be called, e.g. does this still fall within CDXJ, or is it under ORS/IDXJ?
I think this will help standardize the CDX Server API a bit more.