oduwsdl / ORS

Object Resource Stream and CDXJ Drafts

ZipNum Secondary Index use case #1

Open ikreymer opened 8 years ago

ikreymer commented 8 years ago

In addition to CDXJ, the ZipNum format uses a secondary index, which also includes a sortable url key but contains other data in the JSON dictionary.

The original format of this is tab-delimited (to differentiate it from space-delimited CDX), but it should be made conformant to the ORS spec.

An example of the raw IDX: http://index.commoncrawl.org/CC-MAIN-2015-06-index?url=*.edu&showPagedIndex=true

ec,opcionempleo)/ofertas-empleo-ecuador-114371.html 20150127010331  cdx-00229.gz    23024741    217405  466792
edu,aacc)/business/tlcs_smart.cfm 20150128161301    cdx-00229.gz    23242146    143586  466793
edu,aacc)/newsonline/2012/04/release12216.cfm 20150131004647    cdx-00229.gz    23385732    159461  466794
edu,aacc)/search/course/crs_desc.cfm?courseid=49642 20150130183246  cdx-00229.gz    23545193    186929  466795
edu,aamu)/administrativeoffices/business-and-finance/pages/default.aspx 20150201003049  cdx-00229.gz    23732122    208546  466796

An equivalent JSON from http://index.commoncrawl.org/CC-MAIN-2015-06-index?url=*.edu%2F*&showPagedIndex=true&output=json

{"urlkey": "ec,opcionempleo)/ofertas-empleo-ecuador-114371.html 20150127010331", "part": "cdx-00229.gz", "offset": 23024741, "length": 217405, "lineno": 466792}
{"urlkey": "edu,aacc)/business/tlcs_smart.cfm 20150128161301", "part": "cdx-00229.gz", "offset": 23242146, "length": 143586, "lineno": 466793}
{"urlkey": "edu,aacc)/newsonline/2012/04/release12216.cfm 20150131004647", "part": "cdx-00229.gz", "offset": 23385732, "length": 159461, "lineno": 466794}
{"urlkey": "edu,aacc)/search/course/crs_desc.cfm?courseid=49642 20150130183246", "part": "cdx-00229.gz", "offset": 23545193, "length": 186929, "lineno": 466795}
{"urlkey": "edu,aamu)/administrativeoffices/business-and-finance/pages/default.aspx 20150201003049", "part": "cdx-00229.gz", "offset": 23732122, "length": 208546, "lineno": 466796}

The IDXJ/CDXJ/ORS equivalent would be:

ec,opcionempleo)/ofertas-empleo-ecuador-114371.html 20150127010331 {"part": "cdx-00229.gz", "offset": 23024741, "length": 217405, "lineno": 466792}
edu,aacc)/business/tlcs_smart.cfm 20150128161301 {"part": "cdx-00229.gz", "offset": 23242146, "length": 143586, "lineno": 466793}
edu,aacc)/newsonline/2012/04/release12216.cfm 20150131004647 {"part": "cdx-00229.gz", "offset": 23385732, "length": 159461, "lineno": 466794}
edu,aacc)/search/course/crs_desc.cfm?courseid=49642 20150130183246 {"part": "cdx-00229.gz", "offset": 23545193, "length": 186929, "lineno": 466795}
edu,aamu)/administrativeoffices/business-and-finance/pages/default.aspx 20150201003049 {"part": "cdx-00229.gz", "offset": 23732122, "length": 208546, "lineno": 466796}
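
As a rough illustration of the mapping from the raw IDX lines to the proposed form, here is a minimal Python sketch. It assumes the raw lines are tab-delimited as described above, with the fields in the order shown (key, part, offset, length, optional lineno). The function name idx_line_to_cdxj is made up for this example and is not part of any existing tool.

```python
import json


def idx_line_to_cdxj(line):
    """Convert one tab-delimited ZipNum IDX line to a CDXJ-style line.

    Assumed field order: "<urlkey> <timestamp>", part, offset, length
    and an optional line number.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        raise ValueError("expected at least 4 tab-delimited fields")

    # The first field already holds the space-separated sortable key.
    key = fields[0]
    value = {
        "part": fields[1],
        "offset": int(fields[2]),
        "length": int(fields[3]),
    }
    # lineno is treated as optional.
    if len(fields) > 4:
        value["lineno"] = int(fields[4])

    return key + " " + json.dumps(value)
```

Because the key prefix is copied unchanged, the converted file keeps the same sort order as the original index.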

Any thoughts on what this should be called? For example, does this still fall within CDXJ, or is it under ORS/IDXJ?

I think this will help standardize the CDX Server API a bit more.

ikreymer commented 8 years ago

The JSON fields of this format are different from CDX, since each entry refers not to a single capture but to a block, and so it would contain a filename (currently called part), an offset, and a length, as well as an optional lineno.
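
To show how these block fields might be used, here is a small Python sketch that reads one block from a part file. It assumes that each block is an independent gzip member starting at offset and spanning length compressed bytes, and that the decompressed content is UTF-8 text with one CDX line per row. The function name read_block is hypothetical.

```python
import zlib


def read_block(part_path, offset, length):
    """Return the decompressed lines of one block in a part file.

    Assumes the bytes [offset, offset + length) form a complete gzip member.
    """
    with open(part_path, "rb") as fh:
        fh.seek(offset)
        compressed = fh.read(length)

    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    data = zlib.decompress(compressed, 16 + zlib.MAX_WBITS)
    return data.decode("utf-8").splitlines()
```

With an index entry such as {"part": "cdx-00229.gz", "offset": 23242146, "length": 143586}, the call would be read_block("cdx-00229.gz", 23242146, 143586), given a local copy of that part file.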

I suppose if CDXJ requires only offset, length and filename, then this can be considered CDXJ.

ikreymer commented 8 years ago

Ah, I see now that the current proposal places no restriction on the fields. So would this just be a custom @context definition?

@context ["<ZipNum Secondary Index>"]
@keys ["surt_uri", "timestamp"]

Might it be useful to optionally define the fields that are present in the JSON dict, for example to indicate whether certain optional fields are present?

ibnesayeed commented 8 years ago

Thanks for bringing this up. In my opinion, there is no reason to give it another name. It fits perfectly within the existing scope of CDXJ. Unlike CDX, CDXJ does not have predefined fields, and that is where the flexibility of the format comes from. The key portion can have any number of key fields, as long as the number of key fields is the same for every entry in a file (missing fields get a placeholder dash (-), as in CDX). The key fields are described in the meta section under the @keys entry. The value portion can have any number of arbitrary properties; the format itself does not impose any restriction on that and leaves it to the application.
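
The rule about a fixed number of key fields can be sketched in Python as follows. This is only an illustration under assumptions: meta lines are taken to start with "@" and to carry a JSON value after the first space (as in the @context/@keys example earlier in this thread), and the JSON value of an entry is taken to start at the first " {" on the line. The function name parse_cdxj is hypothetical, and the draft spec may define the syntax differently.

```python
import json


def parse_cdxj(lines):
    """Parse CDXJ-style lines into (meta, entries).

    meta maps names such as "@keys" to their JSON values; entries is a list
    of (key_fields, value_dict) tuples.
    """
    meta, entries = {}, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue

        # Meta lines: "@name <json value>" (assumed syntax).
        if line.startswith("@"):
            name, _, value = line.partition(" ")
            meta[name] = json.loads(value)
            continue

        # Data lines: space-separated key fields, then a JSON object.
        split_at = line.find(" {")
        if split_at == -1:
            raise ValueError("no JSON value found: " + line)
        key_fields = line[:split_at].split()
        value = json.loads(line[split_at + 1:])

        # Every entry must carry as many key fields as @keys declares;
        # a missing field is expected to appear as a "-" placeholder.
        keys = meta.get("@keys")
        if keys is not None and len(key_fields) != len(keys):
            raise ValueError("expected %d key fields: %s" % (len(keys), line))

        entries.append((key_fields, value))
    return meta, entries
```

The value dict is returned untouched, reflecting that any validation of its properties is left to the application.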

ibnesayeed commented 8 years ago

Yes, the @context should define the keywords used in the JSON dict and optionally also introduce validation rules if any.

ikreymer commented 8 years ago

> Yes, the @context should define the keywords used in the JSON dict and optionally also introduce validation rules if any.

Would @context be like a schema or something more informal?

ibnesayeed commented 8 years ago

> Would @context be like a schema or something more informal?

The idea of @context and @id is inspired by JSON-LD.