te-papa / collections-api

Museum of New Zealand Te Papa Tongarewa - Collections API

Getting started/Object model: Date fields #7

Closed fkleon closed 5 years ago

fkleon commented 5 years ago

Date fields are currently not very well explained in the documentation. Not sure where exactly that should go, probably the object model documentation rather than the Getting started guide?

Rough draft:

Date fields

The nature of the data in the collections makes handling dates especially challenging. Dates may be completely unknown, only partially known, or otherwise fuzzy. Date fields often refer to a date range rather than a single day. The exact notation used in some (date-describing) text fields varies, and human error introduces mistakes.

Date fields in the Collections API are exposed in various ways:

Verbatim date fields

Verbatim date fields usually contain the data as extracted from the collection management system. Good examples are verbatimBirthDate and verbatimDeathDate:

"verbatimBirthDate": "11 June 1865",
"verbatimDeathDate": "13 April 1935",

These are often of good quality and accompanied by "parsed" date fields containing an ISO date string:

"birthDate": "1865-06-11",
"deathDate": "1935-04-13",

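As a rough illustration of how a client might combine the two kinds of field, the sketch below prefers the parsed ISO value and only falls back to the verbatim text. The helper name best_date is hypothetical, and the sketch assumes both that the verbatim field is always named "verbatim" plus the capitalised parsed field name and that the verbatim text uses the "11 June 1865" notation shown above. Neither holds for every record.

```python
from datetime import date, datetime
from typing import Optional


def best_date(record: dict, field: str) -> Optional[date]:
    """Return the parsed ISO date if present, else try the verbatim text."""
    iso = record.get(field)
    if iso:
        try:
            return date.fromisoformat(iso)
        except ValueError:
            pass  # malformed ISO string: fall through to the verbatim field

    # Assumed naming convention: birthDate -> verbatimBirthDate
    verbatim = record.get("verbatim" + field[0].upper() + field[1:])
    if verbatim:
        try:
            # Only handles the "11 June 1865" notation; other notations fail
            return datetime.strptime(verbatim.strip(), "%d %B %Y").date()
        except ValueError:
            return None  # fuzzy or partial date: do not guess
    return None


record = {"birthDate": "1865-06-11", "verbatimBirthDate": "11 June 1865"}
print(best_date(record, "birthDate"))
```

Returning None for anything the parser does not recognise is deliberate: a fuzzy verbatim value is better shown to the user as-is than silently converted.
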
Facetable date fields

Facetable date fields are primarily designed to be used as facets. A facetable date field contains sub-fields describing aspects of the date, such as century, dayOfWeek, etc. Sometimes these values are labelled Unknown or are just approximations. Facetable dates are usually accompanied by a verbatim date field.

Facetable date fields are still experimental, and you should not rely on the facetable date fields to represent the "truth" about a date. Use the verbatim date fields if you need to convey the true date to a user, but you can use the facetable date fields to assist in data analysis or approximate categorisation.

An example is the production.facetCreatedDate:

"createdDate": "1906-01-01",
"facetCreatedDate": {
  "century": "20th century",
  "dayOfWeek": "Monday",
  "decadeOfCentury": "1900s",
  "era": "Common Era (CE)",
  "monthOfYear": "January",
  "temporal": "1906-01-01",
  "verbatim": "01 Jan 1906 / 31 Dec 1906",
  "year": "1906"
},
"verbatimCreatedDate": "1906"

Here the original value from the collection management system is verbatimCreatedDate: 1906. The createdDate is an ISO date approximation of that value. This is not always a good approximation, especially if the original data lacks precision: in this example a bare year has become 1 January 1906. The facetCreatedDate contains facetable values for that date, along with a date range approximation in its verbatim sub-field.
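The range held in the facet's verbatim sub-field can be turned into a pair of dates. This is a minimal sketch that assumes the "01 Jan 1906 / 31 Dec 1906" notation from the example above. The function name parse_facet_range is hypothetical, and other notations are simply rejected.

```python
from datetime import date, datetime
from typing import Optional, Tuple


def parse_facet_range(facet: dict) -> Optional[Tuple[date, date]]:
    """Parse "DD Mon YYYY / DD Mon YYYY" into (start, end), or None."""
    parts = [p.strip() for p in facet.get("verbatim", "").split("/")]
    if len(parts) != 2:
        return None  # e.g. "Unknown" or a missing sub-field
    try:
        start, end = (datetime.strptime(p, "%d %b %Y").date() for p in parts)
    except ValueError:
        return None  # notation differs from the assumed one
    return start, end


facet = {"temporal": "1906-01-01", "verbatim": "01 Jan 1906 / 31 Dec 1906"}
print(parse_facet_range(facet))
```

For the example record this gives the whole of 1906 as the range, which matches the verbatimCreatedDate more faithfully than the single day in createdDate.
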

Note that interpreting dates is a fairly complex problem. We do not claim to return entirely reliable values for the facetable date fields within the Collections API, but hope to offer a useful (and machine-readable) addition to the regular date fields. By making date aspects facetable, we hope to assist users in exploring the wealth of data.

Through the advanced search interface you can ask for all available facets on a date:

POST https://data.tepapa.govt.nz/collection/search

{
  "query" : "*",
  "size" : 5,
  "facets": [ {
    "field": "facetBirthDate",
    "size": 5
  } ]
}
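For clients written in Python, the same request could be assembled as sketched below. The helper build_facet_request is hypothetical and only constructs the request without sending it. The Content-Type header is an assumption, and any authentication headers the API may require are not shown.

```python
import json
import urllib.request

SEARCH_URL = "https://data.tepapa.govt.nz/collection/search"


def build_facet_request(field: str, size: int = 5) -> urllib.request.Request:
    """Build (but do not send) a search request asking for facets on one date."""
    body = {
        "query": "*",
        "size": size,
        "facets": [{"field": field, "size": size}],
    }
    return urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


request = build_facet_request("facetBirthDate")
print(request.get_method(), request.full_url)
```
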

The response lists every facetable sub-field of that date, with counts for the result set:

  "facets": {
    "facetBirthDate.monthOfYear": {
      "December": 223,
      "October": 239,
      "Unknown": 6278,
      "March": 222,
      "January": 224
    },
    "facetBirthDate.era": {
      "Unknown": 107,
      "Common Era (CE)": 8677
    },
    "facetBirthDate.decadeOfCentury": {
      "1910s": 556,
      "1900s": 562,
      "1880s": 652,
      "1890s": 614,
      "1860s": 562
    },
    "facetBirthDate.century": {
      "16th century": 107,
      "20th century": 3659,
      "18th century": 559,
      "17th century": 149,
      "19th century": 4078
    },
    "facetBirthDate.dayOfWeek": {
      "Monday": 352,
      "Thursday": 357,
      "Unknown": 6349,
      "Sunday": 353,
      "Saturday": 350
    }
  },
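The facet keys come back flat ("facetBirthDate.monthOfYear") and the buckets are not ordered by count. A client might regroup them per sub-field and sort them, as in this sketch. The function name group_facets is hypothetical, and the sample data is copied from two of the sub-fields in the response above.

```python
def group_facets(facets: dict, field: str) -> dict:
    """Map each sub-field of `field` to its buckets, largest count first."""
    prefix = field + "."
    grouped = {}
    for key, buckets in facets.items():
        if not key.startswith(prefix):
            continue  # facet belongs to a different field
        sub_field = key[len(prefix):]
        grouped[sub_field] = sorted(
            buckets.items(), key=lambda item: item[1], reverse=True
        )
    return grouped


facets = {
    "facetBirthDate.monthOfYear": {
        "December": 223, "October": 239, "Unknown": 6278,
        "March": 222, "January": 224,
    },
    "facetBirthDate.era": {"Unknown": 107, "Common Era (CE)": 8677},
}

for sub_field, buckets in group_facets(facets, "facetBirthDate").items():
    print(sub_field, buckets)
```

Sorting makes the size of the Unknown bucket obvious, which is a quick way to judge how far a given sub-field can be trusted for analysis.
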

Nested data fields

Nested objects only expose a lower level of detail. For dates, this usually shows up as the omission of facetable date fields. You can usually retrieve them by requesting the root-level entity directly.
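To decide whether a nested object is worth re-fetching at root level, a client could check which facetable date fields are absent. This sketch assumes the naming pattern seen in the examples above (verbatimCreatedDate alongside facetCreatedDate). The function name and the nested sample object are hypothetical.

```python
def missing_facet_fields(obj: dict) -> list:
    """List facet date fields that a verbatim date field implies but obj lacks."""
    missing = []
    for key in obj:
        if key.startswith("verbatim") and key.endswith("Date"):
            # Assumed convention: verbatimCreatedDate -> facetCreatedDate
            facet_key = "facet" + key[len("verbatim"):]
            if facet_key not in obj:
                missing.append(facet_key)
    return missing


# Hypothetical nested object that carries only the lower level of detail
nested = {"createdDate": "1906-01-01", "verbatimCreatedDate": "1906"}
print(missing_facet_fields(nested))
```

A non-empty result suggests requesting the root-level entity if the facet values are needed.
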

staplegun commented 5 years ago

Added in new wiki page - https://github.com/te-papa/collections-api/wiki/Search-strategies