Collections and Pagination

SiBell commented 4 years ago

We need some agreement on how we manage Collections (if we call them that; this might, for example, be a collection of platforms, thus paginated, as Si) that aren't ObservationCollections. Two options for how we do that are hydra:Collection and rdf:Bag.

It's also entirely possible that you might have all your platforms in one API (a lamp post API, say) and all your sensors in another (an air quality API, say) and all your historic observations in another (an observation collection API, say) and they would all just link to each other.

We also need an entrypoint that directs clients to these collections as a starting point. In other words, when I hit https://api.example.com it gives me links to a collection of sensors, a collection of platforms, a collection of observations, etc. It wouldn't necessarily need to give me all of those: you might not have a collection of all observations from all sensors (which could be huge, but might be useful) and might only have collections of observations under each sensor.

In theory, this would/could look something like...

GET https://api.example.com/

{
  "@context": {
    "@base": "https://api.example.com/",
    "uo": "https://urbanobservatory.github.io/standards/vocabulary/latest/",
    "title": "http://purl.org/dc/terms/title",
    "collections": {
      "@id": "uo:EntrypointCollections",
      "@container": "@id"
    }
  },
  "collections": {
    "/sensors": {
      "@type": ["@id", "uo:Collection", "uo:SensorCollection"],
      "title": "All sensors available in Newcastle upon Tyne"
    }
  }
}
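A minimal sketch of how a client might turn that entrypoint into absolute collection URLs. It assumes the "@context" is a plain object carrying "@base" (a full JSON-LD processor would do this expansion more generally), and the function name list_collections is hypothetical.

```python
from urllib.parse import urljoin

# Entrypoint document, trimmed from the example above.
entrypoint = {
    "@context": {"@base": "https://api.example.com/"},
    "collections": {
        "/sensors": {
            "@type": ["@id", "uo:Collection", "uo:SensorCollection"],
            "title": "All sensors available in Newcastle upon Tyne",
        }
    },
}


def list_collections(doc):
    """Return (absolute URL, title) pairs for each advertised collection.

    Assumes "@context" is a plain object carrying "@base"; remote or array
    contexts are not handled in this sketch.
    """
    base = doc["@context"]["@base"]
    return [
        (urljoin(base, key), value.get("title"))
        for key, value in doc["collections"].items()
    ]


collections = list_collections(entrypoint)
```

Here collections holds one pair, with "/sensors" resolved against the base to https://api.example.com/sensors.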

Originally posted by @lukessmith in https://github.com/urbanobservatory/standards/issues/18#issuecomment-578498701

SiBell commented 4 years ago

I like the idea of using the term Collections so that it's in keeping with an ObservationCollection.

I also like the idea of having a list of collections available from the entry point.

SiBell commented 4 years ago

With regard to pagination, here's my crack at an example; we can tweak or dismiss it if required.

So the user goes to the entry point: https://api.urbanobservatory.com/ and is presented with the following JSON response:

{
  "@context": {
    "@base": "https://api.urbanobservatory.com/",
    "uo": "https://urbanobservatory.github.io/standards/vocabulary/latest/",
    "sosa": "http://www.w3.org/ns/sosa/",
    "title": "http://purl.org/dc/terms/title",
    "collections": {
      "@id": "uo:EntrypointCollections",
      "@container": "@id"
    }
  },
  "collections": {
    "/sensors": {
      "@type": ["@id", "uo:Collection", "uo:SensorCollection"],
      "title": "All sensors available in Newcastle upon Tyne"
    },
    "/observations": {
      "@type": ["@id", "uo:Collection", "sosa:ObservationCollection"],
      "title": "All the observations collected by the urban observatory"
    }
  }
}

The user then follows the link to the observations collection and is presented with the following:

{
  "@context": {
    "@base": "https://api.urbanobservatory.com/",
    "uo": "https://urbanobservatory.github.io/standards/vocabulary/latest/",
    "sosa": "http://www.w3.org/ns/sosa/",
    "totalItems": "https://www.hydra-cg.com/spec/latest/core/#hydra:totalItems",
    "member": "https://www.hydra-cg.com/spec/latest/core/#hydra:member",
    "view": "https://www.hydra-cg.com/spec/latest/core/#hydra:view"
  },
  "@id": "https://api.urbanobservatory.com/observations?offset=0&limit=100&sortBy=resultTime",
  "@type": ["@id", "uo:Collection", "sosa:ObservationCollection"],
  "totalItems": "4980",
  "member": [
    {
      "madeBySensor": "thermistor-37f3kd",
      "resultTime": "2020-01-27T14:28:18.393Z",
      "hasResult": {
        "value": "22.9"
      }
    },
    {etc, etc}
  ],
  "view": {
    "@id": "https://api.urbanobservatory.com/observations?offset=0&limit=100&sortBy=resultTime",
    "@type": "PartialCollectionView",
    "next": "/observations?offset=100&limit=100&sortBy=resultTime"
  }
}

And then if you follow the next link you'll end up with:

{
  "@context": {
    "@base": "https://api.urbanobservatory.com/",
    "uo": "https://urbanobservatory.github.io/standards/vocabulary/latest/",
    "sosa": "http://www.w3.org/ns/sosa/",
    "totalItems": "https://www.hydra-cg.com/spec/latest/core/#hydra:totalItems",
    "member": "https://www.hydra-cg.com/spec/latest/core/#hydra:member",
    "view": "https://www.hydra-cg.com/spec/latest/core/#hydra:view"
  },
  "@id": "https://api.urbanobservatory.com/observations?offset=0&limit=100&sortBy=resultTime",
  "@type": ["@id", "uo:Collection", "sosa:ObservationCollection"],
  "totalItems": "4980",
  "member": [
    {
      "madeBySensor": "hygrometer-234fs",
      "resultTime": "2020-01-27T15:28:18.393Z",
      "hasResult": {
        "value": "82.3"
      }
    },
    {etc, etc}
  ],
  "view": {
    "@id": "https://api.urbanobservatory.com/observations?offset=100&limit=100&sortBy=resultTime",
    "@type": "PartialCollectionView",
    "previous": "/observations?offset=0&limit=100&sortBy=resultTime",
    "next": "/observations?offset=200&limit=100&sortBy=resultTime"
  }
}
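For illustration, a client could walk the whole collection by following view.next until it is absent. The sketch below is an assumption-laden stand-in: FAKE_PAGES and fetch replace a real HTTP call, the pages are shrunk to one member each, and relative links are resolved against the base URL by hand rather than by a JSON-LD library.

```python
from urllib.parse import urljoin

BASE = "https://api.urbanobservatory.com/"

# Hypothetical stand-in for the HTTP layer: two tiny pages keyed by URL.
FAKE_PAGES = {
    BASE + "observations?offset=0&limit=1": {
        "member": [{"madeBySensor": "thermistor-37f3kd"}],
        "view": {"next": "/observations?offset=1&limit=1"},
    },
    BASE + "observations?offset=1&limit=1": {
        "member": [{"madeBySensor": "hygrometer-234fs"}],
        "view": {"previous": "/observations?offset=0&limit=1"},
    },
}


def fetch(url):
    """Pretend to GET a page; a real client would issue an HTTP request here."""
    return FAKE_PAGES[url]


def iter_members(start_url, fetch_page=fetch, base=BASE):
    """Yield every member, following view.next until no next link is given."""
    url = start_url
    while url is not None:
        page = fetch_page(url)
        yield from page.get("member", [])
        next_link = page.get("view", {}).get("next")
        # Relative links such as "/observations?..." are resolved against the base.
        url = urljoin(base, next_link) if next_link else None


sensors = [m["madeBySensor"] for m in iter_members(BASE + "observations?offset=0&limit=1")]
```

The loop stops on the second page because its view has a previous link but no next link.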

I personally prefer the term links, as used by JSON:API, for holding the next and previous links, but view is ok if we want to stick with hydra's terminology.

I've shown examples here with offset, limit and sortBy, e.g. ?offset=0&limit=100&sortBy=resultTime, but individual observatories may wish to paginate in a slightly different way if it's more performant for them, e.g. ?page=2.

Is it potentially a pain for end-users if we only show partial URIs, e.g. /observations rather than https://api.urbanobservatory.com/observations? I'm guessing some browsers will let the user click on complete links and go straight to them.

Guessing we don't need to have any special HTTP headers, e.g. as described here, if we're handling the next and prev links in the JSON response?

I also wonder if there's a way of preventing commonly shared properties from being repeated. For example, if all members of the collection share exactly the same madeBySensor or inDeployment property, is there a way of only including it once? I was hoping the ObservationCollection docs would give an example, but they don't.

lukeshope commented 4 years ago

My strong preference for pagination is to avoid using JSON-LD for next/prev links. The problem with that approach is how you describe jumping to a specific page, or searching the collection.

This is what I believe JSON Schema should be used for, because it has more flexibility, like defining validation on query parameters.

IRIs relative to @base shouldn't be an issue if the elements are expanded in code first, using the JSON-LD algorithms. This is something the library I've been working on does automatically.

SiBell commented 4 years ago

So we have a meta object instead? As in your example here.

And the user can look at the schema for more details on the pagination properties? E.g. what the maximum value for the limit can be.
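As a rough sketch of what "the schema defines the pagination properties" could mean in practice, the fragment below uses JSON Schema keywords (type, minimum, maximum, default) and a hand-rolled check. The bounds are invented for illustration, validate_pagination is a hypothetical name, and a real service would more likely pass the schema to a JSON Schema validator library.

```python
# Hypothetical schema fragment using JSON Schema keywords; the bounds are invented.
PAGINATION_SCHEMA = {
    "type": "object",
    "properties": {
        "offset": {"type": "integer", "minimum": 0, "default": 0},
        "limit": {"type": "integer", "minimum": 1, "maximum": 1000, "default": 100},
    },
}


def validate_pagination(query, schema=PAGINATION_SCHEMA):
    """Apply defaults and check integer bounds; raise ValueError on failure.

    Only the integer, minimum, maximum and default keywords are handled here.
    """
    result = {}
    for name, rules in schema["properties"].items():
        raw = query.get(name, rules.get("default"))
        try:
            value = int(raw)
        except (TypeError, ValueError):
            raise ValueError(f"{name} must be an integer")
        if "minimum" in rules and value < rules["minimum"]:
            raise ValueError(f"{name} must be >= {rules['minimum']}")
        if "maximum" in rules and value > rules["maximum"]:
            raise ValueError(f"{name} must be <= {rules['maximum']}")
        result[name] = value
    return result


# Query string values arrive as text; a missing offset falls back to its default.
params = validate_pagination({"limit": "50"})
```

With this fragment a client can read the maximum limit straight from the schema, and the server rejects ?limit=5000 using the same definition.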

lukeshope commented 4 years ago

Yeah, it doesn't have to be a meta object, it could be anything really, but the schema would reference an element in the document using a JSON pointer, #/meta/current for example. The templatePointers in this bit are an example.
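To make the JSON pointer idea concrete, here is a small sketch of resolving a pointer such as #/meta/current against a response document, following the escaping rules of RFC 6901. The function name resolve_pointer and the sample document are illustrative only.

```python
from urllib.parse import unquote


def resolve_pointer(document, pointer):
    """Resolve an RFC 6901 JSON Pointer, with or without a leading '#'."""
    if pointer.startswith("#"):
        # URI fragment form: strip '#' and undo percent-encoding.
        pointer = unquote(pointer[1:])
    if pointer == "":
        return document
    if not pointer.startswith("/"):
        raise ValueError("a JSON Pointer must be empty or start with '/'")
    node = document
    for token in pointer[1:].split("/"):
        # '~1' stands for '/' and '~0' for '~'; the order of replacement matters.
        token = token.replace("~1", "/").replace("~0", "~")
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


response = {"meta": {"current": {"offset": 0, "limit": 100}}}
current = resolve_pointer(response, "#/meta/current")
```

A schema can then refer to values in the response, for example the current offset, without caring whether the container is called meta or something else.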

I admit I don't know much about JSON:API though, so there might be another way. The one other thing in JSON Schema's favour though is that it is now fully aligned with OpenAPI (as of a few weeks back).

SiBell commented 4 years ago

Great to hear they're aligned.

I'm struggling a little to see how we'll code this up in practice. Are we nearing a point where we could create a really basic Node.js application that serves some dummy observatory data using the approaches discussed?

Guessing it will have the following:

Some JSON Schema files that define the data model.
A JSON Hyper Schema file or an OpenAPI YAML file that defines the API interface itself, e.g. the querystring parameters.
Some middleware that uses these schema files to validate incoming requests.
A step that adds the JSON-LD parts to the response, e.g. populates all the links to the various SSN, SOSA, UO, etc. definitions we've used.
A way to auto-generate documentation easily, ensuring it stays in sync with the API itself.

This blog post introduces a few libraries that may help.

SiBell commented 4 years ago

Probably worth ensuring that any solution we decide upon can also handle a cursor-based approach rather than just an offset-based approach. Comparison of the two approaches here.

SiBell commented 4 years ago

Ok, what do we think of this as an approach? A user makes the following request for observations:

GET https://api.urbanobservatory.ac.uk/observations?madeBySensor=thermometer-6A7

To which they get the following back:


{
  "@context": [
    "https://api.urbanobservatory.ac.uk/context/collection.jsonld",
    "https://api.urbanobservatory.ac.uk/context/observation.jsonld"
  ],
  "@id": "https://api.urbanobservatory.ac.uk/observations?madeBySensor=thermometer-6A7",
  "@type": [
    "Collection"
  ],
  "member": [
    {"@id": "observation-1002500", "etc": "etc"},
    {"@id": "observation-1002499", "etc": "etc"}
    .
    .
    {"@id": "observation-1002401", "etc": "etc"}
  ],
  "meta": {
    "current": {
      "@id": "https://api.urbanobservatory.ac.uk/observations?madeBySensor=thermometer-6A7&sortBy=resultTime&sortOrder=desc&resultTime__lte=2020-03-20T16:42:55.033Z&offset=0&limit=100",
      "madeBySensor": "thermometer-6A7",
      "sortBy": "resultTime",
      "sortOrder": "desc",
      "resultTime": {
        "lte": "2020-03-20T16:42:55.033Z"
      },
      "offset": 0,
      "limit": 100
    },
    "next": {
      "@id": "https://api.urbanobservatory.ac.uk/observations?madeBySensor=thermometer-6A7&sortBy=resultTime&sortOrder=desc&resultTime__lte=2020-03-20T16:42:55.033Z&offset=100&limit=100",
      "madeBySensor": "thermometer-6A7",
      "sortBy": "resultTime",
      "sortOrder": "desc",
      "resultTime": {
        "lte": "2020-03-20T16:42:55.033Z"
      },
      "offset": 100,
      "limit": 100
    },
    "count": 100,
    "total": 18456
  }
}

Key points

The meta objects for current and next not only contain the links, but also detail the parameters used to construct the link. Having these parameters easily accessible can be useful to frontend applications. For example if a user clicks a next button on the webpage the parameters may be added to the end of the URL in the browser's address bar.
The original request had a query string parameter to filter by sensor, therefore this is included in the meta objects.
Properties such as sortBy and sortOrder were not explicitly set in the original request, but the server added default values for them. These are included in the response so that it's clear to the user what the defaults are.
Likewise, an upper limit for the resultTime wasn't explicitly set in the original request, but this parameter is added to ensure that the offset always "offsets" from the same point in time. This time is either the time of the request or the time of the most recent observation in the current set of observations. It made sense for resultTime to be an object rather than "resultTime__lte": "2020-03-20T16:42:55.033Z", as it would be difficult to define what the key resultTime__lte means, whereas it's far easier to define what resultTime and lte mean.
If we followed the next link, the meta object would then contain a previous object. We might also want to allow last and first objects.
The count and total properties detail how many items are in this page of the collection, and how many items in total are available on the server side, respectively.
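The key points above could be implemented roughly as follows. This is only a sketch under assumptions: build_meta, flatten and link are hypothetical helper names, the defaults mirror the example response, and the snapshot time is passed in rather than taken from a clock.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.urbanobservatory.ac.uk/observations"
DEFAULTS = {"sortBy": "resultTime", "sortOrder": "desc", "offset": 0, "limit": 100}


def flatten(params):
    """Turn {"resultTime": {"lte": t}} into {"resultTime__lte": t} for the query string."""
    flat = {}
    for key, value in params.items():
        if isinstance(value, dict):
            for op, operand in value.items():
                flat[f"{key}__{op}"] = operand
        else:
            flat[key] = value
    return flat


def link(params):
    """Pair the full URL with the parameters used to construct it."""
    query = urlencode(flatten(params), safe=":")  # keep ':' readable in timestamps
    return {"@id": f"{BASE_URL}?{query}", **params}


def build_meta(request_params, snapshot_time, count, total):
    """Build the meta object: explicit parameters first, then server defaults."""
    current = dict(request_params)
    current.setdefault("sortBy", DEFAULTS["sortBy"])
    current.setdefault("sortOrder", DEFAULTS["sortOrder"])
    current.setdefault("resultTime", {"lte": snapshot_time})
    current.setdefault("offset", DEFAULTS["offset"])
    current.setdefault("limit", DEFAULTS["limit"])

    meta = {"current": link(current)}
    if current["offset"] > 0:
        previous_offset = max(0, current["offset"] - current["limit"])
        meta["previous"] = link({**current, "offset": previous_offset})
    if current["offset"] + count < total:
        meta["next"] = link({**current, "offset": current["offset"] + current["limit"]})
    meta["count"] = count
    meta["total"] = total
    return meta


meta = build_meta(
    {"madeBySensor": "thermometer-6A7"}, "2020-03-20T16:42:55.033Z", count=100, total=18456
)
```

For the request in the example, meta["current"]["@id"] and meta["next"]["@id"] come out as the same URLs shown in the response above, and no previous object is produced because the offset is 0.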

This seems like a nice solution to me, although I wonder if I'm essentially replicating what JSON Schema/Hyper-Schema is supposed to achieve.

Joe-Heffer-Shef commented 4 years ago

As far as my experience goes, this looks like a nice solution (it's better than most data endpoints, anyway.) You've put your finger on my reservations here:

This seems like a nice solution to me, although I wonder if I'm essentially replicating what JSON Schema/Hyper-Schema is supposed to achieve.

Surely it's re-inventing the wheel to invent a homebrew pagination system?

Also, I don't understand where there is an offset and limit parameter here? I have in mind the blog post from Slack where they contrast offsets vs. cursors for iterating through large datasets.

SiBell commented 4 years ago

Surely it's re-inventing the wheel to invent

Definitely worth avoiding this where possible. I'll raise this point on the technical call tomorrow and see what everyone thinks. Let us know if you wish to join @Joe-Heffer-Shef.

I don't understand where there is an offset and limit parameter here?

@lukessmith and I had a quick chat about this offline. Our conclusion was that there are use cases for both. If we do decide to adopt my approach above, there's no reason why we couldn't swap out the offset and limit properties for a cursor instead. However, we felt that when it came to requesting observations, the offset and limit approach made more sense. Our worry with the cursor approach is that it could get rather complex to manage on the server/database side. The cursor approach relies on having a unique sequential column in your database table. Initially resultTime sounds like an obvious choice for this, but then we'd get into issues when multiple observations occur at the same time. You could use a sequential row index instead, but then it becomes tricky if you want the observations returned in chronological order, or perhaps ordered by madeBySensor.
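For reference, one common workaround for the tied-timestamp problem is a composite cursor that pairs resultTime with a unique id as a tie-breaker. The sketch below works on an in-memory list; the rows, the id field and the function names are all hypothetical, and it is not something the thread settled on.

```python
import base64
import json

# Hypothetical rows: two observations share the same resultTime.
ROWS = [
    {"id": 1, "resultTime": "2020-01-27T14:00:00Z"},
    {"id": 2, "resultTime": "2020-01-27T15:00:00Z"},
    {"id": 3, "resultTime": "2020-01-27T15:00:00Z"},
    {"id": 4, "resultTime": "2020-01-27T16:00:00Z"},
]


def sort_key(row):
    # Same-format UTC ISO 8601 strings sort chronologically; id breaks ties.
    return (row["resultTime"], row["id"])


def encode_cursor(row):
    """Pack the sort key of the last row served into an opaque token."""
    return base64.urlsafe_b64encode(json.dumps(sort_key(row)).encode()).decode()


def decode_cursor(cursor):
    return tuple(json.loads(base64.urlsafe_b64decode(cursor.encode())))


def page_after(rows, cursor=None, limit=2):
    """Return (page, next_cursor) for the rows strictly after the cursor."""
    ordered = sorted(rows, key=sort_key)
    if cursor is not None:
        last = decode_cursor(cursor)
        ordered = [r for r in ordered if sort_key(r) > last]
    page = ordered[:limit]
    next_cursor = encode_cursor(page[-1]) if len(ordered) > limit else None
    return page, next_cursor


first_page, cursor = page_after(ROWS)
second_page, end = page_after(ROWS, cursor)
```

The two rows with identical timestamps land on different pages without being skipped or repeated. Ordering by another property such as madeBySensor would need a different composite key, which is the complexity described above.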

The obvious downside with the offset, limit approach is that we have streams of data coming in all the time, and thus the starting point for our offset could be changing all the time. However, the following lines in my example provide a nice solution for this:

"resultTime": {
  "lte": "2020-03-20T16:42:55.033Z"
},

And we can always add a note in our docs telling users to be aware of duplicates when requesting paginated observations.
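A small simulation of why pinning resultTime lte helps, using an in-memory list in place of the database. The data and the page function are made up for the illustration. It only covers new observations with later timestamps; an observation that arrives late with an older resultTime could still shift the offsets, which is one reason a note about duplicates remains sensible.

```python
def page(observations, offset, limit, lte=None):
    """Newest-first page of observations, optionally pinned to resultTime <= lte."""
    rows = [o for o in observations if lte is None or o["resultTime"] <= lte]
    rows.sort(key=lambda o: o["resultTime"], reverse=True)
    return rows[offset:offset + limit]


# Four observations, one per minute (hypothetical data).
store = [{"id": n, "resultTime": f"2020-03-20T16:0{n}:00Z"} for n in range(1, 5)]

first = page(store, offset=0, limit=2)
pin = first[0]["resultTime"]  # the server echoes this back as resultTime.lte

# A new observation arrives before the client asks for page two.
store.append({"id": 5, "resultTime": "2020-03-20T16:05:00Z"})

unpinned_ids = [o["id"] for o in page(store, offset=2, limit=2)]
pinned_ids = [o["id"] for o in page(store, offset=2, limit=2, lte=pin)]
```

Without the pin, the second page repeats observation 3 because the new arrival pushed everything down by one. With the pin, the second page continues from where the first page stopped.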

Joe-Heffer-Shef commented 4 years ago

Yes, I'd like to attend the meeting tomorrow, please.

I can see that the difficulty solving this problem arises from the same sources as many other challenges in the observatories i.e. heterogeneous data sources and unknown/varied usage patterns.

SiBell commented 4 years ago

Yep you've hit the nail on the head.

The call is at 11:00 tomorrow on Zoom. Could you send me a quick message via this contact form, so I can send you the Zoom details? Alternatively, drop Patricio Ortiz an email as he'll be on the call too (I assume you've met).

Collections and Pagination #20