openactive / modelling-opportunity-data

OpenActive Modelling Opportunity Data specification
https://www.openactive.io/modelling-opportunity-data/

Partial Schedule generation threshold #159

Closed nickevansuk closed 5 years ago

nickevansuk commented 6 years ago

The data below contains another example where there is some useful data held within the Schedule of the SessionSeries that is not intended to generate ScheduledSessions.

Suggest we formalise the generation threshold, either by adding wording to the specification to the effect of "Data consumers are only expected to expand a schedule into a calendar of events if the following conditions are met ...", or by including an explicit property, e.g. "generateSchedule": true, that would trigger generation for data consumers.

Ideally if "generateSchedule": true, the data consumer could display:

Next sessions: Sat 22 Sep at 11:00, Sat 29 Sep at 11:00

If "generateSchedule": false, the data consumer is expected not to generate, and only to display the schedule information as-is:

Next sessions: Saturdays at 11:00am

This is important as in some systems (e.g. EMD) "Saturdays at 11:00am" is all that is stored (and the user is likely to want to double-check whether the event is indeed happening this Saturday, especially if this Saturday happens to be Christmas Day), whereas in other systems the schedule is designed to generate bookable slots and hence "Sat 22 Sep at 11:00" would be accurate and reliable, as an exceptDate will likely get added by the activity provider to ensure they don't get any bookings on Christmas day.

Otherwise the expected rendering of a schedule is left ambiguous.
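
A minimal Python sketch of the two renderings described above. It assumes the hypothetical "generateSchedule" flag proposed in this comment (it is not part of the specification) and a schedule that carries "byDay" and "startTime"; the helper names `next_sessions` and `render` are illustrative only.

```python
from datetime import date, timedelta

# Names in date.weekday() order (Monday == 0); fixed lists avoid locale issues.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def next_sessions(schedule, today, count=2):
    """Return the next `count` dates on or after `today` that match byDay."""
    wanted = {DAYS.index(d.rsplit("/", 1)[-1]) for d in schedule.get("byDay", [])}
    found, day = [], today
    while wanted and len(found) < count:
        if day.weekday() in wanted:
            found.append(day)
        day += timedelta(days=1)
    return found


def render(schedule, today):
    """Show concrete dates only when the (hypothetical) flag is true."""
    start = schedule["startTime"]
    if schedule.get("generateSchedule"):  # assumed flag, proposed in this thread
        parts = [
            f"{DAYS[d.weekday()][:3]} {d.day} {MONTHS[d.month - 1]} at {start}"
            for d in next_sessions(schedule, today)
        ]
        return "Next sessions: " + ", ".join(parts)
    # Otherwise display the schedule information as-is, without expanding it.
    names = ", ".join(d.rsplit("/", 1)[-1] + "s" for d in schedule["byDay"])
    return f"Next sessions: {names} at {start}"
```

Called with a Saturday schedule and a date of 17 Sep 2018, `render` returns the expanded form when the flag is true and the summary form ("Saturdays at 11:00") when it is false or absent.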

{
  "@context": [
    "https://openactive.io/",
    "https://openactive.io/ns-beta"
  ],
  "type": "ScheduledSession",
  "identifier": 71350,
  "superEvent": {
    "identifier": "WN006115",
    "type": "SessionSeries",
    "programme": {
      "type": "Brand",
      "name": "Walking Netball",
      "description": "Walking Netball has evolved from a growing demand for walking sports. Often, one of netball’s strengths is that people never forget playing the sport and the memories as well as the love for the game never leave.\r\n\r\nWalking Netball is a slower version of the game; it is netball, but at a walking pace. The game has been designed so that anyone can play it regardless of age or fitness level.\r\n\r\nFrom those who have dropped out of the sport they love due to serious injury, to those who believed they had hung up their netball trainers many years ago, it really is for everyone.\r\n\r\nWomen the length of the country have begun playing the game of Walking Netball for the fun, laughter and camaraderie the social session brings, as much as the health benefits on offer. It can give those who feel isolated an outlet, provide an activity for those who don’t deem themselves fit enough to run anymore and offer a stepping stone for those looking for a pathway back into netball.",
      "logo": {
        "type": "ImageObject",
        "url": "https://www.englandnetball.co.uk/app/uploads/2016/04/Walking-Netball.jpg"
      },
      "url": "https://www.englandnetball.co.uk/my-game/walking-netball/",
      "beta:video": "https://www.youtube.com/watch?v=qzQfnv7sFPg"
    },
    "name": "Walking Netball",
    "description": "Walking Netball has evolved from a growing demand for walking sports. Often, one of netball’s strengths is that people never forget playing the sport and the memories as well as the love for the game never leave.\r\n\r\nWalking Netball is a slower version of the game; it is netball, but at a walking pace. The game has been designed so that anyone can play it regardless of age or fitness level.\r\n\r\nFrom those who have dropped out of the sport they love due to serious injury, to those who believed they had hung up their netball trainers many years ago, it really is for everyone.\r\n\r\nWomen the length of the country have begun playing the game of Walking Netball for the fun, laughter and camaraderie the social session brings, as much as the health benefits on offer. It can give those who feel isolated an outlet, provide an activity for those who don’t deem themselves fit enough to run anymore and offer a stepping stone for those looking for a pathway back into netball.",
    "activity": [
      {
        "id": "https://openactive.io/activity-list#aba839fb-5cd2-4042-b651-c09c86bce1e2",
        "type": "Concept",
        "prefLabel": "Walking Netball",
        "inScheme": "https://openactive.io/activity-list"
      }
    ],
    "category": [
      {
        "type": "Concept",
        "prefLabel": "Coached Sessions"
      }
    ],
    "level": [
      "Beginner"
    ],
    "eventSchedule": {
      "type": "Schedule",
      "endDate": "2018-10-30",
      "repeatFrequency": "P14D"
    },
    "organizer": {
      "type": "Organization",
      "name": "England Netball",
      "url": "https://www.englandnetball.co.uk"
    },
    "sameAs": [
      "https://www.facebook.com/southdurhamandclevelandnetball/",
      " Netball In South Durham & Cleveland @NetballInSDandC"
    ],
    "location": {
      "type": "Place",
      "name": "Thornaby Pavillion",
      "address": {
        "type": "PostalAddress",
        "postalCode": "TS17 9EW",
        "addressCountry": "GB"
      },
      "geo": {
        "type": "GeoCoordinates",
        "latitude": 54.5386474,
        "longitude": -1.290952
      }
    },
    "offers": [
      {
        "type": "Offer",
        "name": "Full price cost",
        "price": 3.5,
        "priceCurrency": "GBP"
      }
    ],
    "leader": [
      {
        "type": "Person",
        "name": "Liz James",
        "email": "izziescissors@hotmail.co.uk"
      }
    ]
  },
  "eventStatus": "https://schema.org/EventScheduled",
  "startDate": "2018-09-11T11:00:00Z",
  "endDate": "2018-09-11T12:00:00Z",
  "duration": "PT1H",
  "url": "https://www.englandnetball.co.uk/?pagename=session&sessionid=[71350&WN006115"
}

ldodds commented 6 years ago

I don't think we should be dictating how consumers should process the data.

A data consumer may decide not to expand a schedule into a calendar of events at all, regardless of the detail. It depends on their needs, application, etc.

If anything, I'd prefer to do the opposite: indicate to publishers what information is required in order to generate a useful schedule. If that isn't available then they can't expect a consumer to generate one reliably. And a consumer should be warned that they can't process it.

I'm not clear why this is marked as "blocking" as the data seems legal based on current draft?

nickevansuk commented 6 years ago

I guess the difference in semantics is differentiating:

(a) "Saturdays at 11:00am" is all that is stored (and the user is likely to want to double-check whether the event is indeed happening this Saturday, especially if this Saturday happens to be Christmas Day). May have maximumAttendeeCapacity in SessionSeries but will not have remainingAttendeeCapacity in subEvents as that level of detail is not stored, and it will not have subEvents (and generated subEvents are entirely fictional)

(b) The schedule is designed to generate bookable slots and hence "Sat 22 Sep at 11:00" would be accurate and reliable, as an exceptDate will likely get added by the activity provider to ensure they don't get any bookings on Christmas day. Also likely to have remainingAttendeeCapacity in subEvents and maximumAttendeeCapacity in subEvent or SessionSeries.

EMD is (a), Bookwhen and BookingBug are (b).

If we don’t provide a way to differentiate on this we're at risk of misleading users based on inaccurate information, as treating (a) as (b) would lead to events being generated that are not necessarily accurate.

There is currently no indication in the data of which type of schedule this is.

Marked as "blocking" as it affects EMD data.

nickevansuk commented 6 years ago

@petermeldrum @Jadecation - any thoughts on this? Do EMD and Open Sessions store (a) or (b) above and is it worth differentiating?

petermeldrum commented 6 years ago

Hi @nickevansuk we are currently (a) as well. Although soon to be moving to sub events to allow for remaining capacity.

Jadecation commented 6 years ago

Hi @nickevansuk we are currently (a). I agree that there needs to be a flag/tag to inform data consumers that the data is (a) so that they can inform consumers through their searches etc. i.e. 'This activity may not run every week, please check with organiser'

nickevansuk commented 6 years ago

Ok great, thanks both!

In which case I propose a boolean of something like either "generateSchedule": false or "isSummaryScheduleOnly": true to differentiate these

Any thoughts on names welcome!

ldodds commented 6 years ago

Thanks all for the input on what level of detail is available. It's useful to understand what data people have.

However I'm still not convinced that adding a flag like generateSchedule is actually necessary, or the best option here.

The original intention for Schedule was that it should be a machine-readable description of a repeating calendar entry. The specification in the earlier drafts reflected this and included a number of required fields that made the Schedule useful for generating a calendar, driving booking, etc.

For events that are regular, but subject to change, we have subEvent, and it sounds from @petermeldrum as though this is useful for communicating extra details about forthcoming events, in addition to recording that they are actually running.

For scenarios like (a), where a platform only has a very simplistic schedule ("Saturdays at 11am"), there are other options:

My plan was to document that a data consumer SHOULD only generate events from a schedule if: it has a scheduledEventType, startTime, endTime, one of the byXXX properties, etc. If these aren't provided then a consumer can't generate a reliable schedule. This is particularly important for scheduledEventType as they wouldn't know what type of Event to generate and hence run the risk of not applying rules of inheritance of properties, etc correctly.
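
As a rough sketch of that check in Python: the property list below is taken from the sentence above, and because that sentence ends in "etc." it should be read as an assumption rather than the definitive conformance rule. The "byXXX" properties are assumed to be schema.org's byDay, byMonth and byMonthDay, and the function name `can_generate_events` is hypothetical.

```python
# Properties named in the comment above; the set is an assumption because
# the original wording ends with "etc.".
REQUIRED = ("scheduledEventType", "startTime", "endTime")
BY_PROPERTIES = ("byDay", "byMonth", "byMonthDay")


def can_generate_events(schedule):
    """Return (ok, missing) for a Schedule given as a dict parsed from JSON."""
    missing = [name for name in REQUIRED if not schedule.get(name)]
    if not any(schedule.get(name) for name in BY_PROPERTIES):
        missing.append("one of " + "/".join(BY_PROPERTIES))
    return (not missing, missing)
```

Applied to the eventSchedule in the example at the top of this issue (only endDate and repeatFrequency), the check reports every required property as missing, so a consumer following this rule would not generate events from it.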

Where a publisher, e.g. EMD, cannot provide a complete Schedule and doesn't want to put the schedule in a description, they can just include a partial schedule as we currently specify. An extra flag to indicate the quality/completeness of the data is unnecessary, as it's clear from the level of detail provided.

As a bit of feedback on @nickevansuk comment here:

If we don’t provide a way to differentiate on this we're at risk of misleading users based on inaccurate information, as treating (a) as (b) would lead to events being generated that are not necessarily accurate.

As noted above, they can be differentiated. But even in the case where a system did generate a schedule, this won't necessarily be problematic. I might still want to project dates into the future to, for example, tell someone there's an event they can attend when booking a trip.
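
Projecting dates forward could look something like the Python sketch below. It only handles a weekly schedule expressed with byDay, treats exceptDate values as plain dates, and caps the projection at a caller-supplied horizon; the function name `expand_weekly` and the horizon parameter are illustrative assumptions, not anything defined by the specification.

```python
from datetime import date, datetime, time, timedelta

# Names in date.weekday() order (Monday == 0).
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def expand_weekly(schedule, horizon):
    """Yield start datetimes of a weekly byDay schedule, up to `horizon` (a date)."""
    wanted = {DAYS.index(d.rsplit("/", 1)[-1]) for d in schedule["byDay"]}
    # Assumption: exceptDate entries are ISO dates, not date-times.
    excluded = {date.fromisoformat(d) for d in schedule.get("exceptDate", [])}
    start_time = time.fromisoformat(schedule["startTime"])
    day = date.fromisoformat(schedule["startDate"])
    last = horizon
    if "endDate" in schedule:
        last = min(last, date.fromisoformat(schedule["endDate"]))
    while day <= last:
        if day.weekday() in wanted and day not in excluded:
            yield datetime.combine(day, start_time)
        day += timedelta(days=1)
```

The horizon keeps a schedule without an endDate from being projected indefinitely; how far ahead is sensible is left to the consumer.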

I'm keen to avoid directives that tell consumers how they should process or display data and instead focus on conformance rules that indicate the quality/completeness of data instead.

nickevansuk commented 6 years ago

@ldodds for the reasons you've indicated around "I might still want to project dates into the future to, for example, tell someone there's an event they can attend when booking a trip", partial Schedule does seem preferable to Event description in terms of our recommendations/guidance. Even if the user needs to be told to "double check with the organiser" they might still find it useful to know when the session is running. Also allows the session to be part of dayOfWeek filters etc.

^ so in summary partial Schedule allows for the broadest number of usecases.

It's worth pointing out that the property proposed above is not intended to tell data consumers how they should process or display data, and is instead a signal to data consumers what type of data it is.

The data is captured two quite different ways, so shouldn't be considered the same. (b) is the gold standard, but (a) data still exists in many systems.

Classfinder data entry (a)

Only captures basic properties

[Image: screenshot, 2018-09-13 15:31:23]

Bookwhen data entry (b)

Expands the schedule and allows sessions to be cancelled and rescheduled etc.

[Image: screenshot, 2018-09-13 15:33:05]

Inference vs Explicit Property

So essentially we're saying instead of a specific property to indicate the granularity / type of the data, it might be better to have rules that can be used to infer the data granularity / type based on the data supplied?

Inference pros:

Explicit Property pros:

We could use scheduledEventType as the explicit property from which to drive validation and display decisions, which sounds like a good compromise, but we still need to be clear whether e.g. a scheduledEventType without an endDate is an error or should be read as a partial Schedule.

I'm mainly wary about implying anything from partial data, as data publishers are really good at accidentally supplying partial data across a range of properties, and data consumers need to do their best to deal with this already. It seems like an important property such as this should be made explicit to create the best chance for both providers and consumers to understand each other, and to be able to flag errors in implementation where they exist.

ldodds commented 6 years ago

a scheduledEventType without an endDate is an error or should be read as a partial Schedule

It's not an error to have a Schedule without an endDate even for a full machine-readable Schedule.

It merits a warning to publishers, that they should provide one if possible, and to consumers about risks of projecting too far ahead, but doesn't seem like an error to me.

Inference vs Explicit

It's not inference. To generate a proper subEvent, you need a minimum set of data; if you don't have that then you cannot correctly generate the required data.

You might be able to use the data for other purposes though, e.g. "this event usually runs on Saturdays at 11am".

Adding an explicit property to indicate whether the Schedule is complete enough just seems redundant, and also prone to further error (e.g. what if it's excluded by mistake, or what if it's included and other properties are missing). With the changes I've outlined I think we can provide clear guidance to both publishers and consumers with a minimum of extra properties/data.

nickevansuk commented 6 years ago

Just to document a conversation @ldodds and I had over Slack on this just now:

Key Issue: Processing hints like doX: true are poor modelling and in practice aren't effective. There are lots of circumstances across published data where it may be more or less suited to specific use cases, and we don't use processing hints elsewhere.

Analogous usecase to restate problem: The requirement isn't for processing hints, but for a statement of the mode of data collection. Imagine a dataset of lat/lng locations for bikes that marks some as “GPS” and some as “Cell Tower”, or even gives an “estimated accuracy” to indicate whether they are positioned with exact precision or e.g. within 300m. A data user might choose not to use the Cell Tower or >200m accuracy data for particular use cases. If the accuracy isn’t stated and it’s all lat/lngs then the data consumer has no information to make that distinction.

New proposal: PartialSchedule to be added as an additional type to describe type (a) data above. A PartialSchedule is not expected to be fully accurate, so inference should not be made about concrete instances by data consumers. This allows us to tighten up the conformance criteria around Schedule and allow PartialSchedule to be more loose.
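
One way to picture the split, as a hedged Python sketch: a strict check for "Schedule" and a loose one for the proposed "PartialSchedule". The strict field list is an assumption borrowed from the properties discussed earlier in this thread, and the function names are hypothetical.

```python
# Hypothetical conformance rules. The strict list reuses properties named
# earlier in this thread; it is not taken from the specification.
STRICT_FIELDS = ("scheduledEventType", "startTime", "endTime")


def check_schedule(schedule):
    """Return a list of conformance problems; an empty list means none found."""
    kind = schedule.get("type")
    if kind == "PartialSchedule":
        return []  # loose: any subset of properties is acceptable
    if kind == "Schedule":
        return [f"missing {name}" for name in STRICT_FIELDS if name not in schedule]
    return [f"unexpected type {kind!r}"]


def may_infer_instances(schedule):
    """Concrete events are only inferred from a conformant full Schedule."""
    return schedule.get("type") == "Schedule" and not check_schedule(schedule)
```

Under these rules the EMD-style data would pass as a PartialSchedule but would never be expanded into concrete instances, while the same properties published as a Schedule would fail the stricter check.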

Notes:

nickevansuk commented 6 years ago

Further notes on this issue:

Additional context

Further restatement of problem

Usecase 1

Usecase 2

nickevansuk commented 6 years ago

(from @ldodds)

So PartialSchedule covers Usecase 2.

Challenges for Usecase 1

Concerns over data quality and accuracy apply to more than just schedules. It's just that a series with a schedule might suffer from it more because of it being projected out into the future.

nickevansuk commented 6 years ago

(from @nickevansuk)

dateModified is problematic

As above, dateModified is problematic as it is valid to not modify an event for years, if that event genuinely hasn't changed - which means that it is legitimate to not update a schedule in a year if the classes are running reliably.

Could we use proximity to BAU?

The best proxy we’ve found so far to this type of accuracy/reliability is the system’s proximity to BAU. If it takes bookings and transacts it’s almost certainly going to be high on the provider’s priority list. The same applies if it’s perceived to have a large audience (e.g. “Let’s Ride”).

Most of the systems with high quality data that have been opened to date have been central to BAU in one form or another (regulatory, financial, etc).

Open Sessions and EMD are outliers, but there will be more like them as we get into the tail. This certainly applies to events too, less to facilityUse as generally those systems are master of record and already tied to BAU, but could still be possible.

So indicators of BAU:

So the property/tag/Boolean we were talking about was really for the third case, to say “we know this is not 100% trustworthy” rather than “we think this is exactly accurate”

Idea being that all other systems in the first two categories will just not be trusted by data consumers if they end up being unreliable, so may not be used as data sources, as they have tried to be part of BAU and failed.

But those in the third category understand their shortcomings and don’t try to be BAU, so are true to what they profess to be.

Ideally we’d give that third type of system a way to legitimately share the lower level of data granularity they have so they can publish a compliant feed within OpenActive, with data consumers being able to use the data by adding the appropriate caveats / UI.

As an aside: It's worth considering adding something to OA accreditation around this (as is already the case within the contract recommendations), specifically:

nickevansuk commented 6 years ago

@ldodds: I'm not convinced by this approach, it sounds like "if you have limited data, or poor data governance, use this type. Otherwise use Schedule". I can't see that trying to bake this into the model will really help. It looks like the issue is actually at the feed / provider / system level.

@nickevansuk: The other details in the event are likely to still be accurate, as they’re much less time sensitive. And anything significant like a studio closure or other location move would likely be a trigger for providers to update their many secondary systems. So it’s really schedules that are the main issue here from a practical perspective... the idea is give the non-BAU systems a way of expressing a schedule that’s not a "statement of total truth", but a "statement of intention". Arguably this kind of data is still valid (“Saturday at 11am”) and if presented correctly is still useful.

@ldodds: I'm still not convinced that we should use an extra property or type to try to capture what is basically someone's data quality/management practices as they apply to just one portion of the data they're publishing.

nickevansuk commented 6 years ago

(from @nickevansuk)

I guess it depends on the definition here... are we talking about “data management / data quality” or “data accuracy”. In the bike example above “accuracy” seems like a legitimate field? As would “sample rate”?

Using bike example, with an accuracy field set to <200m, or sample rate to 15mins:

{
  "type": "Bike",
  "geo": {
    "type": "GeoCoordinates",
    "latitude": 54.5386474,
    "longitude": -1.290952,
    "accuracy": {
      "type": "QuantitativeValue",
      "maxValue": 200,
      "unitCode": "MTR"
    }
  },
  "averagePollWaitDuration": "PT15M"
}

Suggest that "accuracy" / "sample rate" fields are not about “data quality / management”. Rather they are a statement about the accuracy of the data. The data could be very high data quality and still low accuracy? (As long as the “accuracy” field itself is correct!)

Ways we could solve this problem:

expectedUpdateFrequency

{
  "type": "Event",
  "expectedUpdateFrequency": "P6M",
  "eventSchedule": {
    "type": "Schedule",
    "startDate": "2017-01-01",
    "endDate": "2017-12-31",
    "repeatFrequency": "P1W",
    "byDay": [ "https://schema.org/Monday" ],
    "startTime": "06:30"
  }
}

So real-time or near-real-time systems (e.g. booking systems) could have “PT0S” or "PT15M" respectively? Anything >15min indicates a non-primary system, and data consumers can decide if and how they display the data accordingly.

{
  "type": "Event",
  "expectedUpdateFrequency": "PT15M",
  "eventSchedule": {
    "type": "Schedule",
    "startDate": "2017-01-01",
    "endDate": "2017-12-31",
    "repeatFrequency": "P1W",
    "byDay": [ "https://schema.org/Monday" ],
    "startTime": "06:30"
  }
}

Note this is only an "expected" frequency as an indicator of accuracy, rather than an exact accuracy, in the same way as "<200m" was an indicator.
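
A consumer-side reading of this could be sketched in Python as below. "expectedUpdateFrequency" is the hypothetical property floated in this comment, the 15-minute threshold is the rule of thumb stated above, and the parser is a simplified ISO 8601 duration reader that approximates a year as 365 days and a month as 30 days (both assumptions made only so that values like "P6M" can be compared).

```python
import re
from datetime import timedelta

# Simplified ISO 8601 duration pattern: PnYnMnWnDTnHnMnS, integers only.
_DURATION = re.compile(
    r"^P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)W)?(?:(\d+)D)?"
    r"(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$"
)


def parse_duration(text):
    """Parse a simple ISO 8601 duration; years and months are approximated."""
    match = _DURATION.match(text)
    if not match or text in ("P", "PT"):
        raise ValueError(f"unsupported duration: {text!r}")
    years, months, weeks, days, hours, minutes, seconds = (
        int(group or 0) for group in match.groups()
    )
    return timedelta(
        days=years * 365 + months * 30 + weeks * 7 + days,
        hours=hours, minutes=minutes, seconds=seconds,
    )


def is_primary_system(event, threshold=timedelta(minutes=15)):
    """Apply the 'anything >15min indicates a non-primary system' rule of thumb."""
    value = event.get("expectedUpdateFrequency")  # hypothetical property
    return value is not None and parse_duration(value) <= threshold
```

With the two examples above, the "PT15M" event would be treated as coming from a primary system and the "P6M" event would not; an event that omits the property is treated here as non-primary, which is itself a design choice a consumer might make differently.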

Jadecation commented 6 years ago

Hi both, my point to share is that, in terms of use cases A and B, we cannot presume that all 'A's will work towards or end up being 'B's over time. In a lot of cases in this industry the organisations that hold data will only ever hold the data relevant to A, and would only ever move to B if they wanted to implement a booking system, which a significant proportion will not. We need a system/standard that works with this reality. In terms of expressing data accuracy, the above all seems quite complicated to me! The current reality is that someone would have a comment box next to the day/time info that could say something like 'I don't run my activity during the school holidays, please ring for more info', and we need a sensible data solution for replicating this: maybe something like a flag so data consumers can add a 'warning' to check whether an activity is running before turning up.

nickevansuk commented 6 years ago

@Jadecation I can see what you mean regarding complexity. Thinking it through further, we might have got hung up on some more meta stuff there; it almost feels like we actually have a different type of data: "timetable" data.

Reflecting on my previous bike analogy, it's actually not that useful: this is more like the difference between a published bus timetable and a live feed of bus arrival times.

To summarise this thread, with a new proposal:

For "live data" publishing:

For "timetable data" publishing:

Propose that TimetableSchedule also has an additional recommended "additionalInformation" or "description" property which can be used for "i dont run my activity during the school holidays"

Note that all types of Schedule may be extrapolated for certain usecases such as search or discovery.

ldodds commented 6 years ago

Based on discussion so far, here's how I'm planning to proceed for now. Taking in @Jadecation feedback and the broader discussion:

I think there are two immediate needs here:

There are some broader concerns around how consumers handle and process schedules which I still think should be handled in other ways, rather than trying to build information into the data model. They are issues with how data is managed, interpreted and presented to end users.

On that basis, for this version of the specification I am going to add:

I am going to incorporate that into a new draft.