Closed nickevansuk closed 5 years ago
I don't think we should be dictating how consumers should process the data.
A data consumer may decide not to expand a schedule into a calendar of events at all, regardless of the detail. It depends on their needs, application, etc.
If anything, I'd prefer to do the opposite: indicate to publishers what information is required in order to generate a useful schedule. If that isn't available then they can't expect a consumer to generate one reliably. And a consumer should be warned that they can't process it.
I'm not clear why this is marked as "blocking" as the data seems legal based on current draft?
I guess the difference in semantics is differentiating:
(a) "Saturdays at 11:00am" is all that is stored (and the user is likely to want to double-check whether the event is indeed happening this Saturday, especially if this Saturday happens to be Christmas Day). May have maximumAttendeeCapacity in SessionSeries but will not have remainingAttendeeCapacity in subEvents as that level of detail is not stored, and it will not have subEvents (and generated subEvents are entirely fictional)
(b) The schedule is designed to generate bookable slots and hence "Sat 22 Sep at 11:00" would be accurate and reliable, as an exceptDate will likely get added by the activity provider to ensure they don't get any bookings on Christmas day. Also likely to have remainingAttendeeCapacity in subEvents and maximumAttendeeCapacity in subEvent or SessionSeries.
EMD is (a), Bookwhen and BookingBug are (b).
If we don’t provide a way to differentiate on this we're at risk of misleading users based on inaccurate information, as treating (a) as (b) would lead to events being generated that are not necessarily accurate.
There is currently no indication in the data of which type of schedule this is.
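To make the risk concrete, here is a rough Python sketch of the naive expansion a consumer might perform. The helper name expand_weekly and its simplified arguments are illustrative assumptions, not anything defined in the specification. It only handles weekly repetition on a single weekday and skips any exceptDate values it is given.

```python
from datetime import date, timedelta

# Hypothetical helper: expands a weekly schedule into concrete dates.
# Assumes ISO date strings and one weekday (0 = Monday ... 6 = Sunday).
def expand_weekly(start, end, weekday, except_dates=()):
    skipped = {date.fromisoformat(d) for d in except_dates}
    current = date.fromisoformat(start)
    last = date.fromisoformat(end)
    # Move forward to the first matching weekday on or after the start date.
    current += timedelta(days=(weekday - current.weekday()) % 7)
    occurrences = []
    while current <= last:
        if current not in skipped:
            occurrences.append(current)
        current += timedelta(weeks=1)
    return occurrences

# Type (a): only "Saturdays" is stored, so Saturday 25 December 2021 is generated.
summary_only = expand_weekly("2021-12-18", "2022-01-01", weekday=5)

# Type (b): the provider has added an exceptDate for Christmas Day.
bookable = expand_weekly("2021-12-18", "2022-01-01", weekday=5,
                         except_dates=["2021-12-25"])
```

With type (a) data the consumer has no way of knowing that the 25 December occurrence is fictional, which is the misleading case described above.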
Marked as "blocking" as affects EMD data
@petermeldrum @Jadecation - any thoughts on this? Do EMD and Open Sessions store (a) or (b) above and is it worth differentiating?
Hi @nickevansuk we are currently (a) as well. Although soon to be moving to sub events to allow for remaining capacity.
Hi @nickevansuk we are currently (a). I agree that there needs to be a flag/tag to inform data consumers that the data is (a) so that they can inform consumers through their searches etc. i.e. 'This activity may not run every week, please check with organiser'
Ok great, thanks both!
In which case I propose a boolean, something like either "generateSchedule": false or "isSummaryScheduleOnly": true, to differentiate these.
Any thoughts on names welcome!
Thanks all for the input on what level of detail is available. It's useful to understand what data people have.
However I'm still not convinced that adding a flag like generateSchedule is actually necessary, or the best option here.
The original intention for Schedule was that it should be a machine-readable description of a repeating calendar entry. The specification in the earlier drafts reflected this and included a number of required fields that made the Schedule useful for generating a calendar, driving booking, etc.
For events that are regular, but subject to change, we have subEvent, and it sounds from @petermeldrum that this is useful to help communicate extra details about forthcoming events in addition to recording that they are actually running.
For scenarios like (a), where a platform only has a very simplistic schedule such as "Saturdays at 11am", there are other options:
- That information could be in the Event description. If the expectation is that these looser schedules should only be displayed as text, rather than being expanded out, then this avoids the issue.
- The information is included in a partial Schedule. The relaxed requirements around required properties for Schedule, including making scheduledEventType recommended as per #157, make this a non-issue.
My plan was to document that a data consumer SHOULD only generate events from a schedule if it has a scheduledEventType, startTime, endTime, one of the byXXX properties, etc. If these aren't provided then a consumer can't generate a reliable schedule. This is particularly important for scheduledEventType, as they wouldn't know what type of Event to generate and hence run the risk of not applying rules of inheritance of properties, etc. correctly.
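As a sketch of what that check might look like on the consumer side (in Python, purely illustrative): the function name missing_for_generation is made up, and treating byDay, byMonth and byMonthDay as "the byXXX properties" is an assumption. The "etc." above means the real list of required properties may well be longer.

```python
# Properties named in the comment above. The final list is still under discussion.
REQUIRED = ("scheduledEventType", "startTime", "endTime")
# Assumption: these are the "byXXX" properties being referred to.
BY_PROPERTIES = ("byDay", "byMonth", "byMonthDay")

def missing_for_generation(schedule):
    """Return the names of properties a consumer would still need
    before it could reliably generate events from this schedule."""
    missing = [name for name in REQUIRED if not schedule.get(name)]
    if not any(schedule.get(name) for name in BY_PROPERTIES):
        missing.append("one of " + "/".join(BY_PROPERTIES))
    return missing

# A partial schedule of the "Saturdays at 11am" kind.
partial = {
    "type": "Schedule",
    "byDay": ["https://schema.org/Saturday"],
    "startTime": "11:00",
}

# A schedule carrying everything named above.
complete = dict(partial, endTime="12:00", scheduledEventType="ScheduledSession")
```

An empty result means the schedule is detailed enough to expand. Anything else tells the consumer (or a validator) exactly what is missing, without needing a separate flag.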
Where a publisher, e.g. EMD, cannot provide a complete Schedule and doesn't feel like including the schedule in a description, they can just include a partial schedule as we currently specify. An extra flag to indicate the quality/completeness of the data is unnecessary, as it's clear from the level of detail provided.
As a bit of feedback on @nickevansuk comment here:
If we don’t provide a way to differentiate on this we're at risk of misleading users based on inaccurate information, as treating (a) as (b) would lead to events being generated that are not necessarily accurate.
As noted above, they can be differentiated. But even in the case where a system did generate a schedule, this won't necessarily be problematic. I might still want to project dates into the future to, for example, tell someone there's an event they can attend when booking a trip.
I'm keen to avoid directives that tell consumers how they should process or display data and instead focus on conformance rules that indicate the quality/completeness of data instead.
@ldodds for the reasons you've indicated around "I might still want to project dates into the future to, for example, tell someone there's an event they can attend when booking a trip", partial Schedule does seem preferable to Event description in terms of our recommendations/guidance. Even if the user needs to be told to "double check with the organiser" they might still find it useful to know when the session is running. It also allows the session to be part of dayOfWeek filters etc.
^ so in summary partial Schedule allows for the broadest number of usecases.
It's worth pointing out that the property proposed above is not intended to tell data consumers how they should process or display data, and is instead a signal to data consumers what type of data it is.
The data is captured in two quite different ways, so shouldn't be considered the same. (b) is the gold standard, but (a) data still exists in many systems.
(a) Only captures basic properties
(b) Expands the schedule and allows sessions to be cancelled and rescheduled etc.
So essentially we're saying instead of a specific property to indicate the granularity / type of the data, it might be better to have rules that can be used to infer the data granularity / type based on the data supplied?
Inference pros:
Explicit Property pros:
We could use scheduledEventType as the explicit property from which to drive validation and display decisions, which sounds like a good compromise, but we still need to be clear whether e.g. a scheduledEventType without an endDate is an error or should be read as a partial Schedule.
I'm mainly wary about implying anything from partial data, as data publishers are really good at accidentally supplying partial data across a range of properties, and data consumers need to do their best to deal with this already. It seems like an important property such as this should be made explicit, to create the best chance for both providers and consumers to understand each other, and to be able to flag errors in implementation where they exist.
a scheduledEventType without an endDate is an error or should be read as a partial Schedule
It's not an error to have a Schedule without an endDate, even for a full machine-readable Schedule.
It merits a warning to publishers, that they should provide one if possible, and to consumers about risks of projecting too far ahead, but doesn't seem like an error to me.
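One way a consumer could act on that warning, sketched in Python: cap how far ahead it projects when there is no endDate. The function name projection_end and the 28-day default horizon are arbitrary assumptions for illustration only.

```python
from datetime import date, timedelta

def projection_end(schedule, today, horizon_days=28):
    """Return (last date to project to, warnings).

    A missing endDate produces a warning, not an error, and the
    projection is capped at the horizon. horizon_days is an assumed value.
    """
    horizon = today + timedelta(days=horizon_days)
    warnings = []
    end_date = schedule.get("endDate")
    if end_date is None:
        warnings.append("Schedule has no endDate; projection capped at horizon")
        return horizon, warnings
    # Never project past the publisher's own endDate.
    return min(date.fromisoformat(end_date), horizon), warnings

open_ended = projection_end({"startDate": "2018-09-01"}, date(2018, 9, 1))
bounded = projection_end({"endDate": "2018-09-10"}, date(2018, 9, 1))
```

The same warning text could equally be surfaced to publishers by a validator, to encourage them to supply an endDate where they can.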
Inference vs Explicit
It's not inference. To generate a proper subEvent, you need a minimum set of data; if you don't have that then you cannot correctly generate the required data.
You might be able to use the data for other purposes though, e.g. "this event usually runs on Saturdays at 11am".
Adding an explicit property to indicate whether the Schedule is complete enough just seems redundant, and also prone to further error (e.g. what if it's excluded by mistake, or what if it's included and other properties are missing). With the changes I've outlined I think we can provide clear guidance to both publishers and consumers with a minimum of extra properties/data.
Just to document a conversation @ldodds and I had over Slack on this just now:
Key Issue: Processing hints like doX: true are poor modelling and in practice aren't effective. There are lots of circumstances across published data where it may be more or less suited to specific use cases, and we don't use processing hints elsewhere.
Analogous usecase to restate problem: The requirement isn't for processing hints, more a statement of the mode of data collection. Consider a dataset of lat/lng locations for bikes that marks some as “GPS” and some as “Cell Tower”, or even gives an “estimated accuracy” to indicate whether they are positioned with exact precision or e.g. within 300m. A data user might choose not to use the Cell Tower or >200m accuracy data for particular use cases. If the accuracy isn’t stated and it’s all lat/lngs then the data consumer has no information to make that distinction.
New proposal: PartialSchedule to be added as an additional type to describe type (a) data above. A PartialSchedule is not expected to be fully accurate, so inference should not be made about concrete instances by data consumers. This allows us to tighten up the conformance criteria around Schedule and allow PartialSchedule to be more loose.
Notes:
urlTemplate and idTemplate in a PartialSchedule, as the whole point is that they’re not reliable; if they were, it should be a proper Schedule
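For illustration, a consumer could branch on the type when deciding how to word a listing. This Python sketch is an assumption about consumer behaviour, not part of the proposal: the function name describe and the caveat wording are invented, and it assumes byDay holds schema.org day URLs as in the examples elsewhere in this thread.

```python
def describe(schedule):
    """Build a short human-readable summary of a schedule.

    A PartialSchedule gets a caveat, because no concrete occurrence
    should be inferred from it.
    """
    # "https://schema.org/Saturday" -> "Saturdays"
    days = ", ".join(url.rsplit("/", 1)[-1] + "s"
                     for url in schedule.get("byDay", []))
    text = f"{days} at {schedule.get('startTime', 'time not stated')}"
    if schedule.get("type") == "PartialSchedule":
        return f"Usually {text} (please check with the organiser)"
    return text

partial_text = describe({
    "type": "PartialSchedule",
    "byDay": ["https://schema.org/Saturday"],
    "startTime": "11:00",
})
full_text = describe({
    "type": "Schedule",
    "byDay": ["https://schema.org/Saturday"],
    "startTime": "11:00",
})
```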
Further notes on this issue:
(from @ldodds)
So PartialSchedule covers Usecase 2.
Concerns over data quality and accuracy apply to more than just schedules. It's just that a series with a schedule might suffer from it more because of it being projected out into the future.
(from @nickevansuk)
As above, dateModified is problematic as it is valid to not modify an event for years, if that event genuinely hasn't changed - which means that it is legitimate to not update a schedule in a year if the classes are running reliably.
The best proxy we’ve found so far for this type of accuracy/reliability is the system’s proximity to BAU. If it takes bookings and transacts, it’s almost certainly going to be high on the provider’s priority list. The same applies if it’s perceived to have a large audience (e.g. “Let’s Ride”).
Most of the systems with high quality data that have been opened to date have been central to BAU in one form or another (regulatory, financial, etc).
Open Sessions and EMD are outliers, but there will be more like them as we get into the tail. This certainly applies to events too, less to facilityUse as generally those systems are master of record and already tied to BAU, but could still be possible.
So indicators of BAU:
So the property/tag/Boolean we were talking about was really for the third case, to say “we know this is not 100% trustworthy” rather than “we think this is exactly accurate”
Idea being that all other systems in the first two categories will just not be trusted by data consumers if they end up being unreliable, so may not be used as data sources, as they have tried to be part of BAU and failed.
But those in the third category understand their shortcomings and don’t try to be BAU, so are true to what they profess to be.
Ideally we’d give that third type of system a way to legitimately share the lower level of data granularity they have so they can publish a compliant feed within OpenActive, with data consumers being able to use the data by adding the appropriate caveats / UI.
As an aside: It's worth considering adding something to OA accreditation around this (as is already the case within the contract recommendations), specifically:
@ldodds: I'm not convinced by this approach, it sounds like "if you have limited data, or poor data governance, use this type. Otherwise use Schedule". I can't see that trying to bake this into the model will really help. It looks like the issue is actually at the feed / provider / system level.
@nickevansuk: The other details in the event are likely to still be accurate, as they’re much less time sensitive. And anything significant like a studio closure or other location move would likely be a trigger for providers to update their many secondary systems. So it’s really schedules that are the main issue here from a practical perspective... the idea is give the non-BAU systems a way of expressing a schedule that’s not a "statement of total truth", but a "statement of intention". Arguably this kind of data is still valid (“Saturday at 11am”) and if presented correctly is still useful.
@ldodds: I'm still not convinced that we should use an extra property or type to try to capture what is basically someone's data quality/management practices as they apply to just one portion of the data they're publishing.
(from @nickevansuk)
I guess it depends on the definition here... are we talking about “data management / data quality” or “data accuracy”. In the bike example above “accuracy” seems like a legitimate field? As would “sample rate”?
Using bike example, with an accuracy field set to <200m, or sample rate to 15mins:
{
  "type": "Bike",
  "geo": {
    "type": "GeoCoordinates",
    "latitude": 54.5386474,
    "longitude": -1.290952,
    "accuracy": {
      "type": "QuantitativeValue",
      "maxValue": 200,
      "unitCode": "MTR"
    }
  },
  "averagePollWaitDuration": "PT15M"
}
Suggest that "accuracy" / "sample rate" fields are not about “data quality / management”. Rather they are a statement about the accuracy of the data. The data could be very high data quality and still low accuracy? (As long as the “accuracy” field itself is correct!)
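To show how a data user might rely on such a field, here is a small Python sketch. The function name within_accuracy is made up, and excluding items that state no accuracy at all is an assumed policy, not something agreed in this thread.

```python
def within_accuracy(items, max_metres):
    """Keep items whose stated accuracy is at or below max_metres.

    Items with no stated accuracy are excluded, since the consumer
    has nothing to judge them by (an assumed policy).
    """
    kept = []
    for item in items:
        accuracy = item.get("geo", {}).get("accuracy")
        if accuracy is not None and accuracy["maxValue"] <= max_metres:
            kept.append(item)
    return kept

# Hypothetical records shaped like the Bike example above.
bikes = [
    {"id": "gps", "geo": {"accuracy": {"maxValue": 10, "unitCode": "MTR"}}},
    {"id": "cell-tower", "geo": {"accuracy": {"maxValue": 300, "unitCode": "MTR"}}},
    {"id": "unstated", "geo": {}},
]

usable = within_accuracy(bikes, 200)
```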
Ways we could solve this problem:
{
  "type": "Event",
  "expectedUpdateFrequency": "P6M",
  "eventSchedule": {
    "type": "Schedule",
    "startDate": "2017-01-01",
    "endDate": "2017-12-31",
    "repeatFrequency": "P1W",
    "byDay": [ "https://schema.org/Monday" ],
    "startTime": "06:30"
  }
}
So real-time or near-real-time systems (e.g. booking systems) could have "PT0S" or "PT15M" respectively? Anything >15min indicates a non-primary system, and data consumers can decide if and how they display the data accordingly.
{
  "type": "Event",
  "expectedUpdateFrequency": "PT15M",
  "eventSchedule": {
    "type": "Schedule",
    "startDate": "2017-01-01",
    "endDate": "2017-12-31",
    "repeatFrequency": "P1W",
    "byDay": [ "https://schema.org/Monday" ],
    "startTime": "06:30"
  }
}
Note this is only an "expected" frequency as an indicator of accuracy, rather than an exact accuracy, in the same way as "<200m" was an indicator.
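A consumer-side check against that 15 minute threshold might look roughly like the Python sketch below. The names parse_duration and is_primary_system are invented for illustration. The parser handles only simple ISO 8601 durations and approximates a month as 30 days and a year as 365 days, which is a simplification made here, not anything from the standard.

```python
import re
from datetime import timedelta

_DURATION = re.compile(
    r"^P(?:(?P<years>\d+)Y)?(?:(?P<months>\d+)M)?"
    r"(?:(?P<weeks>\d+)W)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)

def parse_duration(value):
    """Parse a simple ISO 8601 duration such as "PT15M" or "P6M"."""
    match = _DURATION.match(value)
    if not match or not any(match.groups()):
        raise ValueError(f"Unsupported duration: {value!r}")
    parts = {name: int(number)
             for name, number in match.groupdict().items() if number}
    # Approximation (assumption): 1 year = 365 days, 1 month = 30 days.
    days = (parts.get("years", 0) * 365 + parts.get("months", 0) * 30
            + parts.get("weeks", 0) * 7 + parts.get("days", 0))
    return timedelta(days=days, hours=parts.get("hours", 0),
                     minutes=parts.get("minutes", 0),
                     seconds=parts.get("seconds", 0))

def is_primary_system(event, threshold=timedelta(minutes=15)):
    """True when the stated update frequency is within the threshold."""
    return parse_duration(event["expectedUpdateFrequency"]) <= threshold
```

For the two examples above, the "PT15M" event would count as a primary system and the "P6M" event would not.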
Hi both, my point to share is that, in terms of the use cases A and B, we cannot presume that all 'A's will work towards or end up being 'B's over time. In a lot of cases in this industry the organisations that hold data will only ever hold the data relevant to A, and would only ever move to B if they wanted to implement a booking system, which a significant proportion will not. We need a system/standard that works with this reality. In terms of expressing data accuracy, the above all seems quite complicated to me! The current reality is that someone would have a comment box next to the day/time info that could say something like 'I don't run my activity during the school holidays, please ring for more info', and we need a sensible data solution for replicating this. Maybe something like a flag, so data consumers can add a 'warning' to check whether an activity is running before turning up.
@Jadecation I can see what you mean regarding complexity. Thinking it through further, we might have got hung up on some more meta stuff there. It almost feels like we actually have a different type of data: "timetable" data.
Reflecting on my previous bike analogy, it's actually not that useful: this is more like the difference between a published bus timetable and a live feed of bus arrival times.
To summarise this thread, with a new proposal:
For "live data" publishing:
Schedule - a compact representation of live event occurrence data
PartialSchedule - subclass of Schedule with loose constraints, representing an anticipated schedule in the future, which is not expected to be presented as concrete occurrences (specific dates/times) via extrapolation - useful for conveying "Every other Wednesday" even if live event occurrence data (e.g. in subEvent) only stretches 4 weeks ahead.
For "timetable data" publishing:
TimetableSchedule - subclass of Schedule which represents a "published timetable" rather than live occurrence data. Also not expected to be presented as concrete occurrences from extrapolation, but differentiates this so data consumers can present this as a "bus timetable" rather than as a live view (thinking of the way CityMapper does this for timetables vs. live bus arrivals).
Propose that TimetableSchedule also has an additional recommended "additionalInformation" or "description" property which can be used for "I don't run my activity during the school holidays".
Note that all types of Schedule may be extrapolated for certain usecases such as search or discovery.
Based on discussion so far, here's how I'm planning to proceed for now. Taking in @Jadecation feedback and the broader discussion:
I think there are two immediate needs here:
Event, e.g. "I don't run this during school holidays"
There are some broader concerns around how consumers handle and process schedules which I still think should be handled in other ways, rather than trying to build information into the data model. They are issues with how data is managed, interpreted and presented to end users.
On that basis, for this version of the specification I am going to add:
- A property of Event called schedulingNote, which organizers can use to add notes like "I don't run this in school holidays" or "We run this every week". By making this a separate property rather than using description, consumers may be able to do more with it. It's a property of Event because it can apply more broadly than just to a Schedule.
- PartialSchedule as a new type which will remain flexible. It will be provided so that publishers can include details such as "Every Weds" in a more machine-readable form than just a note. Schedule will be made stricter to require more properties, e.g. a specific start and end time.
I am going to incorporate that into a new draft.
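As a rough idea of how those two additions could look together, here is a hypothetical example in Python. The event content, the SessionSeries type chosen for it, and the listing_text helper are all made up for illustration, and the exact shape is whatever the new draft ends up specifying.

```python
import json

# Hypothetical event using the two additions described above.
event = {
    "type": "SessionSeries",
    "name": "Saturday morning run",
    "schedulingNote": "I don't run this in school holidays",
    "eventSchedule": {
        "type": "PartialSchedule",
        "byDay": ["https://schema.org/Saturday"],
        "startTime": "11:00",
    },
}

def listing_text(event):
    """Combine name, summary schedule and scheduling note into one line."""
    schedule = event.get("eventSchedule", {})
    # "https://schema.org/Saturday" -> "Saturday"
    days = ", ".join(url.rsplit("/", 1)[-1] for url in schedule.get("byDay", []))
    when = f"{days} {schedule.get('startTime', '')}".strip()
    parts = [event["name"], when, event.get("schedulingNote", "")]
    return " | ".join(part for part in parts if part)

print(json.dumps(event, indent=2))
print(listing_text(event))
```

Because the note is its own property rather than buried in description, a consumer can show it right next to the day and time, which is where a user deciding whether to turn up would look.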
The data below contains another example where there is some useful data held within the Schedule of the SessionSeries that is not intended to generate ScheduledSessions.
Suggest we formalise the generation threshold by having some wording in the specification to the effect of
Data consumers are only expected to expand a schedule into a calendar of events if the following conditions are met ...
or otherwise include an explicit property e.g. "generateSchedule": true that would trigger generation for data consumers.
Ideally if "generateSchedule": true, the data consumer could display:
If "generateSchedule": false, the data consumer is expected not to generate, and only to display the schedule information as-is:
This is important as in some systems (e.g. EMD) "Saturdays at 11:00am" is all that is stored (and the user is likely to want to double-check whether the event is indeed happening this Saturday, especially if this Saturday happens to be Christmas Day), whereas in other systems the schedule is designed to generate bookable slots and hence "Sat 22 Sep at 11:00" would be accurate and reliable, as an exceptDate will likely get added by the activity provider to ensure they don't get any bookings on Christmas day.
Otherwise the expected rendering of a schedule is left ambiguous.