openactive-contrib / harvester

Harvester for slurping OA feed data, written in Node.js
MIT License

Data Normalisations #31

Closed odscjames closed 4 years ago

odscjames commented 4 years ago

Questions

Hi Tim,

So far I haven't been able to find any Schedule data (at the top level, nor looking for the eventSchedule property on any Events, nor looking for eventSchedule on any subEvents) or any EventSeries data. Does that sound right? Might it be nested somewhere else that I haven't thought of?

I also have a couple more clarification questions about enhancing the data.

  1. How much of the organizer info should we pull out? Is indexing just the name (and/or legalName) enough, or do we need to make it an object and keep more like the contact details or location? Some publishers are providing more than just name, but many are not.

  2. We'd like a bit more advice on how and when to derive organizer information from the publisher. Is there a particular set of publishers which are always the organiser? (In which case we could set this in a settings variable somewhere and only pull it in those cases, and take the publisher-name from the API.)

2.b FacilityUse has 'provider' but not 'organizer'. Are these equivalent (for our purposes)?

  3. For any Event without a name or description for whatever reason, should we populate the name field with something useful for display, or just leave it empty?

Thanks!

Amy

Answer

Hi, Amy,

These are good questions. Comments are interleaved below, with a full response at the end.

So far I haven't been able to find any Schedule data (at the top level,
nor looking for the eventSchedule property on any Events, nor looking
for eventSchedule on any subEvents) or any EventSeries data. Does that
sound right? Might it be nested somewhere else that I haven't thought of?

eventSchedule is pretty frequent in the dataset (see e.g. https://bookwhen.com/api/openactive/event_types). Unfortunately it can live at various levels in the hierarchy. It's not always present, as publishers may instead provide a startDate. But you'll see it pretty often; I think all the Legend feeds use them.
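
Since eventSchedule can turn up at different depths, one way to look for it is to walk each item recursively instead of checking fixed locations. The following is only a rough sketch: the function name `findEventSchedules` and the sample item are made up for illustration and are not part of the harvester or of any real feed.

```javascript
// Hypothetical helper: collect every eventSchedule found anywhere in an item,
// together with the path at which it was found.
function findEventSchedules(node, path = '$', found = []) {
  if (Array.isArray(node)) {
    node.forEach((child, i) => findEventSchedules(child, `${path}[${i}]`, found));
  } else if (node !== null && typeof node === 'object') {
    for (const [key, value] of Object.entries(node)) {
      const childPath = `${path}.${key}`;
      if (key === 'eventSchedule') {
        // Record the schedule and do not descend into it any further.
        found.push({ path: childPath, schedule: value });
      } else {
        findEventSchedules(value, childPath, found);
      }
    }
  }
  return found;
}

// Made-up sample item, only to show schedules sitting at two different levels.
const sampleItem = {
  '@type': 'SessionSeries',
  name: 'Example class',
  subEvent: [
    { '@type': 'ScheduledSession', startDate: '2020-01-01T10:00:00Z' },
    { '@type': 'Event', eventSchedule: [{ '@type': 'PartialSchedule', repeatFrequency: 'P1W' }] },
  ],
  superEvent: { '@type': 'EventSeries', eventSchedule: { '@type': 'Schedule' } },
};

const schedulePaths = findEventSchedules(sampleItem).map((s) => s.path);
console.log(schedulePaths);
```

Recording the path alongside each schedule makes it easier to see which publishers put eventSchedule where.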

I also have a couple more clarification questions about enhancing the data.

1. How much of the organizer info should we pull out? Is indexing just
the name (and/or legalName) enough, or do we need to make it an object
and keep more like the contact details or location? Some publishers are
providing more than just name, but many are not.

Bof. Looking at a sample of the data, the semantics of 'organizer' are a bit loopy. Sometimes it refers to the direct provider of an activity; sometimes to a consortium of such providers. Meanwhile, the field was intended to capture the immediate point of contact for Events.

2. We'd like a bit more advice on how and when to derive organizer
information from the publisher. Is there a particular set of publishers
which are always the organiser? (In which case we could set this in a
settings variable somewhere and only pull it in those cases, and take
the publisher-name from the API.)

See below.

2.b FacilityUse has 'provider' but not 'organizer'. Are these equivalent
(for our purposes)?

Yes. I guess the intended semantic difference is that a 'facility' isn't 'organised' as such. But for current purposes they can be regarded as identical.

3. For any Event without a name or description for whatever reason,
should we populate the name field with something useful for display, or
just leave it empty?

Ummm. Something that indicates that the name is missing in the source data: 'NAME NOT PROVIDED' or similar.
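
As a minimal sketch of that placeholder, assuming the normalised record keeps a plain `name` string (the function name `normaliseName` and the constant are hypothetical):

```javascript
// Placeholder text suggested above; the exact wording is open to change.
const NAME_PLACEHOLDER = 'NAME NOT PROVIDED';

// Hypothetical helper: return a copy of the event whose name is either the
// trimmed source name or the placeholder when the name is missing or blank.
function normaliseName(event) {
  const name = typeof event.name === 'string' ? event.name.trim() : '';
  return { ...event, name: name === '' ? NAME_PLACEHOLDER : name };
}

console.log(normaliseName({ '@type': 'Event' }).name);
console.log(normaliseName({ '@type': 'Event', name: ' Yoga ' }).name);
```

Using a fixed, obviously artificial string keeps it easy to filter or count the affected records later.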

Reviewing the above, can we adopt the following solution?

  1. For 'organizer', just pull out the name
  2. Create a separate 'Data Source' (camel- or snake-cased as appropriate) field to capture the name of the dataset site
  3. Treat 'provider' as 'organizer'
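
Put together, the proposal could look something like the sketch below. This is an illustration under assumptions, not the harvester's actual code: the output field names `organizerName` and `dataSource`, the function names, and the fallback to `legalName` when `name` is absent are all hypothetical.

```javascript
// Hypothetical helper: pull a display name out of an organizer/provider value,
// which may be a plain string, an object, or missing altogether.
function extractName(party) {
  if (typeof party === 'string') {
    return party.trim() || null;
  }
  if (party && typeof party === 'object') {
    // Assumption: fall back to legalName when name is not supplied.
    const name = party.name || party.legalName;
    return typeof name === 'string' && name.trim() !== '' ? name.trim() : null;
  }
  return null;
}

// Hypothetical normaliser: keep only the organizer's name, treat 'provider'
// as 'organizer' when no organizer is present, and record the dataset site
// name in a separate field.
function normaliseOrganizer(item, datasetSiteName) {
  const party = item.organizer || item.provider;
  return {
    organizerName: extractName(party),
    dataSource: datasetSiteName,
  };
}

// Made-up examples.
console.log(normaliseOrganizer(
  { '@type': 'Event', organizer: { name: 'Acme Leisure', telephone: '01234 567890' } },
  'Example Dataset Site'
));
console.log(normaliseOrganizer(
  { '@type': 'FacilityUse', provider: { name: 'Town Pool' } },
  'Example Dataset Site'
));
```

Keeping the dataset site name in its own field means the organizer value is never silently replaced by the publisher.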

Does that make sense?

Tim

rhiaro commented 4 years ago

Closing because all of these have been spun out into new issues where appropriate.