Adding provenanceInformation

openactive-archive / conformance-services

Harvests and normalises OpenActive Opportunity feeds to a common representation

MIT License


Adding provenanceInformation #51

Open rhiaro opened 4 years ago

rhiaro commented 4 years ago

Each NormalisedEvent should have an additional field called provenanceInformation which contains:

feedUrl: the URL of the feed the event came from
publisherName: the name of the publisher of the feed the event came from
derivedFrom: a list of URLs containing the id of the original event, plus the id of the original event's parent if applicable. If no suitable (globally unique) id can be found, append the value of identifier to the URL of the feed.
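
The three fields above could be assembled roughly as follows. This is a minimal sketch, not the project's code: the function names are hypothetical, "globally unique" is assumed to mean an absolute http(s) URL, and "append" is assumed to mean joining the identifier to the feed URL with a `#`.

```python
from typing import Optional
from urllib.parse import urlparse


def is_global_id(value: object) -> bool:
    """Assumption: an id counts as globally unique only if it is an absolute http(s) URL."""
    if not isinstance(value, str):
        return False
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def derived_from_entry(data: dict, feed_url: str) -> Optional[str]:
    """Return the URL to list in derivedFrom for one raw item, or None if nothing usable exists."""
    for key in ("@id", "id"):
        if is_global_id(data.get(key)):
            return data[key]
    identifier = data.get("identifier")
    if identifier is not None:
        # Assumption: the fallback joins feed URL and identifier with '#'.
        return f"{feed_url}#{identifier}"
    return None


def provenance_information(data: dict, feed_url: str, publisher_name: str,
                           parent_data: Optional[dict] = None) -> dict:
    """Build the provenanceInformation block for one NormalisedEvent."""
    entries = [derived_from_entry(d, feed_url) for d in (data, parent_data) if d]
    return {
        "feedUrl": feed_url,
        "publisherName": publisher_name,
        "derivedFrom": [url for url in entries if url is not None],
    }


info = provenance_information(
    {"@id": "https://example.org/events/1"},
    "https://example.org/feed",
    "Example Leisure",
    parent_data={"identifier": 42},
)
```

Here the event has a usable `@id`, while its parent only has an `identifier`, so the parent entry falls back to the feed URL plus identifier.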

Rather than doing a SQL query across 3 tables in the pipes, we may want to consider de-normalising some of the database tables so publisher name and feed url are in their own columns in the raw_data table.
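
To make the trade-off concrete, here is a sketch comparing the three-table join with a de-normalised lookup. It uses an in-memory SQLite database purely as a stand-in; apart from `raw_data` and `data_id`, every table and column name is an assumption for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE publisher (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE feed (id INTEGER PRIMARY KEY, publisher_id INTEGER, url TEXT);
    CREATE TABLE raw_data (
        id INTEGER PRIMARY KEY,
        feed_id INTEGER,
        data_id TEXT,
        -- Hypothetical de-normalised copies, written when the row is stored.
        feed_url TEXT,
        publisher_name TEXT
    );
    INSERT INTO publisher VALUES (1, 'Example Leisure');
    INSERT INTO feed VALUES (1, 1, 'https://example.org/feed');
    INSERT INTO raw_data VALUES
        (1, 1, 'abc123', 'https://example.org/feed', 'Example Leisure');
""")

# Normalised schema: each pipe has to join three tables.
JOINED = """
    SELECT feed.url, publisher.name
    FROM raw_data
    JOIN feed ON feed.id = raw_data.feed_id
    JOIN publisher ON publisher.id = feed.publisher_id
    WHERE raw_data.id = ?
"""

# De-normalised schema: the same values come straight from raw_data.
DENORMALISED = "SELECT feed_url, publisher_name FROM raw_data WHERE id = ?"

joined_row = conn.execute(JOINED, (1,)).fetchone()
flat_row = conn.execute(DENORMALISED, (1,)).fetchone()
```

Both queries return the same pair of values; the cost of the second is that the copies in `raw_data` have to be kept in sync if a feed URL or publisher name changes.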

We need to consider whether we do this in the normalisation pipes, or add it at the end when the RPDE feed is generated.

rhiaro commented 4 years ago

Normalisations

rhiaro commented 4 years ago

Currently the raw data downloader stores the id field from the top level of the RPDE feed (in the data_id db field). These are usually short strings or occasionally UUIDs; I haven't seen a single example with a URI yet. So, most of the time at least, it's clearly not the value of id or @id from the data. In some (most?) cases (eg) the id at the RPDE level is not present in any field in the data at all.

For derivedFrom we need the value of id or @id if present in data.

If we want to add derivedFrom outside of the pipes, we need to store the correct id (pulled from the data) in the database at the time of raw_data storage, so we can look it up for parents. If we add derivedFrom inside the pipes, this shouldn't be a problem.
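
For the "outside the pipes" option, the downloader would have to capture the data-level id when it stores each item. A rough sketch of that step, assuming the standard RPDE item shape (`id`, `state`, `data`); the `data_inner_id` column and both function names are hypothetical:

```python
from typing import Optional


def extract_data_id(rpde_item: dict) -> Optional[str]:
    """Return the value of @id or id from inside data, if either is present."""
    data = rpde_item.get("data") or {}  # deleted items carry no data
    for key in ("@id", "id"):
        value = data.get(key)
        if isinstance(value, str) and value:
            return value
    return None


def raw_data_row(rpde_item: dict, feed_id: int) -> dict:
    """Build the values to store for one RPDE item."""
    return {
        "feed_id": feed_id,
        "data_id": str(rpde_item["id"]),              # existing column: RPDE-level id
        "data_inner_id": extract_data_id(rpde_item),  # hypothetical new column
    }


row = raw_data_row(
    {"id": "abc123", "state": "updated",
     "data": {"@id": "https://example.org/events/1", "name": "Yoga"}},
    feed_id=7,
)
```

With the inner id stored alongside the RPDE id, a parent could later be looked up by the id its child refers to.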

(Though we may still want to revisit whether we bother storing the RPDE id at the top level in the database, because I don't think it's useful for anything.)

nickevansuk commented 4 years ago

If helpful:

RPDE id is just useful for record-level synchronisation between systems when harvesting. It simply identifies records in the same feed, to allow processing of updates. It has no wider relevance.
Everything within data is part of the linked data structure of the opportunity data. @id/id are optional for data publishing, and mandatory when Open Booking API is used.

nickevansuk commented 4 years ago

That said if you wanted to pin-point a particular data item in a feed (such as to help a data publisher debug a validation issue), you would need both the RPDE page URL and the RPDE id.
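
As an illustration of why both values are needed, here is a small sketch that locates one item within an already-fetched RPDE page. The class and function names are made up for this example; the page shape (`items`, `next`) follows the RPDE format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RpdeItemLocator:
    """The pair of values needed to pin-point one record: page URL plus RPDE id."""
    page_url: str
    rpde_id: str


def find_item(page: dict, locator: RpdeItemLocator) -> Optional[dict]:
    """Return the item on this page whose RPDE id matches the locator, if any."""
    for item in page.get("items", []):
        if str(item.get("id")) == locator.rpde_id:
            return item
    return None


page = {
    "next": "https://example.org/feed?afterTimestamp=2&afterId=b",
    "items": [
        {"id": "a", "state": "updated", "data": {"name": "Yoga"}},
        {"id": "b", "state": "deleted"},
    ],
}
locator = RpdeItemLocator("https://example.org/feed?afterTimestamp=0", "a")
item = find_item(page, locator)
```

The RPDE id alone is not enough because the same id can reappear on a later page when the record is updated, which is why the page URL is part of the locator.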

See https://github.com/openactive/conformance-services/issues/14#issuecomment-647491665 for more info

rhiaro commented 4 years ago

Okay, now I understand the RPDE id better, thanks.

@thill-odi The wiki examples are ambiguous. Do you want us to use the id from the RPDE feed (glued to the feed URL) or the id/@id value from the data for the provenance information, in the case where these are different?

thill-odi commented 4 years ago

Hi, @rhiaro: the former (id from the RPDE feed) in preference.

odscjames commented 4 years ago

Can I get us to take a step back here and think about other options?

As I understand it, the user story of this feature is that if you're looking at a piece of normalised data that is excellent or bad in some way, you want to be able to trace it back to the raw data it came from so you can debug?

Is that correct?

If so, how about not adding anything extra to the data, but instead adding a small endpoint to this app, something like /normalised-data/ID/info, where ID is the id field in the piece of normalised data? We store links between database records, so at this point it would be easy to return a blob of JSON with the one or two raw data items it came from, the actual content of those raw items and more info on them.
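
A framework-free sketch of what such a handler might return. The in-memory dictionaries stand in for the database, and the record shapes, field names and function name are all assumptions for illustration only.

```python
import json
from typing import Tuple

# Stand-ins for the database: normalised records link to the raw rows they came from.
NORMALISED = {"n1": {"raw_data_ids": [10, 11]}}
RAW = {
    10: {"data_id": "abc", "feed_url": "https://example.org/feed", "data": {"name": "Yoga"}},
    11: {"data_id": "def", "feed_url": "https://example.org/feed", "data": {"name": "Yoga series"}},
}


def normalised_data_info(normalised_id: str) -> Tuple[int, str]:
    """Handle GET /normalised-data/<id>/info and return (status code, JSON body)."""
    record = NORMALISED.get(normalised_id)
    if record is None:
        return 404, json.dumps({"error": "not found"})
    raw_items = [RAW[raw_id] for raw_id in record["raw_data_ids"]]
    return 200, json.dumps({"id": normalised_id, "raw_data": raw_items})


status, body = normalised_data_info("n1")
```

The response carries the full raw items rather than a summary, so nothing has to be added to the normalised data itself.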

Just to double check with @rhiaro: The id in the normalised data will always be there and will always be unique? (From how the database works I think so)

rhiaro commented 4 years ago

Add provenanceInformation block

thill-odi commented 4 years ago

@odscjames: The API endpoint idea is a good one, provided that it's workable. The point is indeed understanding/debugging data items, so another endpoint with the full representation is preferable to the restricted view in provenanceInformation.