Pipelining workflows across participants

w3c-cg / dataspaces

Other

7 stars 0 forks source link

Pipelining workflows across participants #1

Open pietercolpaert opened 3 months ago

pietercolpaert commented 3 months ago

Challenge Description

Across dataspace participants, the same processing on top of the same datasets are done multiple times. The full ecosystem would benefit from re-use of processing pipelines, or re-use a dataset that already can lower the amount of work that needs to be done to reach the desired end-result.

In order for these ideas to become a reality, we will however need a way for consumers to express a desired outcome based on a data source, and can understand what parts of a workflow to reach that outcome can be outsourced to, that may already have processed that dataset for someone else.

Impact and Importance

There are multiple reasons for reusing parts of the processing of intermediary dataspace participant. Among others:

Trust: intermediary participants might translate a dataset from a highly confidential dataset to a shareable dataset. I.e., raw ANPR cameras (number-plates) towards traffic counts (“5 heavy vehicles passed by in the last hour”).
Cost-efficiency: Building and maintaining a custom server implementation will be more expensive than sharing the costs with other consumers on this part of the data processing.

Desired Solution

A way to express a desired outcome of the dataset
A translation of that desired outcome to one or more processing plans
A way for participants to indicate they can solve a part of the processing plan

It would also be nice if there could be a techno-economic analysis of the business model of such value-adding participants: what kinds of business plans would this way become viable?

Acceptance Criteria

Your proposal includes each of the 3 points in the desired solution.

References and Resources

Processing plans could ring a bell in query processing with e.g., SPARQL algebra
P-Plan and Prov-O
As the data needs to be processed incrementally, also looking Linked Data Event Streams could be interesting
IDSA Rulebook
feel free to comment more ideas

PeterKoen-MSFT commented 3 months ago

Data Intermediaries are not desired in the design principles of Dataspaces and are considered a necessity where required by regulation. While you are right that in a Data Pipeline there might be Intermediaries, in the context of Dataspaces any intermediary is a Dataspace Participant and Data Sharing occurs only peer-to-peer. There is no such thing as a mandatory Dataspace Intermediary (despite different claims by some organizations) as this would inhibit the Sovereignty (Autonomy and Agency) of an individual participant. Any form of intermediation (unless legally required) happens at the level of the business process, not at the level of data sharing in the dataspace.

A Pipeline with multiple parties processing the data in order only exists at the business process level, not at a technical dataspace level. Therefore this discussion should keep this in mind that at the technical architecture level the dataspace protocol, the decentralized claims protocol and any basic profiles assume a peer-to-peer relationship of actors.

However, on a higher level business layer the gap you are pointing out absolutely exists and needs to be addressed. It potentially addresses both sides of the transaction: the originator of the original dataset might want to restrict which intermediaries are acceptable processors and the recipient might want to see provenance data on where and how the data has been processed before it arrived at the consumer. This can be potentially addressed by semantic models for data contract policies and semantic models describing the provenance of data.

Personally I don't think that linked data in any form is the way to go. Any transaction between two peers in the dataspace needs to be able to evaluate policies and execute without further involvement of additional 3rd parties (unless required by regulation). Any data negotiated upon two parties and sharing agreed on through a data sharing contract needs to be under full control of a single participant at the point of negotiation. (however, it might contract policies origination from previous data providers if they are forward facing in the pipeline.)

Please, have a look at the IDSA Rulebook for more background information on the current thinking on the mechanisms of a dataspace.

pietercolpaert commented 3 months ago

Thank you for your comment! It certainly helps me to better understand different angles to dataspaces.

Data Intermediaries are not desired in the design principles of Dataspaces and are considered a necessity where required by regulation. While you are right that in a Data Pipeline there might be Intermediaries, in the context of Dataspaces any intermediary is a Dataspace Participant and Data Sharing occurs only peer-to-peer. There is no such thing as a mandatory Dataspace Intermediary (despite different claims by some organizations) as this would inhibit the Sovereignty (Autonomy and Agency) of an individual participant. Any form of intermediation (unless legally required) happens at the level of the business process, not at the level of data sharing in the dataspace.

I understand why intermediaries are not-desired although being a necessity for trust-related scenarios. I believe however there is also a desired potential cost-efficiency to be found.

I.e.: for the EU Cultural Heritage dataspace, in which Europeana plays the role of an aggregator (which I consider as a type of intermediary), it will be interesting for a consumer to use the intermediary service of Europeana to for example find all works across in Europe by Van Gogh, than the consumer asking each individual museum in Europe about it itself. Europeana could advertise its service as a full-text process on top of multiple source datasets, while a consumer can still decide to use a different aggregator, or to just set-up something on their own.

... the gap you are pointing out absolutely exists and needs to be addressed. It potentially addresses both sides of the transaction: the originator of the original dataset might want to restrict which intermediaries are acceptable processors and the recipient might want to see provenance data on where and how the data has been processed before it arrived at the consumer. This can be potentially addressed by semantic models for data contract policies and semantic models describing the provenance of data.

:+1:

Please, have a look at the IDSA Rulebook for more background information on the current thinking on the mechanisms of a dataspace.

I’ll add this to the references list!

PeterKoen-MSFT commented 3 months ago

This is one of the big misunderstandings in dataspaces, mostly driven by wrong narratives of not technically versed individuals: Intermediaries are detrimental to trust and not actually creating trust.

Trust is created by matching policies (technical or business requirements) to proofs of claims. If enough policies can be matched to existing proofs a threshold of trust can be reached that allows for the sharing of data. Any intermediary weakens this trust as it is an additional man in the middle that can potentially create false matches between policies and claims. However, what is needed are Trust Anchors that can act as the root of certifying the validity of the claims. Those are not direct intermediares of the data sharing process, but rather external oracles that confirm a specific information with a cryptographic proof.

Think of this the way that you are experiencing daily when you surf the www: when you need a secure connection to a website you use https, which is using certificates to encrypt your connecting, and those certificates eventually need to be confirmed by a root certificate at which point the trust chain ends. Trust anchors for dataspaces work exactly the same way.

You can draw the analogy that what the WWW did for human consumption of data on the web, dataspaces are now aiming to provide for machine agents to share data in a trusted way: There is the dataspace protocol which describes how you discover data, negotiate a contract and orchestrate the sharing, then there is the dataspace decentralized claim protocol which describes how you create trust by evaluating the matching of policies and claims to create trust. And then there is the need for semantic models on multiple levels: on the level of policies and claims, on the level of data interoperability and at the level of business processes to orchestrate data pipelines in the ecosystem.

To address the second part of the comment: A data platform that is managing data on behalf of others or is aggregating data can itself be a participant in a dataspace. At that point it acts as the sovereign party of the dataspace on one side of its business and on the other side it acts as the aggregator of data towards its members. If those members need to provide guidance as to how the data shall be used in the dataspace you will need a semantic model at the business layer which allows to express these wishes as policies to be used in the data contract negotiation in the dataspace. And you might need a semantic model expressing data usage policies for the data once shared with other participants in the dataspace.

The important part is to separate between what a participant in a dataspace is on a technical layer and what a participant in a data pipeline on a business layer is. Think of the dataspace just as the trusted pipe enabling the sharing of data, but in no way does it represent the end-to-end data ecosystem. Plenty of other pieces will still be needed (e.g. Enterprise Level Data Management Systems that need to be aware of data sharing contracts (usage policies), Presentation Layers, etc...) to create an end-to-end-solution.

pietercolpaert commented 3 months ago

I meant trust in a different way, as in the example given in the issue above:

As the data source provider, let’s say a local police department, of ANPR data, I do not trust a consumer at the mobility department of the city with raw ANPR data. However, for reaching their goals, the mobility department only needs traffic counts. This can be derived from the ANPR data by using an intermediary services that may be provided by the federal/national police that does this for other local police departments as well. Thanks to the intermediary (which I understand may not qualify to your/IDSA’s exact definition of intermediary) the data source provider can share a derived form of the data with the mobility department.

Think of the dataspace just as the trusted pipe enabling the sharing of data, but in no way does it represent the end-to-end data ecosystem.

I think this is where the W3C CG might want to diversify in scope as I believe we do want to explore the technical aspects of dataspace ecosystems.

PeterKoen-MSFT commented 3 months ago

The service described in your example is not an intermediary in the sense how it is used in the dataspace. It would be yet another participant. This participant can offer an algorithm service. To enhance trust in this service it can be implemented with confidential compute, which can be proven with claims about the processing environment and provable data on runtime environment, running software, etc.

It's important to note that this service is a participant adding a value added optional service to the dataspace. And thus it is up to the sovereignty of other participants whether to trust that service or not and to demand certain treatment of the data and express it in the data sharing contract policies.

To follow through on your example: the police department could have policies in their data sharing contract offer that says that you need to provide a confidential compute environment, storing data at rest at an encrypted data storage, that the encryption keys are externally managed (by the police department) and will only be provided to the confidential compute enclave once it has proven to run a very specific version number of a specific open source project which has been reviewed and deemed trustworthy. Additionally it could contain a policy that the result of the computation can only be shared with another participant bearing a specific cryptographic key, which would then be shared in another data sharing contract with the consumer at the mobility department.

What this example clearly shows is that at the technical level everyone is a participant and data can only be shared peer-to-peer once policy requirements have been met. However, on the level of the business process the data ecosystem might span multiple parties, of which some are acting as "business intermediaries".

Btw. this is also how Observability in dataspaces works: any Auditor/Monitoring Agent is just another participant that would need to negotiate a data contract with the two other parties that are sharing data to receive their log data and be allowed to process it (or how to process it) - yet another problem space that will require a semantic model to provide practical solutions.

Last but not least: some of those problems are already being worked on in the Eclipse Dataspace Working Group Specification projects. It might be worthwhile to check those out and potentially join those projects to reduce redundancies, overlaps and conflicts. Happy to provide more information if you are interested!

pietercolpaert commented 3 months ago

I learned that I will have to avoid using the word data intermediary as of course it quickly gets the connotation of a data intermediation service, while I simply intended to point at a dataspace participant that provides data based on another source. I’ve adapted the issue’s title accordingly.

PeterKoen-MSFT commented 3 months ago

Yes, indeed, the main problem with Intermediary is that the term is hopelessly overloaded and not clearly defined anywhere...

Btw, one more thing I noticed in the figure in your first post: Vocabulary Hub. Following the logic of participants or external anchors of information the Vocabulary Hub can be either a participant who provides specific vocabularies through data sharing contracts, or it can be an external oracle providing a reference source of semantic information. In the IDSA dataspace architecture there is no longer a specific role of a Vocabulary Hub as the problem is perfectly solved with the two options above.

btw: Same for other elements that used to be in older IDSA architectures like the broker, the participant information system of marketplaces: either they are participants that need to adhere to the dataspace rules or they are external providers of public information.

And yes, for ease of implementation the IDSA architecture does accept catalogs as a special role, but in an idealized dataspace architecture every participant provides a catalog of the data sharing contracts offered by that participant and also operates a crawler to find the offerings of other participants.

in that same logic a marketplace or offer search engine would just be yet another participant.

You see: we will need a lot of standardized semantic models to enable all those value added services on top of the generic dataspace participant model! :)

JohannesLipp commented 3 months ago

+1 for @PeterKoen-MSFT's point that creating pipelines of dataspace services (or data offerings and processing of such) is more on a business level than on a dataspace level.
Just to add to the IDSA (e.g. Rulebook) and Eclipse Dataspace Working Group: Gaia-X defines so-called Service Offerings that could be used to describe the input, processing, and output of a (data processing) service offered in a Gaia-X dataspace. See the top (user plane) of this overview picture and the (short) section on Resource and Service Offerings in the Gaia-X Architecture Document. Unfortunately, these definitions are short and high level.

It might still be relevant to describe available (data processing) services in dataspaces and the chaining of such, as motivated by @pietercolpaert. This task could include to dig into the layer model(s) of dataspaces (point 1 in this comment). (Btw @PeterKoen-MSFT are they defined somewhere? In my recent dissertation, I used the layers data, dataspace, and domain). And this task could also include to dig into the descriptions of services offerings (or data offerings, depending on the POV) and map/chain them for a business layer use case.

JohannesLipp commented 3 months ago

Plus, let me support your (somewhat already completed 😇) discussion on intermediaries/participants on different levels:

The recent article "Industrial data ecosystems and data spaces" distinguished these, see this screenshot of their table 2:

Note that this is again/still strictly seen on a dataspace level, not on a use case / business level.

My suggestion from the end of my previous comment would be: Find a way to properly define the (data processing) services on the right of this figure (dataspace), and then find a way to chain them based on their interfaces to address the requirements of a business level use case.

PeterKoen-MSFT commented 3 months ago

@JohannesLipp - we have not started documenting the business layers in IDSA, it's something that would be great to do in the IDSA Rulebook, but I admit that I've been more focused on the technical layer then the business ecosystem.

The way I think about this is that there are "layers" as well as "planes". A Layer is from the perspective of the ecosystem while a plane is the implementation stack to participate in that ecosystem at the Participant.

Business Layers:

Dataspaces
Data Platforms, Data Markets, ...
Regulation / Consortia / Communities
Data Ecosystem

Participant Stack:

Control Plane (Contract Negotiation, Sharing Orchestration)
Data Plane (Actual Data Sharing/Transfer/Exchange)
Management Plane (Enterprise Data Management Systems, Catalogs,...)
Application Plane (Analysis, AI, Presentation,...)

A participant is part of a data ecosystem, which accommodates multiple regulatory regimes, specific rules of the consortium/community, etc..., Data Platforms and Data Markets might offer their services in that Ecosystem, Participants might use them to discover each other... Once two participants want to share data they can use dataspace technologies to negotiate a data sharing contract and use any applicable data technology to share the data.

Two participants use the control plane of the dataspace stack to create trust and negotiate a contract. They use the data plane to share the data (transfer or code2data). The results of the sharing process is a dataset with associated policies which needs to be managed by the participants data management system (e.g. usage policy conformance needs to be assured/monitored), processed by an application (analytics software, AI engine, etc...) at which point additional policies might have to be processed (e.g. do not use data to train AI engine, for research only,...) and last but not least the results need to be processed by an application where it is presented to the user or acted upon by a machine agent to fulfill the desired business case.

That leads me to a feedback regarding the Data Spaces part of the shared graphic: there is one small inaccuracy: within the dataspace the interaction is always 1:1, at the ecosystem level it will mostly be a n:m relationship to enable complex use cases.

Btw: Awesome paper, I've read it already on Monday :)

pietercolpaert commented 3 months ago

I again adapted the challenge description. I moved away from actively letting another participant process the data, but towards the rationale towards not doing the same processing over and over again in the same dataspace, when another dataset in the dataspace is already accessible that can replace a part of the pipeline on the consumer’s side.

Wrt. the technical layer vs. the business ecosystem, and planes vs. layers: I still don’t understand why the distinction is important. Is it because the technical layer is the focus of the Eclipse dataspace working group? Or is there another challenge that is being solved by that? Is that a kind of challenge we could document in a new issue?

PeterKoen-MSFT commented 3 months ago

The reason I'm focusing on separating the technical layer and the business layer is that I'm looking at Dataspaces as a decentralized, multi-agent network. Participants are nodes that enable trusted data sharing by implementing the required, standardized protocols. Those nodes are business model neutral. Dataspaces as a community/consortium implement a specific business model, or at the very least support it. One Participant (node) can be a member of many different dataspaces, and thus participate in many different business models.

Basically like the web: as a participant, you have one browser, but within that browser you can have one tab open to a marketplace, another one to a forum, a third one to a social media site... all different business models, but with a standardized participation through a set of protocols.

Same for Dataspace Participants in the future: You will use a Dataspace Connector (Gateway Service on the edge of your architecture) that will be able to connect to other Participants. Which dataspace and what the business model/community rules are will be decided by the context of the Policies and Claims provided to the Connector. E.g. a manufacturing company can participate in a supply-chain dataspace, an energy one, a regulatory reporting one, etc... like a tab in your browser, just for machine agents instead of humans.