
Delegated computation (proposal) #393

Open renyuneyun opened 2 years ago

renyuneyun commented 2 years ago

Currently, Solid provides storage, and Solid Apps provide computation locally. If a user or an App wants to perform server-side computation (e.g. as illustrated in #390), it must be done through their own solutions and/or own servers. This is neither transparent nor accountable, and is a limitation of the current mechanism.

#390 gives one potential direction for that. This proposal targets slightly different problems and thus arrives at different solutions. Compared with #390, this proposal focuses more on flexibility and could provide more capabilities. It reduces some of the problems in #390, but brings alternative ones.

Note: this proposal has become quite lengthy. You may read “1. Description” and “2. Computation initiation points & use cases” first to get the general idea. You may want to skip “3. Mechanisms and interactions” and “6. Additional benefits and long-term considerations”.

1. Description

This design could be called delegated computation.

Similar to #390, this design expects a standard way to use a server to perform computation. Different from #390, this design does not assume that only the Solid server can do the computation: any trusted and/or compatible server may.

We could call a server providing computation the computation provider (CP), and the party that triggers the computation the computation trigger (CT); the pod is the Solid pod that holds the configuration and the data. Solid Apps also sometimes play a role in the design.

By Solid App, we mean something very similar to the original / current concept of an App. Currently, an App interacts with the Solid server to fetch and store data, but performs computation on its own. In this design, the only difference is that an App can optionally request delegated computation – have some computation performed on a remote server first and then obtain the result.

The next section presents the points where the delegated computation could happen / be initiated, and then presents use cases to illustrate the benefits this proposal brings.

2. Computation initiation points & use cases

There are mainly three types of uses for the delegated computation:

  1. Event-based or scheduled (cron) computation jobs, similar to #390
  2. On-demand computation, initiated by a Solid App
  3. Dynamic generation of resources (transparent to Apps), configured by the user

The data flows in these cases are not identical, but all derive from one basic pattern. This is explained in the next section.

Use cases

At the baseline, this proposal is a modular version of #390, which separates the pod (Solid server) from the computation provider.

But it also enables other use cases. The major ones are illustrated here:

  1. Everything of #390
    1. Cron jobs on the Solid pod
    2. Event-based computation jobs, such as those done by ActivityPub bots
  2. Computation to be performed on a server (CP), then transmitted to the pod or CT, which enables:
    1. Computation performed on any compatible servers
      1. Reducing Solid server's workload (compared to #390)
      2. Allowing flexible choice of servers performing computation
      3. Similar to #390, maybe additional business models for pod providers (different levels of computation / storage capability)
    2. Apps showing better trustworthiness to the user, by anonymizing data (through delegated computation) before the App receives it
      • Requires trusting a CP instead; requires the App to accept such data transmission
    3. Shadowing / transparent pre-processing of data
      • This is done via dynamic generation of resources
      • Better privacy protection from (untrusted) Apps – data is anonymized/desensitized before the App receives it; the App is unaware of this behaviour
    4. Dynamic resource generation
      • Similar to exposing additional API endpoints
      • Currently only doable by extending the server code (e.g. Solid Calendar Store)

3. Mechanisms and interactions

As mentioned earlier, there are three types of computation initiation points, with different interactions and data flows. But in general they all follow this pattern:

  1. CT triggers the delegated computation
  2. The Solid pod selects / finds the designated CP
  3. Computation job is sent to the CP, by the CT
  4. CP performs computation
  5. Status returned to the CT
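
To make this concrete, here is a minimal JavaScript sketch of the pattern from the CT's perspective. Nothing here is a defined protocol: the settings location, the request shape, the status handle, and the pickComputationProvider helper are all assumptions for illustration.

// Hypothetical sketch only – no such protocol is specified yet.
async function delegateComputation(podUrl, jobSpec) {
  // Steps 1–2: consult the pod's CP configuration to find the designated CP
  // (the pod could equally resolve this on its own).
  const settings = await fetch(`${podUrl}/settings/cp`, {
    headers: { Accept: 'text/turtle' },
  }).then((r) => r.text());
  const cpUrl = pickComputationProvider(settings, jobSpec); // imagined helper

  // Step 3: send the computation job to the chosen CP.
  const submitted = await fetch(cpUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'text/turtle' },
    body: jobSpec,
  });

  // Steps 4–5: the CP performs the computation; the CT gets back a status
  // handle it can use to track (and later collect) the result.
  return submitted.headers.get('Location');
}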

Each type is described more specifically in its own subsection below.

Event-based or scheduled jobs

In this setting, the pod also acts as the CT, and potentially as the CP too if following the design in #390. In fact, this is very similar to #390, with the only addition being the ability to specify CPs.

The trigger is specified by the user, similar to what is proposed in #390. Beyond #390, the trigger can additionally specify the desired CP. There are several possible designs, depending on two factors:

  1. Is the trigger specified with the triggering condition, or in a dedicated configuration/setting file (e.g. /settings/prefs)?
  2. Is the trigger one identifier (e.g. URL), or a selector (with conditions)?

Therefore, the data flow is: pod → CP → pod.

On-demand computation

In this setting, the App acts as the CT. The user does not need to do anything beforehand.

At a certain stage in the App (probably before requiring some sensitive data), the App triggers the delegated computation. The rest proceeds as above, except that in this setting the App sends the computation job to the CP. The App (as the CT) receives the status handle, and requests the result data at the appropriate time (e.g. when the job finishes).

There is a design question about where the App receives the result data from: the pod or the CP? This involves two different types of data flows:

  1. pod → CP → pod → App?
  2. pod → CP → App?

Dynamic resource generation / Shadowing data

In this setting, the pod acts as the CT. The user needs to specify the trigger and the computation job beforehand.

One main usage of this is to allow the user to specify pre-processing (as a computation job) of data. When an App requests the data, the computation job is launched first and the data request is put on hold. After the job finishes (or via streaming), the result data (from the computation job) is sent to the App instead of the original data stored in the pod.

This is why it's called shadowing: the original data is shadowed, and the replacement (pre-processed) data is returned instead.

This is essentially the dynamic generation of resources – “dynamic” as in “generated upon request”. The target resource and the original resource do not have to be the same resource.

(One of my personal projects, projfs, is a similar – though not identical – demonstration of the possibility.)

Therefore, this is different from event-based or scheduled jobs (and #390) in how the result data is dealt with: the result is returned to the requester in place of the original resource, rather than only being stored back into the pod.

This is also different from on-demand computation, because the pod – not the App – initiates the job, and the App is unaware of it.

To fulfill the transparency, the data flow should be: pod → CP → pod → App.

Caching may be needed.
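
As a sketch of how a pod implementation might realize this flow (including the caching just mentioned), where findShadowJob, readResource, etagOf, and runOnProvider are purely imagined helpers:

// Hypothetical pod-side handling of a GET on a (possibly) shadowed resource.
async function handleGet(path, cache) {
  const job = findShadowJob(path); // from the resource's metadata, if any
  if (!job) return readResource(path); // not shadowed: serve the stored data

  // Reuse the last result while the underlying resource is unchanged.
  const cached = cache.get(path);
  if (cached && cached.etag === etagOf(path)) return cached.body;

  // Hold the request, run the job on the configured CP (pod → CP → pod),
  // then answer with the job's output instead of the stored data.
  const body = await runOnProvider(job.cp, job.script, job.args);
  cache.set(path, { etag: etagOf(path), body });
  return body;
}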

This can enable additional usages, such as mirroring of data. (Alternative mechanisms may also be introduced for this specific purpose, e.g. sameAs.)

4. Configuration example

The configuration contains two parts:

  1. server specification / setting;
  2. computation job specification.

Remember the two different choices for them mentioned earlier:

  1. Is the trigger specified with the triggering condition, or in a dedicated configuration/setting file (e.g. /settings/prefs)?
  2. Is the trigger one identifier (e.g. URL), or a selector (with conditions)?

In the example, we assume the CPs are specified in a configuration file, as URLs. Assume the relevant terms/classes are under the solid prefix.

Example CP configuration file

Assume the file is /settings/cp:

@prefix solid: <http://www.w3.org/ns/solid/terms#> .

<#cp-setting>
    a solid:ComputationProviderSetting ;
    solid:default <#server-default> .

<#server-default>
    a solid:ComputationProvider ;
    solid:url "https://url.to.provider" .

<#server-alternative>
    a solid:ComputationProvider ;
    solid:url "https://url2.to.provider2" .

This specifies two servers, <#server-default> and <#server-alternative>, and sets <#server-default> as the default CP – the CP used when no explicit CP is specified in the job-side configuration.
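
For illustration, resolving the effective CP for a given job from this file might look like the sketch below, where loadGraph and getObject stand in for whatever RDF tooling the server uses:

// Hypothetical sketch: use the job's explicit solid:cp if present,
// otherwise fall back to the configured default.
const SOLID = 'http://www.w3.org/ns/solid/terms#';

async function resolveProvider(job) {
  const graph = await loadGraph('/settings/cp'); // imagined helper
  const cp = job.cp // e.g. cps:server-alternative from a job specification
    ?? getObject(graph, '/settings/cp#cp-setting', SOLID + 'default');
  return getObject(graph, cp, SOLID + 'url'); // e.g. "https://url.to.provider"
}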

Job-side configuration

These are the specifications of the computation jobs. Each can optionally specify a CP to use instead of the default one.

The schema/ontology is for illustrative purposes; it would need to be carefully designed afterwards.

Event-based or scheduled jobs

This is similar to #390, so I'm borrowing and adapting the example from there. The file is located under a specific folder that the pod recognizes and from which it triggers jobs.

@prefix solid: <http://www.w3.org/ns/solid/terms#> .
@prefix cps: </settings/cp#> .

<#job-1>
    a solid:CronJob ;
    solid:schedule "0 0 * * *" ;
    solid:job </scripts/rss-update.js> ;
    # Optional. Specify the CP that performs the computation job.
    solid:cp cps:server-alternative .

This declares a cron job that runs at 00:00 every day and executes the script /scripts/rss-update.js on the server specified by cps:server-alternative (which refers to the corresponding CP specified above).
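
Section 5 below asks what the API for such scripts could look like. Purely to make the example tangible, under one imagined API the script might read:

// /scripts/rss-update.js – a purely hypothetical job script.
// The `pod` object (read/write access to the user's pod), the args
// convention, and /settings/feeds.json are all assumptions.
export default async function run(pod, args) {
  const feeds = JSON.parse(await pod.read('/settings/feeds.json'));
  for (const url of feeds) {
    const rss = await fetch(url).then((r) => r.text());
    // Store each fetched feed under a per-feed resource in the pod.
    await pod.write(`/rss/${encodeURIComponent(url)}`, rss);
  }
}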

On-demand computation

This should be specified by the App and sent to the pod when making the request. An example of the specification sent to the pod could be:

@prefix solid: <http://www.w3.org/ns/solid/terms#> .

<#req>
    a solid:DelegatedComputationRequest ;
    solid:from "APP-IDENTIFIER" ;
    solid:computation <#job> .

<#job>
    a solid:DelegatedComputation ;
    solid:jobSpec [
        a solid:ComputationJob;
        solid:job <Pointer-to-Some-Job-Specification>
        ] .

This identifies the App itself and specifies the computation job as <Pointer-to-Some-Job-Specification>. The actual job should be specified by the App and cannot be assumed to exist on the user's pod. Maybe a standard protocol is needed to initialize the environment and perform the job.
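
As a sketch of what such a protocol might look like from the App's side – the endpoint path, status document, and polling scheme are all invented here – following the pod → CP → pod → App data flow:

// Hypothetical App-side flow for on-demand delegated computation.
async function requestComputation(podUrl, requestTurtle) {
  // Send the DelegatedComputationRequest document shown above to the pod.
  const res = await fetch(`${podUrl}/computation`, { // invented endpoint
    method: 'POST',
    headers: { 'Content-Type': 'text/turtle' },
    body: requestTurtle,
  });
  const statusUrl = res.headers.get('Location'); // the status handle

  // Poll the status handle, then fetch the result once the job finishes.
  for (;;) {
    const status = await fetch(statusUrl).then((r) => r.json());
    if (status.state === 'finished') {
      return fetch(status.result).then((r) => r.text());
    }
    await new Promise((resolve) => setTimeout(resolve, 1000));
  }
}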

Dynamic resource generation / Shadowing data

The specification file is located either under a specific folder (similar to that for event-based or scheduled jobs), or as metadata attached to a resource (similar to how ACLs function). I would prefer the second way.

@prefix solid: <http://www.w3.org/ns/solid/terms#> .
@prefix cps: </settings/cp#> .

<#job-1>
    a solid:ComputationUponRequest ;
    solid:job </scripts/remove-name-from-input.js> ;
    solid:jobArgs ["/path/to/the/resource"] ;
    solid:targetResource </path/to/the/resource> .

This specifies a delegated computation job that will be triggered when the resource /path/to/the/resource is accessed. The job is </scripts/remove-name-from-input.js>, and its argument is "/path/to/the/resource". The job removes “names” from the input (its first argument) and produces output. That output is sent to anyone requesting /path/to/the/resource, instead of the original resource.
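
Under the same imagined script API as in the cron example above – and assuming, purely for illustration, that the input resource is a JSON array of records with a name field – the job itself might look like:

// /scripts/remove-name-from-input.js – hypothetical sketch of the job.
// `args[0]` carries the value from solid:jobArgs.
export default async function run(pod, args) {
  const records = JSON.parse(await pod.read(args[0]));
  // Strip the "name" field before the data ever reaches the requester.
  for (const record of records) delete record.name;
  return JSON.stringify(records); // served in place of the original resource
}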

This could in principle also allow resources to be generated dynamically (while not storing them at that resource location in the pod). For example, consider this slightly modified specification:

@prefix solid: <http://www.w3.org/ns/solid/terms#> .
@prefix cps: </settings/cp#> .

<#job-1>
    a solid:ComputationUponRequest ;
    solid:job </scripts/remove-name-from-input.js> ;
    solid:jobArgs ["/path/to/the/resource.ori"] ;
    solid:targetResource </path/to/the/resource> .

This is almost identical to the original one, but with the argument changed. Thus, the job reads from the resource /path/to/the/resource.ori, but its result is served as the resource /path/to/the/resource.

5. Questions / Problems

The major questions are similar to the questions/problems in #390. Most of the concepts / discussions can be directly borrowed from there, and most of the notation/language and protocol designs for #390 can also be used in this proposal. The questions inspired by #390 are:

  1. How to prevent abuse (in the CP)
  2. What are the security concerns when sending data to CPs?
  3. What should be the environment in the CPs?
    1. How to make the environment consistent in different CPs, and also customizable for different jobs?
    2. Maybe borrowing some ideas from CI/CD systems?
  4. What could the API look like for the scripts?
  5. How and when should the user authorize the computation?

Note that the introduction of separate computation providers has different implications for server considerations and for security/trust:

  1. The Solid server does not need to consider abuse, because computation is performed on the CP; of course, the CP still needs to consider abuse.
    1. The Solid server can choose not to act as a CP at all.
    2. The CP can have its own abuse policy, not constrained by the goals of a Solid store. Different CPs can have different policies.
  2. The user needs to place trust in the CP.
    1. Previously, when computation was provided by the Solid server, this was not a big issue – the user must trust the Solid server anyway, otherwise data wouldn't be stored there.
    2. Now, an explicit mechanism to choose CPs and specify trust in them is needed.

There are new questions for this design:

  1. How to authenticate the user with the CP? (By WebID obviously, but what protocol?) 
  2. What protocol change is needed for Apps to perform on-demand delegated computation?
  3. The CP may have malicious behaviours, such as retaining the user's data.
    1. This is why the user needs to choose a trusted CP.
    2. This may be overcome by technologies like (secure) multi-party computation. (A member in our group has this as the research direction.)
  4. How to prevent Apps from over-using delegated computation, with hidden data-retention behaviours in the jobs?
    1. Maybe not a new issue, because the computation job is performed either in an App or on a CP – if they wanted to do this, an App could also do it without delegated computation.

6. Additional benefits and long-term considerations

(Secure) Multi-party computation

This (or something similar to this) is a prerequisite for enabling (secure) multi-party computation (MPC) in a trusted and generic manner (in terms of choosing the intermediate servers). The intermediate servers holding the (partial and/or augmented/manipulated) data can be called data intermediaries.

Currently, it is possible for each individual App to design its own mechanism for doing MPC. This relies on trusting the App developer and their chosen data intermediaries. I wouldn't expect Apps to offer a choice of alternative data intermediaries – even if they wanted to, the lack of a generic protocol makes this infeasible.

With this mechanism, there would be a standard protocol for sending jobs to data intermediaries, and building data intermediaries would be decoupled from developing an App. With an extension to the configuration, the user could specify the expected intermediaries in his/her settings files, and the App would only need to follow the protocol specification to send computation jobs and receive results – very similar to the on-demand computation discussed above.

In addition, there could be some (though not all) standardized MPC jobs for the user and the App developers to choose from. This provides better transparency, accountability and therefore trust between them.

Data policy

Data policy is a long-term consideration in this setting. ACLs may still be enough for now, but things get complicated if the delegated computation generates data – what policy should the generated data have? None / a default policy? One pre-specified by the user as an ACL? The same as the input data (if any)? One derived from the input data based on the computation performed?

This is not critical yet, but the limitation can be foreseen – what if multiple data sources / users are involved in the same job, and what if the data flow has multiple stages owned by different vendors? These issues apply equally to Apps and to delegated computation. I would consider introducing a policy language that supports dynamic change and compliance checking over arbitrary DAGs. (This is my research background – happy to discuss more, but it deviates from this proposal.)

langsamu commented 2 years ago

Somewhat relevant: I've created a vocabulary and an engine for expressing and executing programming-language-independent expression trees.

View the ontology on WebVowl.

Though the vocabulary follows the System.Linq.Expressions information model of expression trees, and the engine is implemented in .NET, I think the ideas are transferable.

For example, this graph

@prefix : <http://example.com/> .
@prefix xt: <http://example.com/ExpressionTypes/> .

:fibonacci 
    :parameterType [
        :typeName "System.Func`2" ;
        :typeArguments (
            [
                :typeName "System.Int64" ;
            ]
            [
                :typeName "System.Int64" ;
            ]
        ) ;
    ] ;
.

:n 
    :parameterType [
        :typeName "System.Int64" ;
    ] ;
.

:s
    :blockVariables (
        :fibonacci
    ) ;
    :blockExpressions (
        [
            :binaryExpressionType xt:Assign ;
            :binaryLeft :fibonacci ;
            :binaryRight [
                :lambdaParameters (
                    :n
                ) ;
                :lambdaBody [
                    :conditionTest [
                        :binaryExpressionType xt:LessThan ;
                        :binaryLeft :n ;
                        :binaryRight [
                            :constantValue 2 ;
                        ] ;
                    ] ;
                    :conditionIfTrue :n ;
                    :conditionIfFalse [
                        :binaryExpressionType xt:Add ;
                        :binaryLeft [
                            :invokeExpression :fibonacci ;
                            :invokeArguments (
                                [
                                    :binaryExpressionType xt:Subtract ;
                                    :binaryLeft :n ;
                                    :binaryRight [
                                        :constantValue 2 ;
                                    ] ;
                                ]
                            ) ;
                        ] ;
                        :binaryRight [
                            :invokeExpression :fibonacci ;
                            :invokeArguments (
                                [
                                    :unaryExpressionType xt:Decrement ;
                                    :unaryOperand :n ;
                                ]
                            ) ;
                        ] ;
                    ] ;
                ] ;
            ] ;
        ]
        [
            :invokeExpression :fibonacci ;
            :invokeArguments (
                [
                    :constantValue 8 ;
                ]
            ) ;
        ]
    ) ;
.

represents the same expression tree as this code

let fibonacci = n => {
    if (n < 2)
        return n
    else
        return fibonacci(n - 2) + fibonacci(--n)
}

fibonacci(8)
kjetilk commented 2 years ago

I haven't had time to read the full proposal, but just to acknowledge that it is an important topic that hasn't yet found a good home, I created a label for such issues. We clearly need to support computation over pod data in sophisticated ways and even though most of the intelligence is assumed to be on the client side, there needs to be support in the cases where this is not practical.

renyuneyun commented 2 years ago

@langsamu Wow. That sounds like a nice starting point for passing computation jobs to CPs. Is it just another way of presenting the code, or is it language-agnostic? I see "block", "expression", etc. in the example, so I presume it assumes some language features (which may not be available in other languages)?

Speaking more broadly, I can see several possibilities (with different pros and cons) for how to describe the computation jobs: directly passing code/scripts; AWS Lambda (mentioned in #390); WebAssembly; yours (or something similar)... That inevitably leads to a question also mentioned in #390: some form of accounting of the job is needed. Free-form code is the most disadvantageous in that aspect. I do not know if your engine has any considerations in that aspect?

langsamu commented 2 years ago

Is it just another way of presenting the code, or is it language-agnostic? I see "block", "expression", etc. in the example, so I presume it assumes some language features (which may not be available in other languages)?

The vocabulary is language agnostic, but it does follow a specific information model. The engine parses an RDF graph into an object graph of in-memory instances of that information model. That resulting object graph can be converted (by other means) into executable code.

An obvious alternative is source code generation from the RDF graph.

In either case there is, as you say, a dependency between the target language features used and the shape of the RDF graph.

some form of accounting of the job is needed. [...] I do not know if your engine has any considerations in that aspect?

None.