renyuneyun opened this issue 2 years ago
Somewhat relevant: I've created a vocabulary and an engine for expressing and executing programming-language-independent expression trees.
View the ontology on WebVowl.
Though the vocabulary follows the System.Linq.Expressions information model of expression trees, and the engine is implemented in .NET, I think the ideas are transferable.
For example, this graph
@prefix : <http://example.com/> .
@prefix xt: <http://example.com/ExpressionTypes/> .

:fibonacci
    :parameterType [
        :typeName "System.Func`2" ;
        :typeArguments (
            [
                :typeName "System.Int64" ;
            ]
            [
                :typeName "System.Int64" ;
            ]
        ) ;
    ] ;
.

:n
    :parameterType [
        :typeName "System.Int64" ;
    ] ;
.

:s
    :blockVariables (
        :fibonacci
    ) ;
    :blockExpressions (
        [
            :binaryExpressionType xt:Assign ;
            :binaryLeft :fibonacci ;
            :binaryRight [
                :lambdaParameters (
                    :n
                ) ;
                :lambdaBody [
                    :conditionTest [
                        :binaryExpressionType xt:LessThan ;
                        :binaryLeft :n ;
                        :binaryRight [
                            :constantValue 2 ;
                        ] ;
                    ] ;
                    :conditionIfTrue :n ;
                    :conditionIfFalse [
                        :binaryExpressionType xt:Add ;
                        :binaryLeft [
                            :invokeExpression :fibonacci ;
                            :invokeArguments (
                                [
                                    :binaryExpressionType xt:Subtract ;
                                    :binaryLeft :n ;
                                    :binaryRight [
                                        :constantValue 2 ;
                                    ] ;
                                ]
                            ) ;
                        ] ;
                        :binaryRight [
                            :invokeExpression :fibonacci ;
                            :invokeArguments (
                                [
                                    :unaryExpressionType xt:Decrement ;
                                    :unaryOperand :n ;
                                ]
                            ) ;
                        ] ;
                    ] ;
                ] ;
            ] ;
        ]
        [
            :invokeExpression :fibonacci ;
            :invokeArguments (
                [
                    :constantValue 8 ;
                ]
            ) ;
        ]
    ) ;
.
represents the same expression tree as this code
let fibonacci = n => {
    if (n < 2)
        return n
    else
        return fibonacci(n - 2) + fibonacci(--n)
}
fibonacci(8)
I haven't had time to read the full proposal, but just to acknowledge that it is an important topic that hasn't yet found a good home, I created a label for such issues. We clearly need to support computation over pod data in sophisticated ways, and even though most of the intelligence is assumed to be on the client side, there needs to be support for the cases where this is not practical.
@langsamu Wow. That sounds like a nice starting point for passing computation jobs to CPs. Is it just another way of presenting the code, or is it language-agnostic? I see "block", "expression", etc. in the example, so I presume it assumes some language features (which may not be available in other languages)?
Speaking more broadly, I could see several possibilities (with different pros and cons) for how to describe the computation jobs: directly passing code/scripts; #390 mentioned AWS Lambda; WebAssembly; yours (or something similar)... That inevitably leads to a question also mentioned in #390: some form of accounting of the job is needed. Free-form code is the most disadvantageous in that respect. I do not know if your engine has any considerations in that regard?
Is it just another way of presenting the code, or is it language-agnostic? I see "block", "expression", etc. in the example, so I presume it assumes some language features (which may not be available in other languages)?
The vocabulary is language agnostic, but it does follow a specific information model. The engine parses an RDF graph into an object graph of in-memory instances of that information model. That resulting object graph can be converted (by other means) into executable code.
An obvious alternative is source code generation from the RDF graph.
In either case there is, as you say, a dependency between the target language features used and the shape of the RDF graph.
some form of accounting of the job is needed. [...] I do not know if your engine has any considerations in that regard?
None.
Currently, Solid provides storage, and Solid Apps provide computation locally. If a user or an App wants to perform server-side computation (e.g. the cases illustrated in #390), it has to be done through their own solutions and/or on their own server. This is neither transparent nor accountable, and is a limitation of the current mechanism.
#390 gives one potential direction for that. This proposal targets slightly different problems and thus comes up with different solutions. Compared with #390, this proposal focuses more on flexibility and could provide more capabilities. It reduces some of the problems in #390, but brings alternative ones.
Note: this proposal has become quite lengthy. You may read "1. Description" and "2. Computation initiation points & use cases" first to get the general idea. You may want to skip "3. Mechanisms and interactions" and "6. Additional benefits and long-term considerations".
1. Description
This design could be called delegated computation.
Similar to #390, this design expects a standard way to use a server to perform computation. Different from #390, this design does not assume that only the Solid server can do the computation: any trusted and/or compatible server can.
We could call a server providing computation the computation provider (CP), and the party that triggers the computation the computation trigger (CT); the pod is the one holding the configuration and the data. Solid Apps also sometimes play a role in the design.
The next section presents the points where the delegated computation could happen / be initiated, and then presents use cases to illustrate the benefits this proposal brings.
2. Computation initiation points & use cases
There are mainly three types of uses for the delegated computation: event-based or scheduled jobs, on-demand computation, and dynamic resource generation (shadowing data).
The data flows in these cases are not completely the same, but they all derive from one basic pattern. This is explained in the next section.
Use cases
At the baseline, this proposal is a modular version of #390, which separates the pod (Solid server) from the computation provider.
But it also enables other use cases. The major ones are illustrated here:
3. Mechanisms and interactions
As mentioned earlier, there are three types of computation initiation points. They have different interactions and data flows, but in general they all follow this pattern: the CT sends the computation job to the CP; the CP reads the relevant data from the pod and performs the computation; the result is returned to the pod and/or the App, and the CT receives a handle to track the job status.
Each type is described in more detail in its own subsection below.
Event-based or scheduled jobs
In this setting, the pod also acts as the CT, and potentially as the CP too if taking the design in #390. In fact, this is very similar to #390, with the only addition being the ability to specify CPs.
The trigger is specified by the user, similar to what is proposed in #390. In addition to #390, the trigger can also specify the desired CP. There are several possible designs, depending on two factors: are CPs identified as URLs or in some other way, and are they specified in each job configuration or in a central configuration file (e.g. /settings/prefs)?

Either way, the data flow is: pod → CP → pod.
On-demand computation
In this setting, the App acts as the CT. The user does not need to do anything beforehand.
At a certain stage in the App (probably before requiring some sensitive data), the App triggers the delegated computation. The rest is the same as above, except that in this setting the App sends the computation job to the CP. The App (as the CT) receives the status handle, and requests the result data at the appropriate time (e.g. when the job finishes).
There is a design question about where the App receives the result data from: the pod or the CP? That involves two different types of data flows: pod → CP → pod → App (the result is written back to the pod first), or pod → CP → App (the App receives the result directly from the CP).
Dynamic resource generation / Shadowing data
In this setting, the pod acts as the CT. The user needs to specify the trigger and the computation job beforehand.
One main usage of this is to allow the user to specify pre-processing (as a computation job) of data. So when an App requests the data, the computation job is launched first and the data request is put on hold. After the job finishes (or via streaming), the result data (from the computation job) is sent to the App instead of the original data stored in the pod.
This is why it is called shadowing: the original data is shadowed, and the replacement (pre-processed) data is returned instead.
This is essentially dynamic generation of resources – "dynamic" at request time. The target resource and the original resource do not have to be the same resource.
(One of my personal projects projfs can be a similar (but not exactly the same) demonstration of the possibility.)
Therefore, this is different from event-based or scheduled jobs (and from #390) in how the result data is dealt with:
This is also different from on-demand computation, because:
To preserve transparency, the data flow should be: pod → CP → pod → App.
Caching may be needed.
This can enable additional usages, such as mirroring of data. (Alternative mechanisms may also be introduced for this specific purpose, e.g. sameAs.)

4. Configuration example
The configuration contains two parts: the CP configuration and the job-side configuration.
Remember the two different choices for them mentioned earlier: are CPs identified as URLs or in some other way, and are they specified in each job configuration or in a central configuration file (e.g. /settings/prefs)?

In the example, we assume the CPs are specified in a configuration file, and as URLs. Assume the relevant terms/classes are under the solid prefix.

Example CP configuration file
Assume the file is /settings/cp:
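(A minimal sketch; the term names and endpoint URLs below are illustrative assumptions only, not an existing vocabulary.)

@prefix solid: <http://www.w3.org/ns/solid/terms#> .

<#server-default>
    a solid:ComputationProvider ;                          # hypothetical class
    solid:computationEndpoint <https://cp.example.org/> .  # placeholder endpoint URL

<#server-alternative>
    a solid:ComputationProvider ;
    solid:computationEndpoint <https://cp-alt.example.org/> .

# Hypothetical property marking the CP to use when a job names none.
<> solid:defaultComputationProvider <#server-default> .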
This specifies two servers: <#server-default> and <#server-alternative>; it also sets <#server-default> as the default CP – the CP used when no explicit CP is specified in the job-side configuration.

Job-side configuration
These are the specifications of the computation jobs. They can optionally specify a CP instead of the default one.
The schema/ontology is for illustrative purposes. It needs to be carefully designed afterwards.
Event-based or scheduled jobs
This is similar to #390, so I'm borrowing and adapting the example from there. The file is located under a specific folder that the pod recognizes and from which it triggers jobs.
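A minimal sketch of what such a job file could look like follows; solid:ScheduledJob, solid:schedule, solid:script and solid:computationProvider are illustrative terms only, and cps: is assumed to resolve to the CP configuration file above.

@prefix solid: <http://www.w3.org/ns/solid/terms#> .
@prefix cps: </settings/cp#> .   # assumed prefix for the CP configuration file

<#rss-update>
    a solid:ScheduledJob ;                      # hypothetical class
    solid:schedule "0 0 * * *" ;                # cron expression: every day at 00:00
    solid:script </scripts/rss-update.js> ;     # the script to execute
    solid:computationProvider cps:server-alternative .   # overrides the default CP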
This declares a cron job that runs at 00:00 every day, which will execute the script /scripts/rss-update.js on the server specified by cps:server-alternative (which refers to the corresponding CP specified above).

On-demand computation
This should be specified by the App, and sent to the pod when making the request. An example of the specification sent to the pod could be:
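(A minimal sketch; solid:OnDemandJobRequest, solid:app and solid:job are illustrative terms only.)

@prefix solid: <http://www.w3.org/ns/solid/terms#> .

<#request>
    a solid:OnDemandJobRequest ;                      # hypothetical class
    solid:app <https://app.example/profile#this> ;    # identifies the requesting App (placeholder identifier)
    solid:job <Pointer-to-Some-Job-Specification> .   # placeholder kept from the text below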
This identifies the App itself, and specifies the computation job as <Pointer-to-Some-Job-Specification>. The actual job should be specified by the App, and cannot be assumed to exist on the user's pod. A standard protocol may be needed to initialize the environment and perform the job.

Dynamic resource generation / Shadowing data
The specification file is located either under a specific folder (similar to that for event-based or scheduled jobs), or as metadata attached to a resource (similar to how ACLs function). I would prefer the second way.
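A minimal sketch of such a specification (solid:shadowedBy, solid:ShadowingJob, solid:script and solid:arguments are illustrative terms only) could be:

@prefix solid: <http://www.w3.org/ns/solid/terms#> .

</path/to/the/resource>
    solid:shadowedBy [                                     # hypothetical property
        a solid:ShadowingJob ;                             # hypothetical class
        solid:script </scripts/remove-name-from-input.js> ;
        solid:arguments ( </path/to/the/resource> )        # the original resource is the job's input
    ] .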
This specifies a delegated computation job that will be triggered when accessing the resource /path/to/the/resource. The job is </scripts/remove-name-from-input.js>, and its argument is </path/to/the/resource>. The job removes "names" from the input (the first argument) and produces output. The output is sent to anyone requesting the resource /path/to/the/resource, instead of the original resource.

This could in principle also allow resources to be generated dynamically (while not storing them at that location in the pod). For example, this slightly modified specification:
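(Again a sketch with the same illustrative terms; only the argument differs.)

@prefix solid: <http://www.w3.org/ns/solid/terms#> .

</path/to/the/resource>
    solid:shadowedBy [
        a solid:ShadowingJob ;
        solid:script </scripts/remove-name-from-input.js> ;
        solid:arguments ( </path/to/the/resource.ori> )    # read from the ".ori" resource instead
    ] .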
This is almost identical to the original one, but with the argument changed. Thus, the job reads from the resource /path/to/the/resource.ori, but the result is served as the resource /path/to/the/resource.

5. Questions / Problems
The major questions are similar to the questions/problems in #390. Most of the concepts / discussions can be directly borrowed from there. Most of the notation/language and protocol designs for #390 can also be used in this proposal. The questions inspired by #390 are:
Note that with the introduction of separate computation providers, the implications for server considerations and security/trust are different:
There are new questions for this design:
6. Additional benefits and long-term considerations
(Secure) Multi-party computation
This (or something similar to this) is a prerequisite for enabling (secure) multi-party computation (MPC) in a trusted and generic manner (in terms of choosing the intermediate servers). The intermediate servers holding the (partial and/or augmented/manipulated) data can be called data intermediaries.
Currently, each individual App can design its own mechanism for doing MPC. This relies on trusting the App developer and their chosen data intermediaries. I wouldn't expect Apps to provide "choices" of alternative data intermediaries – even if they wanted to, the lack of a generic protocol makes this infeasible.
With this mechanism, there would be a standard protocol for sending jobs to data intermediaries, and building data intermediaries would be decoupled from developing an App. With an extension to the configuration, the user could specify the expected intermediaries in their settings files, and the App would only need to follow the protocol specification to send computation jobs and receive results – very similar to the on-demand computation discussed above.
In addition, there could be some (though certainly not all) standardized MPC jobs for users and App developers to choose from. This creates better transparency and accountability, and therefore trust between them.
Data policy
Data policy is a long-term consideration in this setting. ACLs may still be enough for now, but things become complicated if the delegated computation generates data – what policy should the generated data have? No / default policy? Pre-specified by the user as an ACL? The same as the input data (if any)? Revised from the input data based on the computation performed?
This is not critical yet, but the limitation can be foreseen – what if multiple data sources / users are involved in the same job, and what if the data flow has multiple stages owned by different vendors? These questions apply equally to Apps and to delegated computation. I would consider introducing a policy language supporting dynamic change and compliance checking over arbitrary DAGs. (This is my research background; happy to discuss more, but it deviates from this proposal.)