Input stream as source

dachafra commented 4 years ago

issue: not possible to describe an input stream as a source

suggestion: add support for describing input streams. This will facilitate the usage of RML in transformation pipelines.

marioscrock commented 3 years ago

Naive solution proposed here https://github.com/RMLio/rmlmapper-java/pull/34

However, my suggestion is to "shift" the problem to the RML processor: allow the user to define custom logic for processing the rml:source of the logical sources defined in the mapping file, and build an abstraction that lets the processor access the records. In the case of the RMLMapper, for example, this means defining a custom AccessFactory and/or Access. In Chimera, we implemented an alternative AccessFactory that can access InputStreams but can also bypass the defined rml:source and apply the mappings to the entire message body (https://github.com/cefriel/chimera/tree/master/chimera-rml/src/main/java/com/cefriel/chimera/rml).

Generally speaking, I think it is important to let users define mappings that are not bound to specific sources (e.g. to a specific file name).
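
A minimal, language-neutral sketch of the idea above, written in Python. The names `Access`, `FileAccess`, `MessageBodyAccess`, and `MessageBodyAccessFactory` are hypothetical; they only loosely mirror the RMLMapper and Chimera classes mentioned and are not those projects' actual APIs.

```python
import io
from typing import BinaryIO, Protocol


class Access(Protocol):
    """Hypothetical abstraction: something the processor can read records from."""

    def open(self) -> BinaryIO: ...


class FileAccess:
    """Default behaviour: treat the rml:source value as a file path."""

    def __init__(self, path: str) -> None:
        self.path = path

    def open(self) -> BinaryIO:
        return open(self.path, "rb")


class MessageBodyAccess:
    """Alternative: ignore the declared source and read an in-memory message."""

    def __init__(self, body: bytes) -> None:
        self.body = body

    def open(self) -> BinaryIO:
        return io.BytesIO(self.body)


class MessageBodyAccessFactory:
    """Hypothetical factory: every rml:source resolves to the current message body."""

    def __init__(self, body: bytes) -> None:
        self.body = body

    def get_access(self, declared_source: str) -> Access:
        # The declared source is deliberately bypassed here.
        return MessageBodyAccess(self.body)


factory = MessageBodyAccessFactory(b'{"station": [{"id": 1}]}')
with factory.get_access("stations.json").open() as stream:
    payload = stream.read()
```

The mapping still names `stations.json`, but the pipeline decides at runtime where the bytes come from, which is what keeps the mapping independent of a specific file name.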

dachafra commented 2 years ago

@DylanVanAssche is this already allowed with the LogicalSource spec?

pmaria commented 2 years ago

FYI: CARML supports this with an extension of rml:source https://github.com/carml/carml#input-stream-extension

DylanVanAssche commented 2 years ago

The spec is now available at https://github.com/kg-construct/rml-target-source-spec. Currently, only Targets are defined there, but it can easily be extended to Sources as well.

> FYI: CARML supports this with an extension of rml:source https://github.com/carml/carml#input-stream-extension

Great! This would be allowed in Sources & Targets because CARML's extension is just a different part for rml:source and rmlt:target. The spec was intended to support these things through the DataIO spec: https://rml.io/specs/dataio. The RMLStreamer also has similar extensions: https://github.com/RMLio/RMLStreamer#processing-a-stream

My suggestion: we need to discuss an extension to something like the DataIO spec that defines what a stream should look like in the mapping rules.

DylanVanAssche commented 2 years ago

Relevant ontology: https://github.com/streamreasoning/vois

Which stream sources do we want to support?

Kafka
MQTT
WebSocket
CoAP
Server Sent Events
TCP
...

Characteristics to consider:

Windows for joins
Specifying the content (JSON, XML, etc.) inside the stream
Restarts?
...

Please comment here on which of these sources and characteristics definitely need to be supported.
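
To make the "windows for joins" characteristic concrete, here is a small sketch of a tumbling-window join in Python. It is an illustration only: the record layout, the function name `tumbling_window_join`, and the sample data are assumptions, and a real stream processor would handle records incrementally rather than from finite lists.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[float, str, str]  # (timestamp in seconds, join key, payload)


def tumbling_window_join(
    left: Iterable[Record], right: Iterable[Record], window_seconds: float
) -> List[Tuple[str, str, str]]:
    """Join records that share a key and fall into the same fixed-size window."""
    # Index the right-hand records by (window number, key).
    index: Dict[Tuple[int, str], List[str]] = defaultdict(list)
    for timestamp, key, payload in right:
        index[(int(timestamp // window_seconds), key)].append(payload)

    joined = []
    for timestamp, key, payload in left:
        window = int(timestamp // window_seconds)
        for other in index.get((window, key), []):
            joined.append((key, payload, other))
    return joined


stations = [(1.0, "s1", "Station One"), (12.0, "s2", "Station Two")]
readings = [(3.0, "s1", "21.5"), (25.0, "s2", "19.0")]
pairs = tumbling_window_join(stations, readings, window_seconds=10.0)
```

Only the `s1` records share a window here; the `s2` records have the same key but arrive in different windows and are therefore not joined. This is why the window definition would need to be part of any stream description.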

DylanVanAssche commented 2 years ago

I have been investigating this further and I think we can just 'extend' the Web of Things description :) Web of Things provides a way to extend their specification with flavor specific things with Web of Things Binding Templates: https://www.w3.org/TR/wot-binding-templates.

They already have binding templates for:

HTTP
CoAP
MQTT streams

My suggestion is to re-use Web of Things descriptions and define binding templates for other streams we may want as listed above. This way, we don't need to define this vocabulary ourselves, we re-use standardized specifications and are open for the future.

pmaria commented 2 years ago

Could you give an example what this would look like for a simple input stream/pipe of data passed at runtime?

DylanVanAssche commented 2 years ago

@pmaria

There are currently no binding templates for a pipe, but I don't see any requirements here that are specific to a pipe compared to the others, so we don't need a binding template.

We only need a proper URL with a scheme that defines a pipe, similar to tcp://, although I don't know of a scheme definition for pipes. However, pipes are just files, so named pipes are easily described as file:///path/to/named_pipe:

<#TriplesMap> a rr:TriplesMap;
  rml:logicalSource [ a rml:LogicalSource;
    rml:source [ a td:Thing;
      td:hasPropertyAffordance [ a td:PropertyAffordance;
          td:hasForm [
            # URL and content type
            hctl:hasTarget "file:///path/to/named_pipe";
            hctl:forContentType "application/json";
            # Read only
            hctl:hasOperationType td:readproperty;
          ];
       ];
    ];
    rml:referenceFormulation ql:JSONPath;
    rml:iterator "$.station.[*]";
  ];

  # Subject, Predicate, Object Maps

At least this works fine under Linux where everything is a file, I'm curious how these things are handled on Windows & Mac.
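
The "pipes are just files" point can be illustrated with a short Python sketch, assuming a POSIX system (`os.mkfifo` is not available on Windows). The pipe location and the sample JSON are made up for the example.

```python
import json
import os
import tempfile
import threading

# Create a named pipe (FIFO) in a temporary directory; POSIX only.
pipe_dir = tempfile.mkdtemp()
pipe_path = os.path.join(pipe_dir, "named_pipe")
os.mkfifo(pipe_path)


def produce() -> None:
    # Opening for writing blocks until a reader opens the other end.
    with open(pipe_path, "w", encoding="utf-8") as pipe:
        json.dump({"station": [{"id": 1}, {"id": 2}]}, pipe)


writer = threading.Thread(target=produce)
writer.start()

# The consumer reads the pipe with the ordinary file API.
with open(pipe_path, "r", encoding="utf-8") as pipe:
    document = json.load(pipe)

writer.join()
os.remove(pipe_path)
os.rmdir(pipe_dir)

stations = document["station"]
```

The reader never needs to know it is dealing with a pipe, which is why a plain file:// target is enough in this case.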

pmaria commented 2 years ago

That honestly looks quite complex only to specify the location of a source. Don't you think? I mean hctl:forContentType and hctl:hasOperationType don't seem necessary for our case. That's already 2/3 of the description.

> However, pipes are just files so named pipes are easily described as file:///path/to/named_pipe

Right, usually you would use an input stream for file IO, but it can also be used as an intermediary stream from one internally running process to another, in which case we're not dealing with files. It could also be a different type of object, e.g. a compiled JSON object.

So it would indeed be good to have some way of saying that the user will provide the source to the engine running the mapping, as @marioscrock suggested in https://github.com/kg-construct/rml-target-source-spec/issues/2#issuecomment-1063752019. We would still need to define a way to specify that.

DylanVanAssche commented 2 years ago

> That honestly looks quite complex only to specify the location of a source. Don't you think? I mean hctl:forContentType and hctl:hasOperationType don't seem necessary for our case. That's already 2/3 of the description.

Hmmmmm you're right. I copied them from the WoT paper, but in the end you don't really need them: hctl:forContentType is redundant because we have rml:referenceFormulation for that, and hctl:hasOperationType is not really necessary either. I checked the code in the RMLMapper and these are not even used :see_no_evil: Let's drop them.

WoT also allows security to be integrated there, but that is not applicable here.

> but it can also be used as an intermediary stream from one internally running process to another, in which case we're not dealing with files.

Ah you mean this Java InputStream thing... If we have a scheme, it would work:

<#TriplesMap> a rr:TriplesMap;
  rml:logicalSource [ a rml:LogicalSource;
    rml:source [ a td:Thing;
      td:hasPropertyAffordance [ a td:PropertyAffordance;
          td:hasForm [
            hctl:hasTarget "inputstream://$NAME";
          ];
       ];
    ];
    rml:referenceFormulation ql:JSONPath;
    rml:iterator "$.station.[*]";
  ];

  # Subject, Predicate, Object Maps

If we want something 'simpler', we would need to define an ontology first (if we cannot re-use something existing). I would rather not maintain that, TBH. However, rml:source is fully open precisely for this: if you define a custom ontology like you did in CARML, you can just put that here. The problem with that ontology is that it only supports InputStreams. I would welcome a better approach that covers more than just InputStreams and uses standardized ontologies as much as possible. I picked WoT here because it is already included for Web APIs and some streams, and it seems 'easy' to support more types of streams thanks to the URL scheme, but that's my opinion. Open for suggestions!
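
As a rough sketch of how an engine could dispatch on the URL scheme of an hctl:hasTarget value, the Python below uses the standard `urllib.parse` module. The helper name `describe_target` is hypothetical, and `inputstream://` is the ad hoc scheme proposed in this thread, not a registered one.

```python
from urllib.parse import urlparse


def describe_target(target: str) -> tuple:
    """Classify an hctl:hasTarget value by its URL scheme (hypothetical helper)."""
    parsed = urlparse(target)
    if parsed.scheme == "file":
        # file:///path/to/named_pipe -> local path
        return ("file", parsed.path)
    if parsed.scheme == "inputstream":
        # inputstream://A -> the name the caller must bind a stream to
        return ("inputstream", parsed.netloc)
    raise ValueError(f"unsupported scheme: {parsed.scheme!r}")


file_target = describe_target("file:///path/to/named_pipe")
stream_target = describe_target("inputstream://A")
```

Supporting another kind of stream would then mean adding another branch (or a binding template on the description side) rather than a new vocabulary.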

bjdmeest commented 2 years ago

Am I right in understanding that directly using a (Java) InputStream is currently not possible with your suggested change, @DylanVanAssche? Of course, I guess this is very implementation-specific. @pmaria, in your experience, do you think there's another way to specify this? I guess it's tricky to support inputstream:// and then have the CG maintain some kind of mapping table that maps it to java.io.InputStream in Java, to stream.Readable in Node.js, etc. Maybe it makes sense to leave that to the discretion of the engine implementation and suggest supporting an inputstream:// source, with the caveat that each engine clearly documents which implementation is meant by that?

pmaria commented 4 months ago

@DylanVanAssche I'm currently looking into this again. It is still unclear to me how we would now write a source for this.

DylanVanAssche commented 4 months ago

@pmaria

> It is still unclear to me how we would now write a source for this.

Could you please specify what is unclear to you here? I'm not sure we're on the same page.

pmaria commented 4 months ago

The use case is that you somehow obtain an input stream (I'm using Java terminology here, but in Python an example would be something like BytesIO) and you want to be able to use that as a source.

In Java, most data processing libraries that are used for the reference formulations accept input streams as a source of input. It is also relatively easy to serialize some object into a form that matches a supported source type and convert that to an input stream.

This makes it possible to easily integrate an RML processor as a part of a pipeline using any type of ETL approach. This is where I see a lot of users of CARML using https://github.com/carml/carml#input-stream-extension, for lack of a standard approach.
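
The serialize-then-stream step described above can be sketched in Python with `io.BytesIO`, the analogue mentioned earlier in this comment. The sample record is made up for illustration.

```python
import io
import json

# Some object produced earlier in the pipeline.
record = {"station": [{"id": 1, "name": "Central"}]}

# Serialize to a supported source format (JSON) and wrap it in a byte stream,
# the Python analogue of a Java InputStream.
stream = io.BytesIO(json.dumps(record).encode("utf-8"))

# A processor that accepts streams can now read it like any other source.
parsed = json.load(stream)
```

Nothing touches the file system here, which is what makes this pattern convenient inside an ETL pipeline.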

DylanVanAssche commented 4 months ago

> This makes it possible to easily integrate an RML processor as a part of a pipeline using any type of ETL approach.

IMO, you then only need to state that Logical Source X is linked to input stream A, by giving it a target that points to input stream A.

Ain't this enough then?

<#TriplesMap> a rr:TriplesMap;
  rml:logicalSource [ a rml:LogicalSource;
    rml:source [ a td:Thing;
      td:hasPropertyAffordance [ a td:PropertyAffordance;
          td:hasForm [
            hctl:hasTarget "inputstream://A";
          ];
       ];
    ];
    rml:referenceFormulation ql:JSONPath;
    rml:iterator "$.station.[*]";
  ];

When you call your RML engine from your code and pass in the input streams, you supply a Map that says inputstream://A == a java.io.InputStream instance.

Or am I missing something crucial here?
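
A sketch of the lookup described above, in Python rather than Java. The function `iterate_records` is a hypothetical engine entry point, and the plain dictionary access stands in for evaluating the JSONPath iterator; no real RML engine API is implied.

```python
import io
import json
from typing import BinaryIO, Dict, Iterator


def iterate_records(target: str, streams: Dict[str, BinaryIO]) -> Iterator[dict]:
    """Hypothetical engine entry point: resolve the target, then iterate records."""
    stream = streams[target]  # e.g. "inputstream://A" -> caller-supplied stream
    document = json.load(stream)
    # Stands in for the JSONPath iterator over the "station" array.
    yield from document["station"]


# The caller binds the identifier used in the mapping to a live stream.
streams = {"inputstream://A": io.BytesIO(b'{"station": [{"id": 1}, {"id": 2}]}')}
records = list(iterate_records("inputstream://A", streams))
```

The mapping only carries the identifier; the binding to an actual stream object happens in the calling code.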

pmaria commented 4 months ago

Well, I am still familiarizing myself with the td: and hctl: vocabs, but the definition given for hctl:Form, "target IRI of a link or submission target of a form", does not really seem to fit this use case.

Furthermore I don't know if we can consider "inputstream://" a standard notation. Or would this be something that is use case dependent and should an engine be provided with more information as to the type of target being specified?

DylanVanAssche commented 4 months ago

> target IRI of a link or submission target of a form. Does not really seem to fit this use case.

The W3C Web of Things vocabularies are made for streams, but the ontology can be a bit challenging to grasp, as with this definition.

> Furthermore I don't know if we can consider "inputstream://" a standard notation. Or would this be something that is use case dependent and should an engine be provided with more information as to the type of target being specified?

Well, I consider this InputStream to be yet another data format that needs to be described in an RML IO Registry description. We could also introduce another class, like we did for relative paths, but then we need to decide which approach we take and describe it. RML IO itself, in its abstract form, has no issues with streams IMO.

pmaria commented 4 months ago

> We could also introduce another class like we did for relative paths, but we need to decide then which approach we take and describe it.

Thinking a bit further on this, I think what we really need here is just a way to indicate that a source is provided to the engine programmatically. Similar to what @marioscrock describes in https://github.com/kg-construct/rml-io/issues/2#issuecomment-1063752019.

This could be very basic, and maybe all that we need extra for this would be to be able to name the provided source.

So something like

[] rml:logicalSource [
  rml:source [
    a rml:ProvidedSource ;
    rml:name "some identifying name" ; # optional?
  ] ;
] ;

An engine can then expose an API for providing this source. The rationale for keeping this so simple is that I think it will be difficult to standardize this across programming languages.

Furthermore, this opens the possibility to be able to support not only input-stream-like sources, but also other objects, like e.g. an already deserialized JSON node or XML node etc.
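
A minimal sketch of what such an engine API could look like, assuming JSON sources. The class `Engine` and its methods `provide_source` and `load` are hypothetical and not taken from any existing RML processor.

```python
import io
import json
from typing import Any, Dict


class Engine:
    """Hypothetical engine API; not taken from any existing RML processor."""

    def __init__(self) -> None:
        self._provided: Dict[str, Any] = {}

    def provide_source(self, name: str, source: Any) -> None:
        # `name` corresponds to the identifier given in the mapping.
        self._provided[name] = source

    def load(self, name: str) -> Any:
        source = self._provided[name]
        if hasattr(source, "read"):
            # Stream-like object: parse it.
            return json.load(source)
        # Already deserialized (dict, list, ...): use as-is.
        return source


engine = Engine()
engine.provide_source("from-stream", io.BytesIO(b'{"station": []}'))
engine.provide_source("from-object", {"station": []})
same = engine.load("from-stream") == engine.load("from-object")
```

Because the mapping only says "this source is provided", the engine is free to accept whatever object types make sense in its host language.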

DylanVanAssche commented 4 months ago

> Thinking a bit further on this, I think what we really need here is just a way to indicate that a source is provided to the engine programmatically.

Well, that's what I wanted to try with the example above. We're on the same page, it seems :) I was aiming for the same idea you present with rml:ProvidedSource. The name should not be optional; otherwise it will be hard to link sources programmatically (possibly introducing engine-specific ways, which we want to avoid).
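
One practical consequence of mandatory names, sketched in Python: an engine can check up front that every source named in the mapping has been bound by the caller. The helper `missing_sources` and the sample names are assumptions made for illustration.

```python
from typing import Dict, Iterable, Set


def missing_sources(declared: Iterable[str], provided: Dict[str, object]) -> Set[str]:
    """Return the names declared in the mapping that the caller did not bind."""
    return set(declared) - set(provided)


# Names declared in the mapping (hypothetical) and the objects bound at runtime.
declared_names = ["stations", "readings"]
provided_sources = {"stations": b'{"station": []}'}

unbound = missing_sources(declared_names, provided_sources)
```

With unnamed sources, a check like this is not possible once a mapping contains more than one provided source, and each engine would have to invent its own matching rule.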

> Furthermore, this opens the possibility to be able to support not only input-stream-like sources, but also other objects, like e.g. an already deserialized JSON node or XML node etc

+1

pmaria commented 4 months ago

OK. I think the hypermedia/wot vocabs don't really fit for this purpose. So how do you feel about rml:ProvidedSource with a property like rml:sourceIdentifier? Can we draft up something like that?

DylanVanAssche commented 4 months ago

+1 for drafting something up :)

pmaria commented 4 months ago

Ok, I will make a PR

kg-construct / rml-io

Input stream as source #2