As a citizen integrator, specify data type information in an integration flow

kcbabo commented 6 years ago

As a citizen integrator, I sometimes use connections within an integration flow that do not have a defined data model for input and/or output data. Examples of some 'typeless' connections used by my team are FTP, Amazon S3, JMS, and APIs which do not define input/output types in their Swagger definition. The issue with these typeless connections is that type-aware functionality in Syndesis (e.g. data mapper, basic filter) is not available because the types are not known. To address this issue, I want the ability to define the data type at any point in an integration where it is not known.

To address this requirement, I would like the ability to add a "Describe Data" step to an integration flow that declares the current data type. If a data type is already declared by a connector or step, I cannot change that data type with the Describe Data step. The configuration of the step should allow me to select from the following data type descriptions:

JSON schema
JSON instance document
XML schema
XML instance document
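
To make the request concrete, here is a minimal sketch of what a Describe Data declaration might carry. All names (`DataTypeKind`, `DataType`, `describe_data`) are hypothetical and not part of Syndesis; the sketch only assumes the four description kinds listed above and the rule that an already declared type cannot be changed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DataTypeKind(Enum):
    # The four description kinds listed in the request.
    JSON_SCHEMA = "json-schema"
    JSON_INSTANCE = "json-instance"
    XML_SCHEMA = "xml-schema"
    XML_INSTANCE = "xml-instance"


@dataclass(frozen=True)
class DataType:
    kind: DataTypeKind
    specification: str  # the schema text or the sample instance document


def describe_data(current: Optional[DataType], declared: DataType) -> DataType:
    """Declare a type only where none is known yet (hypothetical helper)."""
    if current is not None:
        # A type declared by a connector or an earlier step cannot be changed.
        raise ValueError("data type is already declared by a connector or step")
    return declared
```

Calling `describe_data(None, some_type)` returns the declared type, while calling it with an existing type raises an error instead of overriding it.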

I anticipate using a Describe Data step in the following situations:

A step after a typeless start connection, in order to declare the type of data that starts a flow
A step before a typeless step connection, in order to declare the type of data that a step connection expects
A step after a typeless step connection, in order to declare the type of data returned from a step connection
A step before a typeless finish connection, in order to declare the type of data that a finish connection expects
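
The four situations above could be located mechanically. The following is a rough sketch under assumptions: `Connection` and `describe_data_points` are made-up names, and a missing type is modeled as `None`. It is not how Syndesis represents a flow.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Connection:
    name: str
    input_type: Optional[str] = None   # None: the expected input type is unknown
    output_type: Optional[str] = None  # None: the produced output type is unknown


def describe_data_points(flow: List[Connection]) -> List[Tuple[str, str]]:
    """Return (position, connection name) pairs where a Describe Data step could help."""
    points = []
    last = len(flow) - 1
    for i, conn in enumerate(flow):
        # Every connection except the start consumes data.
        if i > 0 and conn.input_type is None:
            points.append(("before", conn.name))
        # Every connection except the finish produces data.
        if i < last and conn.output_type is None:
            points.append(("after", conn.name))
    return points
```

For a flow of three typeless connections named `ftp`, `api`, and `s3`, this reports a point after `ftp`, before and after `api`, and before `s3`.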

chirino commented 6 years ago

Let me come up with some scenarios:

Say you have a typeless start connection, and typeless end connection. You could add "Describe Data" step there, but it would not really be required right? It has no runtime processing associated with it? What if the data being passed is not actually what was described?
Say you have a typeless start connection, and typeless end connection. Could you add 2 "Describe Data" steps and a Mapper in between those to do a data mapping between typeless connections?

kcbabo commented 6 years ago

Excellent scenarios!

Say you have a typeless start connection, and typeless end connection. You could add "Describe Data" step there, but it would not really be required right?

That's right. Syndesis should be able to process typeless data - you only need to add a data type if you want to use type-aware features of Syndesis.

It has no runtime processing associated with it?

Correct - it's only metadata.

What if the data being passed is not actually what was described?

That's on the user, IMO.

Say you have a typeless start connection, and typeless end connection. Could you add 2 "Describe Data" steps and a Mapper in between those to do a data mapping between typeless connections?

'Zactly!
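
A small sketch of that last scenario, assuming a flow is just a list of steps and a payload is a dictionary. `DescribeData`, `Mapper`, and `run` are illustrative names only. The point it shows is that the two Describe Data steps carry metadata and do nothing at runtime, while the Mapper between them does the actual work.

```python
from typing import Any, Dict, List, Union

Payload = Dict[str, Any]


class DescribeData:
    """Metadata only: carries a declared type and has no runtime behavior."""

    def __init__(self, declared_type: str) -> None:
        self.declared_type = declared_type


class Mapper:
    """Maps fields from the source type to the target type."""

    def __init__(self, field_map: Dict[str, str]) -> None:
        self.field_map = field_map  # source field name -> target field name

    def apply(self, payload: Payload) -> Payload:
        return {target: payload[source] for source, target in self.field_map.items()}


def run(steps: List[Union[DescribeData, Mapper]], payload: Payload) -> Payload:
    for step in steps:
        if isinstance(step, Mapper):
            payload = step.apply(payload)
        # DescribeData steps are skipped: there is nothing to execute.
    return payload
```

With the steps `DescribeData("order-in")`, `Mapper({"id": "orderId"})`, `DescribeData("order-out")`, a payload of `{"id": 7}` comes out as `{"orderId": 7}`, and removing both Describe Data steps gives the same runtime result.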

kahboom commented 6 years ago

@chirino: "What if the data being passed is not actually what was described?" @kcbabo: "That's on the user, IMO."

Maybe it would be good to let the user know there is a discrepancy, in case they did it in error.

kcbabo commented 6 years ago

Agree in principle, but very tough in practice IMO. Can't really do this at the time the integration flow is created because you need the actual data that would be received at runtime. Might be able to log something at runtime, but that could be expensive from a performance standpoint on the happy path. But if there's an easy/reliable/performant way of doing it, then I'm all for it! :-)

lburgazzoli commented 6 years ago

About reporting, we could eventually leverage Camel's health checks and add a pass-through processor to the flow that checks and reports type correctness; you could then choose whether this check is included in the application's health endpoint or is only reported.
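
As a language-neutral illustration of the pass-through idea (this is not Camel's health check API; `TypeCheckReporter` and its fields are hypothetical), the processor below returns every message unchanged and only counts mismatches. The "type check" is assumed to be a simple test for required top-level JSON fields.

```python
import json
from typing import Any, Dict, Iterable


class TypeCheckReporter:
    """Pass-through step: never alters or rejects a message, only records mismatches."""

    def __init__(self, required_fields: Iterable[str], include_in_health: bool = False) -> None:
        self.required_fields = set(required_fields)
        self.include_in_health = include_in_health
        self.checked = 0
        self.mismatches = 0

    def process(self, body: str) -> str:
        self.checked += 1
        try:
            doc = json.loads(body)
            ok = isinstance(doc, dict) and self.required_fields.issubset(doc)
        except ValueError:  # json.JSONDecodeError is a ValueError
            ok = False
        if not ok:
            self.mismatches += 1
        return body  # the message continues through the flow untouched

    def health(self) -> Dict[str, Any]:
        # Mismatches affect the status only when the user opted in.
        down = self.include_in_health and self.mismatches > 0
        return {
            "status": "DOWN" if down else "UP",
            "checked": self.checked,
            "mismatches": self.mismatches,
        }
```

With `include_in_health` left at its default, a mismatch shows up in the counters but the reported status stays `UP`.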

kcbabo commented 6 years ago

Another option would be to allow the user to decide if they want to validate the payload at runtime as part of the Describe Data step in the flow. It should be disabled by default, but when enabled it performs a validation based on the data definition.

Schema validation is crazy expensive, so this would have a significant impact on performance. There's also the issue of how you validate against an instance document.
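
On the instance document question, one possible approach (an assumption for discussion, not a proposal for how Syndesis should do it) is to derive a structural shape from the sample and check that a payload has at least the same fields with the same JSON types. The helper names are hypothetical, and validation stays off unless requested.

```python
from typing import Any


def shape_of(value: Any) -> Any:
    """Reduce a parsed JSON sample to field names and type names."""
    if isinstance(value, dict):
        return {key: shape_of(item) for key, item in value.items()}
    if isinstance(value, list):
        # Use the first element as the pattern for the whole array.
        return [shape_of(value[0])] if value else []
    return type(value).__name__


def conforms(payload: Any, shape: Any) -> bool:
    """True if the payload has every field of the shape with a matching type."""
    if isinstance(shape, dict):
        return isinstance(payload, dict) and all(
            key in payload and conforms(payload[key], sub) for key, sub in shape.items()
        )
    if isinstance(shape, list):
        if not isinstance(payload, list):
            return False
        return all(conforms(item, shape[0]) for item in payload) if shape else True
    return type(payload).__name__ == shape


def check_payload(payload: Any, shape: Any, validate: bool = False) -> bool:
    """Validation is opt-in; when disabled the payload is accepted unchecked."""
    return conforms(payload, shape) if validate else True
```

This is deliberately loose: extra fields are allowed, and a sample cannot express optional fields or distinguish an integer from a number, which is part of why validating against an instance document is harder than validating against a schema.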

Overall, I think it's an interesting idea as a future feature, but don't view it as critical functionality for a first pass.

dhirajsb commented 6 years ago

Schema validation is expensive, but there could be cheaper validation actions, like content type checks, that might make sense. For example, check that a JSON service is getting a JSON request payload, and not XML.
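
A cheap check of that kind could look at nothing more than the first significant character of the body. This is a heuristic sketch with a made-up function name, not a full content type detector.

```python
def sniff_format(body: str) -> str:
    """Guess the payload format from its first non-whitespace character."""
    stripped = body.lstrip("\ufeff").lstrip()  # ignore a byte order mark and whitespace
    if stripped.startswith(("{", "[")):
        return "json"
    if stripped.startswith("<"):
        return "xml"
    return "unknown"
```

The cost is constant per message, so it avoids the performance concern raised about schema validation. The trade-off is that it only catches gross mismatches, and bare JSON scalars such as `42` are reported as `unknown`.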

syndesisio / syndesis-project

As a citizen integrator, specify data type information in an integration flow #182