EventQueue in SoapSourceTask.poll()

marcelboldt commented 3 years ago

Currently poll() is implemented in a way that serialises multiple requests, which is quite inefficient. We could explore using an event queue so that the requests run in parallel.
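A minimal sketch of what such a queue could look like, assuming worker threads push responses into a shared buffer and poll() only drains it. `ResponseQueue`, `offer` and `drain` are hypothetical names for illustration, not the connector's actual code, and responses are plain strings to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical buffer between the request workers and poll(); illustration only.
final class ResponseQueue {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // Called by a worker thread whenever a SOAP response has arrived.
    void offer(String response) {
        queue.add(response);
    }

    // Called from poll(): wait up to waitMillis for a first response,
    // then take whatever else is already queued, up to maxBatch in total.
    List<String> drain(int maxBatch, long waitMillis) {
        List<String> batch = new ArrayList<>();
        try {
            String first = queue.poll(waitMillis, TimeUnit.MILLISECONDS);
            if (first != null) {
                batch.add(first);
                queue.drainTo(batch, maxBatch - 1);
            }
        } catch (InterruptedException e) {
            // Restore the interrupt flag and return what we have (nothing).
            Thread.currentThread().interrupt();
        }
        return batch;
    }
}
```

With this shape poll() never blocks on a slow endpoint for longer than the wait time; it returns an empty batch instead.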

ogomezso commented 3 years ago

Take into account that we will need to preserve the order within the same request type.

marcelboldt commented 3 years ago

I'd assume that this issue wouldn't affect ordering as long as there is only one client per request type.
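To illustrate the "one client per request type" idea: a rough sketch in which each request type gets its own single-threaded executor, so different types run in parallel while calls of the same type stay in submission order. `PerTypeDispatcher` and its methods are made-up names, and the SOAP call itself is represented by a plain `Runnable`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical dispatcher: one single-threaded lane per request type.
final class PerTypeDispatcher {
    private final Map<String, ExecutorService> lanes = new ConcurrentHashMap<>();

    // Calls of the same type are queued on the same thread and therefore
    // run strictly in the order they were submitted.
    void submit(String requestType, Runnable call) {
        lanes.computeIfAbsent(requestType, t -> Executors.newSingleThreadExecutor())
             .execute(call);
    }

    // Stops accepting work and waits for queued calls to finish.
    // Returns false if a lane did not finish in time or the wait was interrupted.
    boolean shutdownAndWait(long timeoutMillis) {
        lanes.values().forEach(ExecutorService::shutdown);
        boolean allDone = true;
        try {
            for (ExecutorService lane : lanes.values()) {
                allDone &= lane.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return allDone;
    }
}
```

This only guarantees the order in which requests are issued and their results handled per type; it says nothing about ordering across types.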

What may be missing is rather a way to more explicitly determine order through configuration. I see several ways order within a request topic could be defined:

- The correct order is how data is served from the SOAP service, i.e. the order in which requests arrive at the server's endpoint.
- There is explicit ordering:
  - How is it defined? Timestamp, ascending id, ...?
  - Where is it defined? In the SOAP header or body, or in an HTTP header? Of the request or of the response?

In Kafka there is an implicit order given by the relative position of a record in the log file on disk, which is the same order that is reflected by the offset per partition. Additionally, a timestamp is stored with each record. The default is event/producer time (the producer sets the timestamp to the send time); the alternatives are ingestion/broker time (when the broker received the record), processing time (the current time at processing), and payload time (an explicit timestamp defined in the record).

kafka / SOAP	offset per partition	event/producer time	ingestion/broker time	processing/current time	payload-defined time
request order (impl.)
response order (impl.)
asc id in SOAP data
asc id in HTTP header
timestamp in request's SOAP data
timestamp in response's SOAP data
timestamp in request HTTP header
timestamp in response HTTP header

Firstly, I would tend to exclude HTTP-header-based information, as the W3C SOAP specification wants all information that a SOAP application acts upon to be in the SOAP message; SOAP is also valid when transferred over non-HTTP protocols.

marcelboldt commented 3 years ago

For the implicit ordering, this order could be preserved via idempotent producers to ensure it is kept. Question: is this order meaningful at all, or is it more or less arbitrarily determined by network latency between the SOAP client and the endpoint?

As for time-based ordering based on timestamp information in the SOAP data: the Kafka record's timestamp field can be set to this value. The fact that the order is undefined among messages with the same timestamp should be acceptable.
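A small sketch of the conversion this would need, assuming the SOAP payload carries an xsd:dateTime value with an explicit UTC offset (values without an offset would be rejected by this parser). `SoapTimestamps` is a hypothetical helper; extracting the value from the XML is left out. The resulting epoch milliseconds could then be handed to the `timestamp` argument of Kafka Connect's `SourceRecord` constructor.

```java
import java.time.OffsetDateTime;

// Hypothetical helper; how the value is read from the SOAP body is out of scope here.
final class SoapTimestamps {
    private SoapTimestamps() {}

    // Converts an ISO-8601 date-time with offset (e.g. "2021-03-01T13:00:00+01:00")
    // to epoch milliseconds, which is the unit Kafka record timestamps use.
    static long toEpochMillis(String xsdDateTime) {
        return OffsetDateTime.parse(xsdDateTime).toInstant().toEpochMilli();
    }
}
```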

Asc id: the producer would have to buffer and order the messages. What if there is a gap in the asc id of the incoming data? How long should the producer keep the records and wait for subsequent records to fill the gap?
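One possible answer to the gap question, sketched under assumptions: buffer records by id, release them in ascending order, and give up on a gap after a configurable wait. `ReorderBuffer`, the `maxWaitMillis` setting and the choice to drop records that arrive after their id was skipped are all hypothetical, not something the connector does today.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical reorder buffer keyed by ascending id; illustration only.
final class ReorderBuffer {
    private final TreeMap<Long, String> pending = new TreeMap<>();
    private final long maxWaitMillis;
    private long nextId;
    private long gapSinceMillis = -1; // -1: not currently waiting on a gap

    ReorderBuffer(long firstId, long maxWaitMillis) {
        this.nextId = firstId;
        this.maxWaitMillis = maxWaitMillis;
    }

    // Buffers a record; ids that were already released or skipped are dropped.
    void add(long id, String payload) {
        if (id >= nextId) {
            pending.put(id, payload);
        }
    }

    // Returns the records that may be emitted now, in ascending id order.
    // A missing id blocks release until maxWaitMillis has passed since the
    // gap was first observed; after that the gap is skipped.
    List<String> release(long nowMillis) {
        List<String> out = new ArrayList<>();
        while (!pending.isEmpty()) {
            long lowest = pending.firstKey();
            if (lowest != nextId) {
                if (gapSinceMillis < 0) {
                    gapSinceMillis = nowMillis;
                }
                if (nowMillis - gapSinceMillis < maxWaitMillis) {
                    break; // keep waiting for the missing ids
                }
            }
            out.add(pending.pollFirstEntry().getValue());
            nextId = lowest + 1;
            gapSinceMillis = -1;
        }
        return out;
    }
}
```

The wait time trades latency against completeness: a longer wait delays everything behind a gap, a shorter one risks emitting past a record that is merely late.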

marcelboldt commented 3 years ago

@ogomezso Happy to read your feedback on this. How do SOAP systems usually handle or expect order? Do they have a notion of it, or is a SOAP record considered self-sufficient, with services expected to be stateless?

marcelboldt commented 3 years ago

With regard to using data contained in the SOAP message: once it starts processing data contained in a SOAP message, the processor becomes a SOAP node, which would in particular have to process the SOAP header as laid out in chapter 2 of the SOAP specification, especially https://www.w3.org/TR/soap12-part1/Overview.html#relaysoapmsg. The processing would have to follow an explicit SOAP protocol binding: https://www.w3.org/TR/soap12-part1/Overview.html#transpbindframew

That's a big decision... While it could make sense to create a Kafka protocol binding for SOAP, I tend to think that this exceeds the scope of this activity (at least for now). My preferred alternative would be not to touch SOAP data. What remains is to keep the implicit order (rows 1-2 in the table, i.e. request and response order) if it turns out that it isn't randomly determined by network characteristics.

ogomezso / kafka-connect-soap

EventQueue in SoapSourceTask.poll() #6