microsoft / kafka-connect-cosmosdb

Kafka Connect connectors for Azure Cosmos DB
MIT License

Must have support for batching when writing to Kafka #9

Closed ryancrawcour closed 3 years ago

ryancrawcour commented 5 years ago

When reading from Cosmos DB and writing to Kafka, batching should be the default behaviour. The batch size should default to a chosen value, but be configurable by the user.

ryancrawcour commented 5 years ago

Config -

batch.size - Batch size between 1 (a dedicated PutItemRequest for each record) and 25 (the maximum number of items in a BatchWriteItemRequest).

Type: int Default: 1 Importance: high

marcelaldecoa commented 5 years ago

The batch size is defined using the property connect.cosmosdb..task.batch.size, and it is used to set the MaxItemCount in the Change Feed options: changeFeedOptions.setMaxItemCount(setting.batchSize).
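To make the configurable-with-default behaviour requested above concrete, here is a minimal sketch of reading the batch-size property from a connector config map. The class and method names are hypothetical (not the connector's actual code); the property key is quoted as it appears in the comment above, and the parsed value is what would be passed to changeFeedOptions.setMaxItemCount(...).

```java
import java.util.Map;

// Hypothetical helper, for illustration only: parse the task batch size
// from the connector's config map, with a default when it is not set.
public class BatchSizeConfig {
    // Property name as quoted in this thread.
    public static final String BATCH_SIZE_PROP = "connect.cosmosdb..task.batch.size";
    public static final int DEFAULT_BATCH_SIZE = 1;

    public static int batchSize(Map<String, String> config) {
        String raw = config.get(BATCH_SIZE_PROP);
        if (raw == null) {
            return DEFAULT_BATCH_SIZE; // not configured: fall back to the default
        }
        int value = Integer.parseInt(raw);
        if (value < 1) {
            throw new IllegalArgumentException("batch size must be >= 1, got " + value);
        }
        return value;
    }
}
```

The parsed value would then be applied to the Change Feed options, e.g. `changeFeedOptions.setMaxItemCount(BatchSizeConfig.batchSize(config))`.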

ryancrawcour commented 4 years ago

Must ensure that reading from Cosmos DB in a batch, and writing to Kafka in a batch, is supported in the new Java code.

ryancrawcour commented 3 years ago

related #148 and #8

ryancrawcour commented 3 years ago

Officially, using KafkaProducer and ProducerRecord you can't do that directly, but you can achieve it by configuring some properties in ProducerConfig:

batch.size - from the documentation: the producer batches up records into requests that are being sent to the same partition, and sends them at once.

The producer will attempt to batch records together into fewer requests whenever multiple records are being sent to the same partition. This helps performance on both the client and the server. This configuration controls the default batch size in bytes. No attempt will be made to batch records larger than this size.
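The producer settings described above can be sketched as follows. The string keys correspond to ProducerConfig's BATCH_SIZE_CONFIG and LINGER_MS_CONFIG constants in kafka-clients; the broker address and values are illustrative placeholders, not recommendations.

```java
import java.util.Properties;

// Sketch of the producer properties that control batching.
public class ProducerBatchingConfig {
    public static Properties batchingProps() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker
        // Default batch size in BYTES per partition (16 KB is Kafka's default);
        // no attempt is made to batch records larger than this size.
        props.setProperty("batch.size", "16384");
        // How long to wait for more records before sending a partially full
        // batch; Kafka's default of 0 sends batches as soon as possible.
        props.setProperty("linger.ms", "5");
        return props;
    }
}
```

Note that this batch.size is measured in bytes, not in records, which differs from the per-item batch.size proposed earlier in this issue.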

ryancrawcour commented 3 years ago

Some more information on the producer client and batching: https://aiokafka.readthedocs.io/en/stable/producer.html

ryancrawcour commented 3 years ago

Does Cosmos DB support receiving a batch of items from the Change Feed at once? As Marcel says above, there is MaxItemCount that can be used, but is that buffering internally, or setting how many items to return from the Change Feed for each poll interval?

If we do batching of Cosmos DB messages, what do we do with the checkpoint and watermarks? E.g. Cosmos could send us 5 messages, but we're configured to only write 10 in a batch. Is there any advantage in doing this?

The disadvantage is that if the connector fails on receiving the 6th message, before it has flushed anything to Kafka, we would lose the 5 currently buffered messages: Cosmos DB would think it has already given them to the connector, so it won't give them again.
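One way to avoid the failure mode described above is to advance the checkpoint only after a buffered batch has actually been flushed to Kafka; a crash mid-buffer then re-reads the unflushed items on restart instead of losing them. The sketch below is hypothetical (class, method, and continuation-token names are invented for illustration), not the connector's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: buffer Change Feed records and only advance the
// committed continuation (checkpoint) once the batch is flushed to Kafka.
public class BatchBuffer {
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();
    private String pendingContinuation;   // continuation after the last buffered record
    private String committedContinuation; // continuation that is safe to checkpoint

    public BatchBuffer(int batchSize, String startContinuation) {
        this.batchSize = batchSize;
        this.committedContinuation = startContinuation;
    }

    // Buffer one record; returns the flushed batch when the buffer fills, else null.
    public List<String> add(String record, String continuation) {
        buffer.add(record);
        pendingContinuation = continuation;
        return buffer.size() >= batchSize ? flush() : null;
    }

    // Flush the buffer (i.e. hand records to Kafka), then advance the checkpoint.
    public List<String> flush() {
        List<String> batch = new ArrayList<>(buffer);
        buffer.clear();
        if (pendingContinuation != null) {
            committedContinuation = pendingContinuation;
        }
        return batch;
    }

    // A restart would resume the Change Feed from this continuation.
    public String committedContinuation() {
        return committedContinuation;
    }
}
```

With this ordering, a crash after buffering 5 of 10 records leaves the checkpoint at its pre-batch value, so the restarted connector re-reads those 5 records rather than losing them.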

ryancrawcour commented 3 years ago

For this first pass we will park batch support and come back to it later.