Closed: ryancrawcour closed this issue 3 years ago
Config -
batch.size: Batch size between 1 (a dedicated PutItemRequest for each record) and 25 (the maximum number of items in a BatchWriteItemRequest)
Type: int Default: 1 Importance: high
The batch size is defined using the property connect.cosmosdb..task.batch.size and is used to set MaxItemCount in the Change Feed options: changeFeedOptions.setMaxItemCount(setting.batchSize).
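As a minimal sketch of how the connector might resolve that setting, the helper below parses the batch size from the connector config, falls back to the documented default of 1, and clamps the value to the 1..25 range quoted above. The class and method names are hypothetical, not the connector's actual API, and the property name is reproduced exactly as quoted in this thread:

```java
import java.util.Map;

public class BatchSizeConfig {
    // Range and default taken from the config description in this issue (hypothetical helper).
    static final int MIN_BATCH = 1;
    static final int MAX_BATCH = 25;
    // Property name reproduced verbatim from the comment above.
    static final String BATCH_SIZE_PROP = "connect.cosmosdb..task.batch.size";

    // Parse batch.size from the connector config, defaulting to 1 and
    // clamping to the documented range before it is passed to setMaxItemCount.
    static int resolveBatchSize(Map<String, String> config) {
        String raw = config.getOrDefault(BATCH_SIZE_PROP, "1");
        int value;
        try {
            value = Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            value = 1; // fall back to the documented default on a bad value
        }
        return Math.max(MIN_BATCH, Math.min(MAX_BATCH, value));
    }
}
```

The resolved value would then feed changeFeedOptions.setMaxItemCount(...) as described above.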
We must ensure that reading from Cosmos DB in a batch, and writing to Kafka in a batch, is supported in the new Java code.
Related: #148 and #8
Officially, using KafkaProducer and ProducerRecord you can't batch explicitly, but you can achieve batching by configuring some properties in ProducerConfig.
batch.size - from the documentation: the producer batches records into requests that are sent to the same partition, and sends them at once
The producer will attempt to batch records together into fewer requests whenever multiple records are being sent to the same partition. This helps performance on both the client and the server. This configuration controls the default batch size in bytes. No attempt will be made to batch records larger than this size.
some more information on producer client and batching - https://aiokafka.readthedocs.io/en/stable/producer.html
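To make the producer-side knobs concrete, here is a minimal sketch of a producer configuration that enables batching. The batch.size and linger.ms values are illustrative, and the broker address is a placeholder; note that batch.size is a limit in bytes per partition, not a record count:

```java
import java.util.Properties;

public class ProducerBatchingConfig {
    // Sketch of the producer-side batching properties discussed above.
    static Properties batchingProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // batch.size is in BYTES per partition, not a number of records.
        props.put("batch.size", "16384");
        // linger.ms gives the producer time to fill a batch before sending,
        // trading a little latency for better batching.
        props.put("linger.ms", "50");
        return props;
    }
}
```

These properties would be passed to the KafkaProducer constructor; with linger.ms at 0 (the default) the producer may still send records immediately even when batch.size is large.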
Does Cosmos DB support receiving a batch of items from the Change Feed at once? As Marcel says above, MaxItemCount can be used, but does that buffer internally, or does it set how many items to return from the Change Feed on each poll interval?
If we batch Cosmos DB messages, what do we do with the checkpoint and watermarks? E.g. Cosmos could send us 5 messages while we're configured to only write once we have 10 in a batch. Is there any advantage in doing this?
The disadvantage is that if the connector fails on receiving the 6th message, before it has flushed anything to Kafka, we lose the 5 currently buffered messages; Cosmos DB will think it has already given them to the connector, so it won't give them again.
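One way to frame the concern above: the checkpoint must only advance after a buffer has actually been flushed to Kafka, so that on a crash the unflushed items are re-delivered by the Change Feed rather than lost. The sketch below illustrates that rule under assumed names (CheckpointingBuffer, flushTarget, continuation tokens as plain strings); it is not the connector's implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Buffer change-feed items, flush to a sink (e.g. a Kafka producer) when the
// batch fills, and advance the checkpoint ONLY after a successful flush.
// If the process dies with items still buffered, the checkpoint still points
// at the last flushed item, so Cosmos DB would re-deliver the buffered ones.
public class CheckpointingBuffer {
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();
    private final Consumer<List<String>> flushTarget;
    private String checkpoint = null; // continuation token of the last FLUSHED item

    CheckpointingBuffer(int batchSize, Consumer<List<String>> flushTarget) {
        this.batchSize = batchSize;
        this.flushTarget = flushTarget;
    }

    // Add an item together with its continuation token; flush when full.
    void add(String item, String continuationToken) {
        buffer.add(item);
        if (buffer.size() >= batchSize) {
            flushTarget.accept(new ArrayList<>(buffer)); // hand off to the sink
            buffer.clear();
            checkpoint = continuationToken; // safe to advance only now
        }
    }

    String checkpoint() { return checkpoint; }
    int buffered() { return buffer.size(); }
}
```

With this rule, the failure scenario described above causes re-delivery (at-least-once) instead of data loss, at the cost of possible duplicates in Kafka.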
For this first pass we will park batch support and come back to it later.
When reading from Cosmos DB and writing to Kafka, batching should be the default behaviour. The batch size should default to a chosen value, but be configurable by the user.