The purpose here is that when a record_id is present, we want to use that as the document _id in Elasticsearch. This means that the document can be updated as new messages containing updates arrive.
However, not all datasets require updates. For some, each document is seen only once and only ever needs to be indexed. In these cases, we don't want to provide an _id to Elasticsearch at all. This is a best practice, directly recommended in the ES docs:
When indexing a document that has an explicit id, Elasticsearch needs to check whether a document with the same id already exists within the same shard, which is a costly operation and gets even more costly as the index grows. By using auto-generated ids, Elasticsearch can skip this check, which makes indexing faster.
In other words, omitting the _id speeds up indexing because ES does not need to check whether the ID already exists before indexing. Link to ES docs here
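To make the two cases concrete, here is a minimal sketch in plain Python (not Connect code) of the action line that an Elasticsearch bulk `index` request would carry in each case. The function name `bulk_action_line` and the index name `my-index` are hypothetical; only `record_id` comes from the messages described above.

```python
import json


def bulk_action_line(index, message):
    """Build the action line of an Elasticsearch bulk `index` request.

    Hypothetical helper for illustration only; it is not part of Connect.
    """
    meta = {"_index": index}
    record_id = message.get("record_id")
    if record_id is not None:
        # Explicit _id: later messages with the same record_id
        # overwrite the same document.
        meta["_id"] = str(record_id)
    # No record_id: leave _id out entirely so ES auto-generates one.
    return json.dumps({"index": meta})


# A message that should be updatable, and one that is only ever indexed once.
print(bulk_action_line("my-index", {"record_id": "abc-123", "value": 1}))
print(bulk_action_line("my-index", {"value": 2}))
```

The second action line carries no `_id` key at all, which is the case the ES docs describe as faster.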
The specific issue I see is that when my messages do not contain a record_id field, Connect falls back to the default ${!counter()}-${!timestamp_unix()}, which is documented here. The _id is therefore generated by Connect rather than by ES.
I can't see any way to avoid this. I've tried setting the id explicitly to nil, null and "" but all of these result in a single document in ES with the ID of "null", and all messages overwrite that single document.
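The overwrite behaviour can be reproduced with a toy model. This is only a sketch, assuming nothing about ES beyond "a write with an existing _id replaces that document": the index is a plain dict, and `sink` and the id functions are hypothetical names.

```python
def sink(messages, resolve_id):
    """Toy stand-in for an index: a dict keyed by _id, where writing
    an existing _id replaces the stored document."""
    index = {}
    for n, message in enumerate(messages, start=1):
        index[resolve_id(n, message)] = message
    return index


messages = [{"value": i} for i in range(5)]

# What happens when the id resolves to the literal string "null"
# for every message: each write lands on the same document.
stuck = sink(messages, lambda n, message: "null")

# Stand-in for a unique id per message. This only mimics the uniqueness
# of the counter/timestamp default; it is not Connect's implementation.
unique = sink(messages, lambda n, message: f"{n}-1700000000")

print(len(stuck), len(unique))
```

In the first case only the last message survives under the single `"null"` key, which matches the single document seen in ES.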
My request: could the default behaviour be changed so that, instead of falling back to ${!counter()}-${!timestamp_unix()} as a default _id, Connect provides no _id by default and lets ES generate one?
Hey there, I'm using Connect to sink documents from Kafka topics to Elasticsearch. I've got a config that looks something like the following:
Thanks