redpanda-data / connect

Fancy stream processing made operationally mundane
https://docs.redpanda.com/redpanda-connect/about/

Impossible to index Elasticsearch docs with no _id field, to leverage performance gains from ES generated IDs #3016

Open SeanBarry opened 3 days ago

SeanBarry commented 3 days ago

Hey there, I'm using Connect to sink documents from Kafka topics to Elasticsearch. I've got a config that looks something like the following:

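- switch:
    cases:
      # Messages with a record_id: use it as the document _id so later messages can update the document.
      - check: meta("record_id") != nil
        output:
          elasticsearch:
            urls: ["${ELASTICSEARCH_ADDRESS}"]
            index: ${! meta("index_name") }
            id: ${! meta("record_id") }
            action: "index"
            ... other config ...
      # All other messages: no id is set here, so Connect applies its default id.
      - output:
          elasticsearch:
            urls: ["${ELASTICSEARCH_ADDRESS}"]
            index: ${! meta("index_name") }
            action: "index"
            ... other config ...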

The purpose here is that when a record_id is present, we want to use that as the document _id in Elasticsearch. This means that the document can be updated as new messages containing updates arrive.

However, not all datasets require updates. For some, each document is seen exactly once and only ever needs to be indexed. In these cases, we don't want to provide an _id to Elasticsearch. This is a best practice, directly recommended in the ES docs:

When indexing a document that has an explicit id, Elasticsearch needs to check whether a document with the same id already exists within the same shard, which is a costly operation and gets even more costly as the index grows. By using auto-generated ids, Elasticsearch can skip this check, which makes indexing faster.

Skipping that existence check is what speeds up indexing. Link to ES docs here
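To make the difference concrete, here is a minimal sketch of what the two cases look like at the Elasticsearch bulk API level. It is not Connect's code: the helper name `bulk_index_lines` and the index names and documents are hypothetical, and it only builds the NDJSON lines rather than sending them. When the action line carries no `_id`, Elasticsearch generates one itself.

```python
import json


def bulk_index_lines(index, doc, doc_id=None):
    """Return the two NDJSON lines for one bulk `index` action.

    Hypothetical helper for illustration only. When doc_id is None the
    `_id` key is left out entirely, which is what lets Elasticsearch
    auto-generate an ID and skip the existence check.
    """
    action = {"_index": index}
    if doc_id is not None:
        action["_id"] = doc_id
    return [json.dumps({"index": action}), json.dumps(doc)]


# Updatable dataset: explicit _id taken from the record.
with_id = bulk_index_lines("orders", {"total": 10}, doc_id="abc-123")

# Append-only dataset: no _id in the action line at all.
without_id = bulk_index_lines("events", {"type": "click"})

print("\n".join(with_id + without_id))
```

The second case is what this issue asks for: an action line with no `_id` key at all, as opposed to one holding an empty or placeholder value.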

The specific issue I see is that when my messages do not contain a record_id, Connect falls back to the default ${!counter()}-${!timestamp_unix()}, which is documented here. This means the _id is still generated by Connect rather than by Elasticsearch.

I can't see any way to avoid this. I've tried setting the id explicitly to nil, null and "" but all of these result in a single document in ES with the ID of "null", and all messages overwrite that single document.
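The overwrite behaviour follows from how an `index` action treats an explicit `_id`: if a document with that ID already exists, it is replaced and its version is incremented. The toy model below only simulates that rule with an in-memory dict. It does not call Elasticsearch or Connect, `index_doc` and `store` are hypothetical names, and it makes no claim about why the ID ends up as the string "null".

```python
def index_doc(store, doc_id, doc):
    """Toy model of an `index` action with an explicit _id: create or replace."""
    version = store[doc_id]["version"] + 1 if doc_id in store else 1
    store[doc_id] = {"version": version, "source": doc}


store = {}

# Three distinct messages that all arrive with the same literal ID "null".
for n in range(3):
    index_doc(store, "null", {"message": n})

# Only one document remains, holding the last message.
print(len(store), store["null"])
```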

My request: could the default behaviour be changed so that, instead of falling back to ${!counter()}-${!timestamp_unix()} as the default _id, Connect provides no _id and lets ES generate one?

Thanks