snowplow / enrich

Snowplow Enrichment jobs and library
https://snowplowanalytics.com

event_id generation based on UUID principle #809

Open AkhtemWays opened 1 year ago

AkhtemWays commented 1 year ago

Project: snowplow/enrich (common) Version: master (latest) Expected behavior: generate a universally unique event ID across the entire pipeline every time. Actual behavior: duplicate event_id values are generated at the enrichment stage. Steps to reproduce: create an enriched event via the setupEnrichedEvent method of com.snowplowanalytics.snowplow.enrich.common.enrichments.EnrichmentManager, and proceed with the case where an EnrichedEvent is returned.

  1. The problem with unique ID generation requires our team to build deduplication logic on our side after the collector and enrichment stages are done, and there doesn't seem to be an easy way to fix it because of the "at least once" processing policy for those events. I was curious about the reason for choosing UUID-based event_id generation and the absence of a custom configuration option. I would also like to propose using the Twitter Snowflake strategy to create those IDs. The UUID strategy is mostly tied to the MAC address of the network interface, whereas Twitter Snowflake includes a machine ID, which I think could resolve the duplication issue. I might be wrong, though, and wanted to know the reason for going with the UUID strategy.
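The deduplication workaround mentioned above can be sketched minimally. This is a hypothetical illustration, not Snowplow code: with at-least-once delivery the same event can be processed twice, so a downstream consumer keeps only the first occurrence of each event_id.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical downstream dedup sketch (not part of Snowplow): drop repeated
// event_id values, keeping only the first occurrence of each one.
public class DedupSketch {
    static List<String> dedup(List<String> eventIds) {
        Set<String> seen = new HashSet<>();
        List<String> out = new ArrayList<>();
        for (String id : eventIds) {
            if (seen.add(id)) out.add(id);  // add() returns false on a repeat
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(dedup(List.of("a", "b", "a", "c")));  // prints [a, b, c]
    }
}
```

In practice pipelines often key the dedup on event_id combined with an event fingerprint, so that genuinely different events that happen to collide on ID are not dropped.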
istreeter commented 1 year ago

Hi @AkhtemWays, this is an interesting topic of discussion, which I had not considered before. Can you give any more details about how often you are seeing duplicate IDs? How do you know the duplicate IDs were created by enrich, and not by the trackers that sent the events?

wanted to know the reason for going towards UUID strategy.

I can't answer this, because the design decision pre-dates when I joined Snowplow!

But... I had never considered it a bad decision. To the best of my understanding, the Java implementation of UUID.randomUUID() is extremely unlikely to generate duplicates.
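As a rough illustration of that point (a standalone sketch, not Snowplow's actual code path): java.util.UUID.randomUUID() produces version-4 UUIDs with 122 random bits drawn from a SecureRandom source, so it is not derived from the MAC address, and collisions within any realistic batch are effectively impossible.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

// Standalone sketch: generate n random (version-4) UUIDs and count how many
// are distinct. With 122 random bits per UUID, we expect no collisions.
public class UuidSketch {
    static int uniqueCount(int n) {
        Set<UUID> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(UUID.randomUUID());
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(uniqueCount(1_000_000));  // expect 1000000: no collisions
    }
}
```

By the birthday bound, the chance of any collision among a million random UUIDs is on the order of 10^12 / 2^123, i.e. vanishingly small.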

We do often see duplicate event IDs downstream of enrich. But in our experience those duplicates arise from either:

AkhtemWays commented 1 year ago

When I ran a group-by query on event_id and ordered by counts, I found as many as 45 rows sharing the same event_id in the DWH. As far as I understand, none of the UUID versions actually guarantees total uniqueness. This affects joins when the data flows further down to other data sources: at that point we are forced to join on many fields, which degrades query performance and CPU load overall. If the primary goal is generating absolutely universally unique IDs, then the Twitter Snowflake strategy would be a good choice, I suppose. The algorithm ensures unique ID generation because each ID is tied to the unix timestamp plus the machine_id that generates it, which basically means that at one point in time one machine can generate only one ID; if a script generates IDs for multiple objects at the same time, the solution could be to sleep for one nanosecond, or some other strategy. I think this would solve the three problems you described and the uniqueness problem overall. One configurable env parameter could be added to specify the machine_id, I guess.
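For reference, the proposed Snowflake-style layout can be sketched as follows. This is a hypothetical, minimal implementation (41-bit millisecond timestamp, 10-bit machine ID, 12-bit sequence, after Twitter's published layout), not anything in Snowplow. Note that real implementations use a per-millisecond sequence counter rather than sleeping, and that uniqueness holds only if every worker is assigned a distinct machine ID, which is exactly the operational guarantee that is hard to provide.

```java
// Hypothetical Snowflake-style ID generator (not Snowplow code).
// Layout: 41 bits of millisecond timestamp | 10 bits machine id | 12 bits sequence.
public class SnowflakeSketch {
    private final long machineId;   // must be unique per worker, range 0..1023
    private long lastMillis = -1L;
    private long sequence = 0L;

    SnowflakeSketch(long machineId) {
        this.machineId = machineId & 0x3FF;
    }

    synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastMillis) {
            sequence = (sequence + 1) & 0xFFF;   // 12-bit per-millisecond sequence
            if (sequence == 0) {
                // sequence exhausted within this millisecond: spin to the next one
                while (now <= lastMillis) now = System.currentTimeMillis();
            }
        } else {
            sequence = 0;
        }
        lastMillis = now;
        return (now << 22) | (machineId << 12) | sequence;
    }

    public static void main(String[] args) {
        SnowflakeSketch gen = new SnowflakeSketch(1);
        long a = gen.nextId();
        long b = gen.nextId();
        System.out.println(a != b);  // prints true: one machine never repeats an ID
    }
}
```

This sketch does not handle clock skew: if the system clock moves backwards, a real implementation must refuse to generate IDs until the clock catches up, which is one of the operational costs istreeter and miike allude to below.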

istreeter commented 1 year ago

I am not surprised that you see duplicate events in the DWH. But I think you are looking in the wrong place for the problem if you think it's because of our UUID generator.

If you find two events with the same event id, then interesting questions to look at next are:

If you investigate the duplicate IDs in your DWH further, I am sure you will find there are other explanations, unrelated to how we generate UUIDs.

miike commented 1 year ago

To add to @istreeter's comments above: although it is possible to get event ID duplicates, it is generally rare to see genuine collisions unless duplicates are being sent.

We are unlikely to introduce any technology (e.g., Twitter Snowflake) to produce truly globally unique IDs, as this is very computationally expensive and we cannot rely on sources of server information (e.g., worker and shard numbers) originating from the client.