Open AkhtemWays opened 1 year ago
Hi @AkhtemWays, this is an interesting topic of discussion, which I had not considered before. Can you give any more details about how often you are seeing duplicate IDs? How do you know the duplicate IDs were created by enrich, and not by the trackers that sent the events?
I wanted to know the reason for going towards the UUID strategy.
I can't answer this, because the design decision pre-dates when I joined Snowplow!
But... I had never considered it a bad decision. To the best of my understanding, the Java implementation of UUID.randomUUID() is extremely unlikely to generate duplicates.
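For context, UUID.randomUUID() produces a version-4 (random) UUID: 122 of its 128 bits come from a cryptographically strong PRNG, so the chance that any two of n IDs collide is roughly n^2 / 2^123. A small standalone sketch (not Snowplow code) illustrating this in practice:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

public class UuidCollisionDemo {
    public static void main(String[] args) {
        // UUID.randomUUID() draws 122 random bits from SecureRandom,
        // so duplicates among even millions of IDs are vanishingly unlikely.
        Set<UUID> seen = new HashSet<>();
        int n = 1_000_000;
        for (int i = 0; i < n; i++) {
            UUID id = UUID.randomUUID();
            if (!seen.add(id)) {
                System.out.println("collision: " + id);
            }
        }
        System.out.println("generated " + n + " ids, distinct: " + seen.size());
        // All such IDs report version 4 (the "random" UUID variant):
        System.out.println("version: " + seen.iterator().next().version());
    }
}
```

At one million IDs the collision probability is on the order of 10^-25, so any duplicates observed at that volume almost certainly come from elsewhere in the pipeline.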
We do often see duplicate event IDs downstream of enrich. But in our experience those duplicates arise from events being duplicated before they reach enrich, not from the UUID generator.
When I ran a GROUP BY query on event_id and ordered by count, I found up to 45 rows sharing the same event_id in the DWH. As far as I understand, no version of UUID actually guarantees total uniqueness. This affects joins when the data flows further down to other data sources: at that point we are forced to join on many fields, which degrades query performance and overall CPU load. If the primary goal is the generation of absolutely universally unique IDs, then I suppose the Twitter Snowflake strategy would be a good choice. The algorithm guarantees unique ID generation because each ID is tied to a unix_timestamp plus the machine_id that generated it, which basically means that at any one point in time one machine can generate only one ID. If the script generates IDs for multiple objects at the same time, the solution could be to sleep for one nanosecond, or some other strategy. I think this would solve the three problems you described and the uniqueness problem overall. One configurable env parameter could be added to specify the machine_id, I guess.
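To make the proposal concrete, here is a minimal, hypothetical sketch of a Snowflake-style generator along the lines described above: a 41-bit millisecond timestamp, a 10-bit machine_id, and a 12-bit sequence. The class name, epoch, and bit layout are illustrative assumptions, not anything Snowplow ships; note also that the usual Snowflake fix for multiple IDs in the same instant is a per-millisecond sequence counter rather than sleeping:

```java
// Hypothetical Snowflake-style ID generator (illustrative only).
// Uniqueness holds as long as each machine_id is assigned to exactly
// one generator instance at a time.
public class SnowflakeIdGenerator {
    private static final long EPOCH = 1288834974657L; // custom epoch (Twitter's)
    private final long machineId;      // 0..1023, e.g. from an env variable
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SnowflakeIdGenerator(long machineId) {
        if (machineId < 0 || machineId > 1023)
            throw new IllegalArgumentException("machineId must fit in 10 bits");
        this.machineId = machineId;
    }

    public synchronized long nextId() {
        long ts = System.currentTimeMillis();
        if (ts == lastTimestamp) {
            sequence = (sequence + 1) & 0xFFF;  // 12-bit per-millisecond counter
            if (sequence == 0) {
                // sequence exhausted for this millisecond: wait for the next one
                while (ts <= lastTimestamp) ts = System.currentTimeMillis();
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = ts;
        // 41 bits of timestamp | 10 bits of machine id | 12 bits of sequence
        return ((ts - EPOCH) << 22) | (machineId << 12) | sequence;
    }
}
```

IDs produced this way are strictly increasing per machine, which can help with index locality, but as the maintainers note below it requires coordinating machine_id assignment across the fleet.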
I am not surprised that you see duplicate events in the DWH. But I think you are looking in the wrong place for the problem if you think it's because of our UUID generator.
If you find two events with the same event id, then interesting questions to look at next are:

- Do they have the same collector_tstamp? If yes, then this is probably a full duplicate copy of the same single event that was received by the collector.
- Do they have the same dvce_created_tstamp? If yes (but collector_tstamp different) then this is probably a duplicate copy of the same single event which a tracker sent multiple times to the collector, e.g. because of a network failure.

If you investigate further the duplicate IDs in your DWH I am sure you will find there are other explanations, unrelated to how we generate UUIDs.
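The two checks above could be sketched as a small decision helper. This is illustrative only: the Event record and its fields are assumptions that mirror Snowplow's atomic collector_tstamp and dvce_created_tstamp columns, not a real Snowplow type:

```java
// Illustrative only: classifies a pair of events that share an event_id,
// following the two timestamp questions described above.
public class DuplicateClassifier {
    // Hypothetical record mirroring the relevant atomic fields.
    record Event(String eventId, long collectorTstamp, long dvceCreatedTstamp) {}

    static String classify(Event a, Event b) {
        if (!a.eventId().equals(b.eventId()))
            return "not duplicates";            // different events entirely
        if (a.collectorTstamp() == b.collectorTstamp())
            return "full duplicate";            // same event copied after the collector
        if (a.dvceCreatedTstamp() == b.dvceCreatedTstamp())
            return "tracker re-send";           // tracker sent the event more than once
        return "needs further investigation";   // some other explanation
    }
}
```

Running the same comparison as a SQL self-join on event_id in the DWH would show which of these buckets dominates.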
To add to @istreeter's comments above: although it is possible to get event id duplicates, it is generally rare to see genuine collisions unless duplicates are being sent.
We are unlikely to introduce any technology (e.g., Twitter Snowflake) to produce truly globally unique IDs, as this is computationally expensive and we cannot rely on sources of server information (e.g., worker and shard numbers) originating from the client.
Project: snowplow/enricher/common
Version: master (latest)
Expected behavior: Generate a universally unique ID across the entire pipeline every time.
Actual behavior: Generates duplicate event_id values at the enrichment stage.
Steps to reproduce: Create an enriched event using the class com.snowplowanalytics.snowplow.enrich/common/enrichments/EnrichmentManager.scala and its method setupEnrichedEvent, and proceed with the case where an EnrichedEvent is returned.