segmentio / analytics-next

Segment Analytics.js 2.0
https://segment.com/docs/connections/sources/catalog/libraries/website/javascript
MIT License
403 stars 136 forks source link

[Bug] Analytics JS messageId generation not always unique (hash collision) #1028

Closed andrewloyola closed 2 months ago

andrewloyola commented 8 months ago

https://github.com/segmentio/analytics-next/blob/master/packages/browser/src/core/events/index.ts#L264

From our webapp, we track page views and events via segment, and in our warehouse destinations we've seen that unique messages sometimes have the same messageId, which makes it update the record at that messageId in our warehouse, instead of treating it as a separate event (which it is).

It looks like the related code in this repo is here https://github.com/segmentio/analytics-next/blob/master/packages/browser/src/core/events/index.ts#L264.

The hash() function usage here hashes the event AND the uuid together, which looks like it mostly nullifies the use of uuid() in this context by reducing the entropy, since the hash lends itself to collisions with the volume of events being put through this function over time, on the same pages that have largely the same event metadata.

Suggested fix is to only hash the message itself, and not the uuid. In general, this messageId field is documented in segement docs like it is supposed to be unique per message after 24 hours, but is not guaranteed to be unique based on the code implementation, so the warehouse behavior https://segment.com/docs/guides/duplicate-data/#warehouse-deduplication feels like it has a good chance of causing problems eventually the one reported here.

related PR: https://github.com/segmentio/analytics-next/pull/1027 (Optionally, omitting the hash completely, and just using the uuid by itself probably fulfills the uniqueness need, so that could be changed instead).

related reference for md5 collisions https://atsec-information-security.blogspot.com/2020/04/rise-fall-of-md5.html

silesky commented 8 months ago

@andrewloyola Thanks. This a good find. That's a big no-no.

Background: because of requirements around browser support, the simple uuid is not safe -- which is why hashing was added to begin with.

I think we're interested in ripping the band-aid off and removing the md5 dependency, and just doing: https://github.com/segmentio/analytics-next/blob/master/packages/node/src/lib/get-message-id.ts

(Appending the current epoch date should add safety. Our thinking is that the odds that any given hash collision would happen at the same time, down to the ms, is much less (We haven't had any complaints yet on node)).

andrewloyola commented 8 months ago

ah yes that makes sense, and glad there is already a pattern to do so. The epoch date solution should cover the cases we've seen at least. +1 support on band-aid ripping.