Closed andrewloyola closed 2 months ago
@andrewloyola Thanks. This is a good find. That's a big no-no.
Background: because of requirements around browser support, the simple uuid
is not safe -- which is why hashing was added to begin with.
I think we're interested in ripping the band-aid off and removing the md5
dependency, and just doing:
https://github.com/segmentio/analytics-next/blob/master/packages/node/src/lib/get-message-id.ts
(Appending the current epoch date should add safety. Our thinking is that the odds that any given hash collision would also happen at the same time, down to the millisecond, are much lower. We haven't had any complaints yet on node.)
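For illustration, a minimal TypeScript sketch of that approach (the linked node file is the authoritative version; the function name and format here are hypothetical, and node's built-in `crypto.randomUUID` stands in for whatever uuid helper the package actually uses):

```typescript
import { randomUUID } from 'crypto'

// Sketch: prepend the current epoch milliseconds to a raw uuid. Even if
// two uuids were ever to repeat, the resulting ids would only collide if
// they were also generated in the same millisecond.
function getMessageId(): string {
  return `${Date.now()}-${randomUUID()}`
}
```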
ah yes that makes sense, and glad there is already a pattern to do so. The epoch date solution should cover the cases we've seen at least. +1 support on band-aid ripping.
https://github.com/segmentio/analytics-next/blob/master/packages/browser/src/core/events/index.ts#L264
From our webapp, we track page views and events via Segment, and in our warehouse destinations we've seen that unique messages sometimes have the same `messageId`, which causes the record at that messageId to be updated in our warehouse instead of being treated as a separate event (which it is). It looks like the related code in this repo is here: https://github.com/segmentio/analytics-next/blob/master/packages/browser/src/core/events/index.ts#L264.
The `hash()` usage here hashes the event AND the uuid together, which mostly nullifies the use of `uuid()` in this context by reducing its entropy: the hash lends itself to collisions given the volume of events put through this function over time on pages whose event metadata is largely the same. The suggested fix is to hash only the message itself, not the uuid. In general, the Segment docs describe this messageId field as if it is unique per message (within the 24-hour deduplication window), but the code implementation does not guarantee uniqueness, so the warehouse deduplication behavior (https://segment.com/docs/guides/duplicate-data/#warehouse-deduplication) has a good chance of eventually causing problems like the one reported here.
Related PR: https://github.com/segmentio/analytics-next/pull/1027 (Optionally, omitting the hash entirely and using the uuid by itself would probably fulfill the uniqueness requirement, so that could be changed instead.)
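As a sketch of the suggested fix (illustrative only; `buildMessageId` and the `ajs-` prefix are hypothetical, not the library's actual API, and node's `crypto` SHA-256 stands in here for the browser bundle's hash helper): hash only the event payload and append the raw uuid, so the uuid's full entropy survives into the messageId instead of being funneled through the digest:

```typescript
import { createHash, randomUUID } from 'crypto'

// Hash only the event payload; keep the raw uuid outside the hash so its
// entropy is not reduced by a short digest that can collide.
function buildMessageId(event: object): string {
  const digest = createHash('sha256')
    .update(JSON.stringify(event))
    .digest('hex')
    .slice(0, 8) // short and purely informational; uniqueness comes from the uuid
  return `ajs-${digest}-${randomUUID()}`
}
```

With this shape, two identical events on the same page would share the digest segment but still get distinct ids from the uuid.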
Related reference on md5 collisions: https://atsec-information-security.blogspot.com/2020/04/rise-fall-of-md5.html