snowplow / snowplow-javascript-tracker

Snowplow event tracker for client-side and server-side JavaScript. Add analytics to your websites, web apps and servers.
http://snowplowanalytics.com
BSD 3-Clause "New" or "Revised" License
541 stars 220 forks source link

Replace Math.random to prevent duplicate event IDs #499

Open christoph-buente opened 8 years ago

christoph-buente commented 8 years ago

According to the definition of a V4 UUID, event_id is supposed to be unique for every event. Unfortunately we see the same event_id more than once in our collected data.

Our example data set has 13,406,731 events from which 13,235,475 have unique event_ids. This means that there are about 1.3% events that have duplicated event_id. These events may come in completely different times and from different ip's. As an example, there are 15 different events with event_id = "f2d4f27d-077a-4086-8cce-5789b10fbfda". See the graphics below.

pasted image at 2016_06_24 12_10 pm

We even have a larger data set with more recent JS tracker versions (2.5.x and 2.6.x), and the duplicate ratio is even worse ~= 1.8%

alexanderdean commented 8 years ago

Are you running the event fingerprint enrichment? Could you reshare the above table with the event fingerprint output as well? Trying to understand what type of duplicates these are.

christoph-buente commented 8 years ago

For this dataset, which is a bit older, we did not have that enrichment enabled. (If it even was available back then.) I'll check for the current SP stack, if we have it running.

But hashing the whole event does not make a difference. If time, IP address and user agent are different. It will most definitively result in a different hash!

alexanderdean commented 8 years ago

Thanks @christoph-buente !

christoph-buente commented 8 years ago

Looking at the user agent i came to the suspicion that the actual implementation of the uuid algorithm falls back to Math.random() in certain browsers. Can this be confirmed? That would be a bug in the underlying uuid npm module, right?

bogaert commented 8 years ago

Hi @christoph-buente. This is an issue we have written about before: http://discourse.snowplowanalytics.com/t/de-deduplicating-events-in-hadoop-and-redshift-tutorial/248 (in case the solutions outlined there are useful).

christoph-buente commented 8 years ago

@bogaert thx Chris, i read that article. But i hoped there is a chance to fix that issue instead of working around it later down the data pipeline.

bogaert commented 8 years ago

@christoph-buente That makes sense. We'll look into it (cc @jbeemster).

christoph-buente commented 7 years ago

Bump 😄

bogaert commented 7 years ago

Hi @christoph-buente. It was good meeting in Berlin! This is something we'll keep investigating, but in the meantime we're also working on an improved deduplication step in the pipeline: https://github.com/snowplow/snowplow/issues?q=is%3Aopen+is%3Aissue+milestone%3A%22R8x+%5BHAD%5D+Synthetic+dedupe%22

alexanderdean commented 7 years ago

See https://github.com/snowplow/snowplow/issues/2967

alexanderdean commented 7 years ago

From @falschparker82:

Hi Snowplowers,

we've got heavy problems here due to some bots (especially Googlebot, googleweblight and others) apparently using a broken version of Math.random(), leading to duplicated event_ids. This gets especially annoying with the resulting huge cartesian joins resulting from significant context use.

I could track down the problem down into the uuid library of the implementation that the Javascript collector uses: https://github.com/kelektiv/node-uuid/blob/master/lib/rng-browser.js#L17 - which happens to fall eventually back down to Math.random() if no better randomness sources are available.

Unfortunately, the implementation of Math.random() is left to the JS interpreter, which is broken on several systems - sources:

http://stackoverflow.com/a/24224089/1281376 https://medium.com/@betable/tifu-by-using-math-random-f1c308c4fd9d#.keeelkt8v

I'd like to propose the following solution for this problem:

1) Replace Math.random() with a seedable Mersenne twister: https://github.com/pigulla/mersennetwister (probably needs upstream patching of npm uuid package). While this is not suitable for cryptographic uses, all that's needed is good entropy here to prevent event ID duplication. Also, the cycle length of >2^19000 far surpasses the capacity of a UUIDv4 (2^128).

1a) Make it possible to seed the RNG by invoking sp.js with an additional parameter &rnd_seed=a4b120f48... (if the page isn't cached or served via CDN, this should be possible rather easily with templating from multiple languages) - which then gets inserted by a seed vector into the twister.

1b) Alternatively - or as a fallback to 1a) - generate the seed vector from the following entropy sources:

Thoughts?

Right now I've got a little too much on my plate, so feel free to grab it - but if we happen to do ourselves eventually it we'd love to upstream in case you're interested.

alexanderdean commented 7 years ago

Scheduling into 2.8.0 as this is important

falschparker82 commented 7 years ago

re-hi!

(was a little confused at first since the issue link was cross-project, so #499 ended up linking to the wrong issue on snowplow/snowplow).

As said, so much for the thoughts, can't do it myself right now due to lack of time unfortunately. Just a quick clarification regarding:

I like 1b) - I have nothing against 1a) but I don't know of many (any?) users who don't serve sp.js via CDN...

I was actually thinking about templating the tag itself when rendering a web page, not the Javascript tracker. So more something along the lines of (thinking in PHP here for simplicity / demonstration purposes):

<script type="text/javascript">
    ;(function(p,l,o,w,i,n,g,r){if(!p[i]){p.GlobalSnowplowNamespace=p.GlobalSnowplowNamespace||[];
    p.GlobalSnowplowNamespace.push(i);p[i]=function(){(p[i].q=p[i].q||[]).push(arguments)
    };p[i].q=p[i].q||[];n=l.createElement(o);g=l.getElementsByTagName(o)[0];n.async=1;
    n.src=w;g.parentNode.insertBefore(n,g)}}(window,document,"script","https://static.justwatch.com/static/compile_jw/assets/sp-2.6.2.js","snowplow","<?php uniqid(); ?>"));
    </script>
falschparker82 commented 7 years ago

Addendum: I'd be happy to review though - a few implementation hints:

falschparker82 commented 7 years ago

Any update on this one? Can I help?

alexanderdean commented 7 years ago

Hey @falschparker82 - we pushed it back because we are assigning a new (to Snowplow trackers) engineer to the 2.8.0 release and this is a complex ticket. To bring it back into 2.8.0 again, I think we would need a PR from you...

falschparker82 commented 7 years ago

Unfortunately, we're super low on Frontend power right now and JavaScript isn't exactly my hometurf. I'll see if we can make a slot for that somewhere in the next weeks but no high hopes right now. Will touch base in a month at the latest.

christoph-buente commented 6 years ago

@alexanderdean do you have capacity now, to tackle that issue? It did not dissapear, unfortunately. 😭

alexanderdean commented 6 years ago

Potentially @christoph-buente - we should know within the next month how much bandwidth we have for JS Tracker work...

g13013 commented 6 years ago

Much needed, any news on this issue?

anagytherealone commented 6 years ago

For the record I've put a comment on the uuid package this repo uses to be able to create more secure id-s: kelektiv/node-uuid#173

alexanderdean commented 6 years ago

Thanks @anagytherealone !

smalory commented 4 years ago

Is there any more news on this issue? What's the plan? How are things progressing?

We've noticed the same thing- clashes in snowplow-generated ids. We've also seen clashes across ids: e.g. event_ids clashing with session_ids.

There seems to be a general feeling here that with a good enough random number generator you'll get an essentially unique UUID. While this is true when using a single stream of random numbers it is not true when you have several streams of random numbers not sharing state. Unfortunately, since there are many, many trackers running in parallel, any guarantees on the uniqueness of the ids falls over and is at the whim of how the seed is determined.

Bloating the codebase with a more sophisticated random number generator is overkill. Random number generators like the Mersenne Prime Twister are designed to be robust to a whole suite of statistical tests; the goal being that the generated random numbers will be used in simulations. For UUID, all that's needed is a long periodicity; even a basic random number generator would suffice.

In my opinion, the correct approach is to simply hash a whole bunch of context relating to the event, user, session etc. and also inject some randomness through a random number generator. Then, the only way the ids would be is if the random number clashed and the entirety of the context clashed. The chances of this are essentially 0.

paulboocock commented 4 years ago

Thanks for your comments @smalory This is certainly on our radar still. We have a considerable release on the horizon in regards to v3. We're currently in the process of triaging open issues and identifying what we want to think about changing for v3 and I think this issue might make the cut in some shape or form. Keep your eye on it and our milestones as we will be chopping and changing the milestones over the coming weeks as decide what our near term roadmap starts looking like.

smalory commented 4 years ago

Great; thanks for the info Paul!

patmmccann commented 4 years ago

I am investigating a similar issue. Has anyone tested if crypto.getRandomValues() works better for googleweblight or the googlebot? We also see an issue when the user agent is simply 'Mozilla 5.0' or contains adsbot

paulboocock commented 3 years ago

Time for an update on this one given v3 is now available. Unfortunately little has changed at the moment, the issue arises due to falling back to Math.random() when crypto.getRandomValues() isn't available. As we continued to support older browsers with v3, this fallback has stayed in place. Until we drop support for IE9 and IE10 then we can't do too much to fix this (unless anyone has any ideas that don't add too much bloat?).

As for dropping IE9 and IE10, they have a depressingly high level of events (particularly IE9) in the data we assessed. I'd hope we can drop them when v3 is mature and move to v4 - those that want IE9 and IE10 support can then stay on the mature v3.

For the time being, deduplicating the events downstream is the best solution for this (as described earlier in this issue).

schmkr commented 1 year ago

Hi @paulboocock,

In light of your last comment from almost 1.5 years ago, is there nowadays still a need/requirement to support IE9 and IE10 (and even IE11)?

Are there already plans for the mentioned v4?

Alternatively, would it be an idea to offer the "older browsers" support as a separate NPM package that can be installed like the plugins when people have a need to support those older browsers?

AlexBenny commented 1 year ago

Hi @schmkr, thanks for raising this question.

Are there already plans for the mentioned v4?

Yes. We have plans for the v4. Indicatively we plan to release the v4 before the end of the year. That version will drop the support to IE10. Also we will take care to reduce the issues related to Math.random.

davidher-mann commented 1 year ago

Hi @AlexBenny, since Version 3.9 was just released: are you still sticking to the plan that the Math.random issue will be addressed in Version 4.0 ?