yalisassoon opened 8 years ago
Adding it to the fragment is problematic, since it may interfere with an existing fragment. For example, in the URL
https://github.com/snowplow/snowplow/wiki/Configuring%20the%20Clojure%20collector#enable-connection-draining-for-your-elastic-beanstalk-instance
any change to the fragment would prevent the browser from jumping to the `enable-connection-draining-for-your-elastic-beanstalk-instance` element.
Using the querystring is less likely to cause this sort of problem.
It's great to have this ticket back in the frame! A few observations on how Buzzfeed does it, as the articles on Pound are somewhat vague. If you visit
http://www.buzzfeed.com/catesish/help-am-i-going-insane-its-definitely-blue
then on landing on the page the URI is rewritten to include the hash, e.g. http://www.buzzfeed.com/catesish/help-am-i-going-insane-its-definitely-blue#.vaO5pjMGwM. Each visit gets a fresh hash, e.g. http://www.buzzfeed.com/catesish/help-am-i-going-insane-its-definitely-blue#.topV28ADPg.
Coming up with a hash which is densely packed enough (or can be associated with some other metadata e.g. IP address) to minimize collisions but brief enough to avoid truncation is an interesting challenge...
It looks like Buzzfeed are using 9 random characters, each of which can be a digit or an uppercase or lowercase letter. That gives 62 ^ 9, or about 1.4 * 10 ^ 16, possible strings, so by the birthday approximation you would expect about one random collision after generating roughly 170,000,000 (i.e. sqrt((62 ** 9) * 2)) such strings.
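The arithmetic above can be sketched as follows. This is only an illustration, not Buzzfeed's actual code: the function name is invented here, and `Math.random()` is used for brevity (it is not cryptographically random).

```javascript
// Alphabet of 62 characters: digits plus upper- and lowercase letters.
const BASE62 = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';

// Generate a 9-character Buzzfeed-style share hash (illustrative only).
function makeShareHash(length = 9) {
  let hash = '';
  for (let i = 0; i < length; i++) {
    hash += BASE62[Math.floor(Math.random() * BASE62.length)];
  }
  return hash;
}

// Birthday-style estimate of how many hashes can be generated before
// a random collision becomes likely.
const space = Math.pow(62, 9);               // ~1.35e16 possible strings
const collisionPoint = Math.sqrt(space * 2); // ~1.6e8, i.e. ~170 million

console.log(makeShareHash());
console.log(Math.round(collisionPoint / 1e6) + ' million draws before a likely collision');
```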
See Modifying a querystring without reloading the page.
Unfortunately this technique looks like it doesn't work for older browsers.
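As a sketch of that technique: the querystring can be rewritten in place via `history.replaceState` (part of the HTML5 History API, which is the piece missing in older browsers), leaving any existing fragment untouched. The function and parameter names (`addShareIdToQuerystring`, `_sid`) are invented here for illustration.

```javascript
// Add a share ID to a URL's querystring while preserving any existing
// fragment (e.g. #enable-connection-draining-for-your-elastic-beanstalk-instance).
function addShareIdToQuerystring(href, shareId) {
  const url = new URL(href);
  url.searchParams.set('_sid', shareId); // '_sid' is an illustrative parameter name
  return url.toString();
}

// In the browser (requires History API support):
// history.replaceState(null, '', addShareIdToQuerystring(location.href, 'topV28ADPg'));
```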
Of course - you can also do the join using the full URI, not just the hash, so there is plenty of entropy...
Here are a couple of articles from Buzzfeed on their solution and what they are able to do with the data collected:
http://www.buzzfeed.com/daozers/introducing-pound-process-for-optimizing-and-understanding-n#.arK1yq2by
http://www.slideshare.net/g33ktalk/dataengconf-the-science-of-virality-at-buzzfeed
I love the tracking hash in the first link!
It is worth noting that Buzzfeed used to have their tracking hash on all URLs, including their home page. Recently they made a change to only have it on their content pages.
The original issue was part of the Snowplow project, i.e. it was raised a very long time ago, before the JS tracker was a standalone project. The idea is to enable Snowplow users to track social shares the same way that sites like Buzzfeed do, by writing a unique ID into the URL at page view time and reading it back on subsequent page views.
The Snowplow pipeline has evolved significantly since the original issue was raised. My initial suggestion (but only a suggestion - let's iterate the approach in this ticket):

1. Write a `shareToUrlId` (the page view ID) into the URL, as a fragment or name/value pair on the querystring
2. Fire an `addShareIdToUrl` event with the page view ID
3. On a subsequent page view, fire a `foundShareIdOnUrl` event with the relevant ID captured from the URL

In addition we would then have a separate enrichment process that fetches the ID from the URL and loads it into a derived context.
Finally we'd have a step that ran as part of the data modeling that built the sharing graph.
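The tracker-side steps above might look something like this. A sketch only: the event names follow the suggestion in this thread, while the `_sid` querystring parameter, the function names, and the `track` callback are all invented here for illustration.

```javascript
// Read an inbound share ID from the URL, if present.
function readShareIdFromUrl(href) {
  return new URL(href).searchParams.get('_sid'); // null if absent
}

// Sketch of the tracker flow on page view: detect an inbound share ID,
// then stamp this page view's own ID onto the URL for subsequent shares.
function onPageView(href, pageViewId, track) {
  const inboundId = readShareIdFromUrl(href);
  if (inboundId !== null && inboundId !== pageViewId) {
    // The visitor arrived via a shared URL: link this view to the sharer's view.
    track('foundShareIdOnUrl', { shareId: inboundId });
  }
  const url = new URL(href);
  url.searchParams.set('_sid', pageViewId);
  track('addShareIdToUrl', { shareId: pageViewId });
  return url.toString(); // caller would pass this to history.replaceState
}
```

The enrichment and data-modeling steps would then join `foundShareIdOnUrl` events back to the page views whose IDs they carry, building the sharing graph.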
Questions / issues:
Do we want `addShareIdToUrl` and `readShareIdFromUrl` as discrete events? I think this is a good idea, because the alternative - pulling out the ID as an enrichment and then inferring this as part of the data modeling step - is more fragile: you assume that the ID is appended where it matches the page view ID, and is a shared URL otherwise. But maybe that's OK?

cc @fblundun @alexanderdean @richardfergie @kingo55 @msmallcombe