yalisassoon opened 8 years ago
Adding it to the fragment is problematic, since it may interfere with an existing fragment. For example, in the URL
https://github.com/snowplow/snowplow/wiki/Configuring%20the%20Clojure%20collector#enable-connection-draining-for-your-elastic-beanstalk-instance
any change to the fragment would prevent the browser from jumping to the `enable-connection-draining-for-your-elastic-beanstalk-instance` element.
Using the querystring is less likely to cause this sort of problem.
It's great to have this ticket back in the frame! A few observations on how Buzzfeed does it, as the articles on Pound are somewhat vague. If you visit
http://www.buzzfeed.com/catesish/help-am-i-going-insane-its-definitely-blue
then on landing on the page the URI is rewritten to include the hash, e.g. http://www.buzzfeed.com/catesish/help-am-i-going-insane-its-definitely-blue#.vaO5pjMGwM. Each visit gets a fresh hash, e.g. http://www.buzzfeed.com/catesish/help-am-i-going-insane-its-definitely-blue#.topV28ADPg.
Coming up with a hash which is densely packed enough (or can be associated with some other metadata e.g. IP address) to minimize collisions but brief enough to avoid truncation is an interesting challenge...
It looks like Buzzfeed are using 9 random characters, each of which can be a digit or an uppercase or lowercase letter. That gives 62 ^ 9, or about 1.4 * 10 ^ 16, possible strings, so by the birthday approximation you would expect about one random collision after generating roughly 170,000,000 (i.e. sqrt((62 ** 9) * 2)) such strings.
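The arithmetic above can be sketched as follows. This is only an illustration, not Buzzfeed's actual code: the function name is invented here, and `Math.random()` is used for brevity (it is not cryptographically random).

```javascript
// Alphabet of 62 characters: digits plus upper- and lowercase letters.
const BASE62 = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';

// Generate a 9-character Buzzfeed-style share hash (illustrative only).
function makeShareHash(length = 9) {
  let hash = '';
  for (let i = 0; i < length; i++) {
    hash += BASE62[Math.floor(Math.random() * BASE62.length)];
  }
  return hash;
}

// Birthday-style estimate of how many hashes can be generated before
// a random collision becomes likely.
const space = Math.pow(62, 9);               // ~1.35e16 possible strings
const collisionPoint = Math.sqrt(space * 2); // ~1.6e8, i.e. ~170 million

console.log(makeShareHash());
console.log(Math.round(collisionPoint / 1e6) + ' million draws before a likely collision');
```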
See Modifying a querystring without reloading the page.
Unfortunately this technique looks like it doesn't work for older browsers.
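As a sketch of that technique: the querystring can be rewritten in place via `history.replaceState` (part of the HTML5 History API, which is the piece missing in older browsers), leaving any existing fragment untouched. The function and parameter names (`addShareIdToQuerystring`, `_sid`) are invented here for illustration.

```javascript
// Add a share ID to a URL's querystring while preserving any existing
// fragment (e.g. #enable-connection-draining-for-your-elastic-beanstalk-instance).
function addShareIdToQuerystring(href, shareId) {
  const url = new URL(href);
  url.searchParams.set('_sid', shareId); // '_sid' is an illustrative parameter name
  return url.toString();
}

// In the browser (requires History API support):
// history.replaceState(null, '', addShareIdToQuerystring(location.href, 'topV28ADPg'));
```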
Of course - you can also do the join using the full URI, not just the hash, so there is plenty of entropy...
Here are a couple of articles from Buzzfeed on their solution and what they are able to do with the data collected:
http://www.buzzfeed.com/daozers/introducing-pound-process-for-optimizing-and-understanding-n#.arK1yq2by
http://www.slideshare.net/g33ktalk/dataengconf-the-science-of-virality-at-buzzfeed
I love the tracking hash in the first link!
It is worth noting that Buzzfeed used to have their tracking hash on all URLs, including their home page. Recently they made a change to only have it on their content pages.
The original issue was part of the Snowplow project, i.e. it was raised a very long time ago, before the JS tracker was a standalone project. The idea is to enable Snowplow users to track social shares the same way that sites like Buzzfeed do, by writing a unique ID into the URL at page view time and reading it back on subsequent page views.
The Snowplow pipeline has evolved significantly since the original issue was raised. My initial suggestion (but only a suggestion - let's iterate the approach in this ticket):

1. Write a `shareToUrlId` (the page view ID) into the URL, as a fragment or name/value pair on the querystring
2. Fire an `addShareIdToUrl` event with the page view ID
3. On a subsequent page view, fire a `foundShareIdOnUrl` event with the relevant ID captured from the URL

In addition we would then have a separate enrichment process that fetches the ID from the URL and loads it into a derived context.
Finally we'd have a step that ran as part of the data modeling that built the sharing graph.
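The tracker-side steps above might look something like this. A sketch only: the event names follow the suggestion in this thread, while the `_sid` querystring parameter, the function names, and the `track` callback are all invented here for illustration.

```javascript
// Read an inbound share ID from the URL, if present.
function readShareIdFromUrl(href) {
  return new URL(href).searchParams.get('_sid'); // null if absent
}

// Sketch of the tracker flow on page view: detect an inbound share ID,
// then stamp this page view's own ID onto the URL for subsequent shares.
function onPageView(href, pageViewId, track) {
  const inboundId = readShareIdFromUrl(href);
  if (inboundId !== null && inboundId !== pageViewId) {
    // The visitor arrived via a shared URL: link this view to the sharer's view.
    track('foundShareIdOnUrl', { shareId: inboundId });
  }
  const url = new URL(href);
  url.searchParams.set('_sid', pageViewId);
  track('addShareIdToUrl', { shareId: pageViewId });
  return url.toString(); // caller would pass this to history.replaceState
}
```

The enrichment and data-modeling steps would then join `foundShareIdOnUrl` events back to the page views whose IDs they carry, building the sharing graph.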
Questions / issues:
Do we want `addShareIdToUrl` and `readShareIdFromUrl` as discrete events? I think this is a good idea, because the alternative - pulling out the ID as an enrichment and then inferring this as part of the data modeling step - is more fragile: you assume that the ID is appended where it matches the page view ID, and is a shared URL otherwise. But maybe that's OK?

cc @fblundun @alexanderdean @richardfergie @kingo55 @msmallcombe