Open amyxzhang opened 8 years ago
I think cutting off the # is the right thing to do for all sites that obey proper HTTP/HTML semantics: the fragment only identifies a place within the page. What comes after the ? is more complicated. A lot of it is tracking data unique to the user, so many different URLs all refer to the same page.
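A minimal sketch of the fragment-stripping step, assuming Python 3 and only the standard library. The function name `strip_fragment` is hypothetical and not taken from the eyebrowse-server code:

```python
from urllib.parse import urldefrag


def strip_fragment(url):
    """Return the URL with everything from the '#' onward removed."""
    # urldefrag splits the URL into (url_without_fragment, fragment).
    return urldefrag(url).url
```

Something like this could run either as a check when ingesting a visit or as a cron pass over existing rows. Note that it would also collapse single-page apps that route on the fragment, which is the "proper semantics" caveat above.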
On 02/24/2016 04:01 PM, Amy Zhang wrote:
I've noticed it for Medium and for BuzzFeed: we need to cut off anything after the # so that visits to the same page don't get counted as different pages. This needs to be part of cron or a check when ingesting visits.
There are other domains with other rules as well. I'll note them here once I see them.
— Reply to this email directly or view it on GitHub https://github.com/haystack/eyebrowse-server/issues/155.
The ? is hard, because sometimes it helps direct to a specific page. For example, the ACM Digital Library uses the part after the ? to specify which paper you're looking at, which is pretty important to differentiate: http://dl.acm.org/citation.cfm?id=309253
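One way to handle the ? without breaking sites like the ACM DL is to drop only parameters known to be tracking data and keep everything else. This is a sketch under assumptions: the key lists below are illustrative examples, not an agreed rule set, and `normalize_url` is a hypothetical name:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Illustrative denylist (an assumption); extend as new cases are noticed.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid"}


def normalize_url(url):
    """Drop the fragment and known tracking parameters, keep the rest."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_KEYS and not key.startswith(TRACKING_PREFIXES)
    ]
    # An empty string in the last slot removes the fragment.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), "")
    )
```

With a denylist, parameters like `id` on dl.acm.org pass through untouched. The trade-off is that unknown tracking parameters are kept until someone adds them, and `urlencode` may re-escape characters in the values it keeps.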