milesmcc / shynet

Modern, privacy-friendly, and detailed web analytics that works without cookies or JS.
Apache License 2.0
2.89k stars 183 forks

Option to drop fragments from URLs #215

Open haplo opened 2 years ago

haplo commented 2 years ago

I would like Shynet to have a configuration option that makes it drop fragments (i.e. the # and everything after it) from URLs. This way, URLs that really point to the same page will be grouped together in the stats.

For example, people link to my blog at https://blog.fidelramos.net/photography/photography-workflow#5-replication-with-syncthing, but I would like Shynet to treat that link as https://blog.fidelramos.net/photography/photography-workflow so all those hits are grouped together.
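To make the requested transformation concrete, here is a minimal sketch of fragment removal using Python's standard library. The helper name `drop_fragment` is hypothetical and not part of Shynet; only `urllib.parse.urldefrag` is an existing function.

```python
from urllib.parse import urldefrag


def drop_fragment(url: str) -> str:
    """Return the URL without its fragment (the '#' and everything after it).

    Hypothetical helper for illustration; not taken from Shynet's code.
    """
    # urldefrag splits the URL into (url_without_fragment, fragment).
    return urldefrag(url).url


original = "https://blog.fidelramos.net/photography/photography-workflow#5-replication-with-syncthing"
print(drop_fragment(original))
```

The query string, if any, is left untouched; only the fragment is removed.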

One thing I'm not sure about is whether the URL sanitization should happen at collection time or when calculating the stats. I'm not yet familiar enough with Shynet's internals to know which approach makes the most sense. On one hand, it would be better to collect raw data without alteration, but on the other, this might put too heavy a burden on the stats calculation.

Another big question is where to offer this option. As a per-site configuration checkbox? As filters in the dashboard?

I'm willing to code this but would like some discussion and agreement on how to execute it so the effort is productive.

milesmcc commented 2 years ago

Thanks for checking in before making the change.

I think this would probably make the most sense as a per-site configuration checkbox — not as filters in the dashboard. As a result, the processing would be done at ingest time, rather than at query time.
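A rough sketch of what ingest-time processing behind a per-site setting could look like. This is plain Python rather than Shynet's actual Django models; `SiteConfig`, `drop_fragments`, and `location_to_store` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from urllib.parse import urldefrag


@dataclass
class SiteConfig:
    # Hypothetical per-site option, standing in for the configuration checkbox.
    drop_fragments: bool = False


def location_to_store(site: SiteConfig, location: str) -> str:
    """Decide, at ingest time, which location string gets written to the database."""
    if site.drop_fragments:
        # The fragment is discarded before storage, so it never reaches the database.
        return urldefrag(location).url
    return location


url = "https://example.com/docs#section-2"
print(location_to_store(SiteConfig(drop_fragments=True), url))
print(location_to_store(SiteConfig(drop_fragments=False), url))
```

Because the decision is made once per hit at ingest, queries and dashboards keep working against a single stored location value.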

I agree that this is less than ideal, since we'd like to collect data that is as "raw" as possible. But perhaps we should also think of this as a security improvement. Some websites (wrongly!) include sensitive data inside the URL fragment (e.g., auth tokens). This change allows them to use Shynet without sensitive information hitting the database.

I say go ahead and create a PR. Thanks!

haplo commented 2 years ago

Thank you for the quick response @milesmcc. :raised_hands:

On one hand I appreciate that Shynet is simple, but on the other I'm concerned about losing the original data. What do you think about adding a Hit.raw_location field that is populated only when the location is parsed and filtered, like in this case? Overkill or worth it?
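For clarity, the proposal could be sketched roughly as follows. `HitRecord` and `build_hit` are hypothetical stand-ins in plain Python, not Shynet's real `Hit` model; only the `raw_location` name comes from the suggestion above.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urldefrag


@dataclass
class HitRecord:
    location: str
    # Only set when filtering actually changed the location.
    raw_location: Optional[str] = None


def build_hit(location: str) -> HitRecord:
    """Store the cleaned location, keeping the original only if it differs."""
    cleaned = urldefrag(location).url
    if cleaned == location:
        return HitRecord(location=location)
    return HitRecord(location=cleaned, raw_location=location)


print(build_hit("https://example.com/post#comments"))
print(build_hit("https://example.com/post"))
```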

milesmcc commented 2 years ago

I think it's maybe overkill. I also worry it might make things like querying and filtering more complicated (since there would be two table fields that end users might want to interact with). It would also remove any potential security or privacy benefit.