Closed ferristocrat closed 11 months ago
I think we should use eventkit for this.
From sprint planning: we can do this right now by using the URI that linksharing currently still sends on object downloads (it will stop sending it in the next release) for filename-based uniqueness (we can use name+size for better uniqueness). After the next release, we should probably switch to linksharing calculating the hash of the object being downloaded and sending it to eventkit.
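As a rough sketch of what name+size uniqueness could look like (the function and names below are hypothetical illustrations, not actual linksharing code), each download event could be reduced to a fixed-size key before reporting:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// downloadKey builds a uniqueness key from filename and size.
// Hypothetical sketch; the size is encoded as fixed-width bytes so
// that ("a", 12) and ("a1", 2) can't collide by concatenation.
func downloadKey(name string, size int64) [32]byte {
	buf := make([]byte, 0, len(name)+8)
	buf = append(buf, name...)
	var sz [8]byte
	binary.BigEndian.PutUint64(sz[:], uint64(size))
	buf = append(buf, sz[:]...)
	return sha256.Sum256(buf)
}

func main() {
	a := downloadKey("report.pdf", 1024)
	b := downloadKey("report.pdf", 1024)
	c := downloadKey("report.pdf", 2048)
	// same name+size collide; a different size yields a different key
	fmt.Println(a == b, a == c)
}
```

Counting distinct keys over a window would then give the unique-download estimate, with the usual caveat that different files sharing name and size are counted once.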
FYI - all uplink clients report a path checksum through eventkit, so we do have a unique, path-specific ID. To my knowledge, that isn't going away in the next release.
In terms of hashing the content of the data and then preserving it with eventkit, that feels like a pretty sizable privacy violation. I'm okay if we check hashes with virustotal for certain linksharing uses and then throw those hashes away, but I don't think we want to preserve the content hash for all possible accesses in something durable like BigQuery. We've been pretty careful not to keep that type of data anywhere.
If we do want to hash the content, I would recommend salting it in some way with the project salt, or the combination of the project salt, bucket name, and encrypted path.
The reason for this is that we don't want the same content across multiple user accounts to be comparable; otherwise, all someone would need to do to find out who has a specific file is upload it themselves and then breach the database that records who else has it.
However, if we don't have cross-project comparability (I don't think we should? or at least, not comparability that we preserve anywhere, even in analytics), then I'm not sure content-based hashes gain us much beyond what a path-based hash would do.
Is the path checksum sufficient?
> In terms of hashing the content of the data and then preserving it with eventkit, that feels like a pretty sizable privacy violation. I'm okay if we check hashes with virustotal for certain linksharing uses and then throw those hashes away, but I don't think we want to preserve the content hash for all possible accesses in something durable like BigQuery. We've been pretty careful not to keep that type of data anywhere.
I agree. I think the path checksum is sufficient for an estimation. I don't think we need the precise value.
BTW, I didn't know we had the path checksum (let's use it instead!). The thing that's going away in the next release is linksharing reporting URIs.
Goal
Acceptance Criteria
- How many unique files are downloaded in a given timeframe?
Links